One of my co-workers brought me an interesting problem. He had downloaded a page of data from Martindale-Hubbell about a local law firm and was looking for a fairly efficient way to convert that data into something that could be loaded into a database. Fortunately, he found a paid solution involving a commercially prepared DVD before I spent too much time on the project.
However, I did want to document the kinds of steps that you might take to begin to address a problem like this. Here’s a chunk of the raw data:
ALSTON & BIRD LLP One Atlantic Center 1201 West Peachtree Street Atlanta, Georgia 30309-3424 (DeKalb & Fulton Counties) Telephone: XXX-XXX-XXXX Telecopier: XXX-XXX-XXXX (MAIN OFFICE) PEER REVIEW RATING: AV PRACTICE: Administrative Law, Alternative Dispute Resolution, Antitrust and Trade Regulation... FIRM-PROFILE: Year Established: 1893 FIRM-ALLIANCES: LEX MUNDI PERSONNEL: Members of Firm: Randall [lots more text snipped]
So, how would you begin to tackle this, even on a very basic level? One of the things that jumps out is that there are “fields” that start with a newline, some characters, and end with a colon (“:”) character, for example, “PRACTICE:”, which is then followed by a significant amount of text regarding areas in which the firm practices.
So, to start to play with this problem, you could fire up the interactive Python interpreter; I use ipython, but you could use the standard Python shell as well. It’s pretty clear that you’ll need the “re” regular expression module, so import that first and then load up the data that you are going to play with:
import re

mydata = open("/Users/bchapman/Desktop/alstonandbird2.txt").read()
The variable “mydata” now contains a very long string (confirmed by checking its length) holding the entire contents of the supplied text file. Note that Python is perfectly happy to have newline characters within a string, and for purposes of this project, one long string, which we’re later going to split up, is exactly what we want.
The next task is to decide how to split up the text. We’ll take a very crude first-cut approach. We’re going to search for sequences of text that start with “\n” (a newline), are followed by at least one word character, and end with a “:”. Here’s a regular expression that does that:
regex = re.compile(r"(\n\w.*?:)")
The only odd thing about this regex is that it is “non-greedy”, that is, it only goes until it finds the shortest possible match on the line. This is indicated by the “?” following the “*”, and it helps us avoid issues with lines like this:
FIRM-PROFILE: Year Established: 1893
We only want to match “FIRM-PROFILE:”, not “FIRM-PROFILE: Year Established:”. Finally, the whole regex is enclosed in parentheses so that the contents of each match are “saved” for the next step. This leads to another useful Python feature: we can easily split the string into a list based on the matched portions. A string that looks like this:
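To see the difference, here’s a quick comparison, run against a made-up line in the same shape as the data above:

```python
import re

line = "\nFIRM-PROFILE: Year Established: 1893"

greedy = re.search(r"(\n\w.*:)", line)   # greedy: runs to the last ":"
lazy = re.search(r"(\n\w.*?:)", line)    # non-greedy: stops at the first ":"

print(repr(greedy.group(1)))  # '\nFIRM-PROFILE: Year Established:'
print(repr(lazy.group(1)))    # '\nFIRM-PROFILE:'
```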
"\nPROFILE: some text here\nADDRESS: some more text here\n"
can be turned into this:
["", "\nPROFILE:", " some text here", "\nADDRESS:", " some more text here\n"]
with one line of code:
rawdata = re.split(regex, mydata)
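Running the toy string above through re.split confirms the behavior (note the empty-string element that re.split produces when the text begins with a match):

```python
import re

regex = re.compile(r"(\n\w.*?:)")
toy = "\nPROFILE: some text here\nADDRESS: some more text here\n"

rawdata = re.split(regex, toy)
print(rawdata)
# ['', '\nPROFILE:', ' some text here', '\nADDRESS:', ' some more text here\n']
```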
This takes the compiled regular expression named “regex” and splits the string mydata into the list “rawdata”, preserving the ‘field’ names in the output list. What’s even more convenient is that the data doesn’t have to start with a field name: anything before the first match becomes the first element of the list, which perfectly fits the data provided. At this point, we have a list with a lot of elements, and the first element has no field name. We’ll call it “preamble” and split it off from our fielded data:
preamble = rawdata[0]
rawdata = rawdata[1:]
With the above, we’ve chopped off the unkeyed preamble. Now rawdata is a series of key-value pairs, like “\nPROFILE:”, “Civil litigation …”. That looks a lot like it should be stored in a Python dictionary, or associative array. Here’s one way to get there that shows off a feature I don’t use very often. It’s presented as a complete Python function.
def dictmaker(mylist):
    '''Convert a list to a dict and return the dict'''
    keys = mylist[0::2]
    values = mylist[1::2]
    mytuples = zip(keys, values)
    mydict = dict(mytuples)
    return mydict
We know that the “keys” in the list are every other element, starting with the first element (mylist[0]), and that the “values” are every other element, starting with the second element (mylist[1]). Python slices take a start and end value, but they also take a “step” value, so the notation mylist[0::2] does exactly what we need to extract a list of all of the keys. Values are similarly extracted; we just start the count at 1. Once we have the two lists, we can zip them into pairs and convert the result to a dictionary using dict. Incidentally, the keys are somewhat awkward looking: “\nPERSONNEL:”. You may want to clean them up, perhaps by doing something like:
p = p[1:-1]

which trims the first and last character from the string p, leaving you with “PERSONNEL”, which looks better.
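Putting the pieces together, here’s a sketch of the whole flow, using a short made-up snippet in place of the real downloaded file (the sample text and field contents are invented for illustration):

```python
import re

# Made-up stand-in for the downloaded listing.
mydata = ("ALSTON & BIRD LLP One Atlantic Center"
          "\nPRACTICE: Administrative Law, Antitrust"
          "\nFIRM-PROFILE: Year Established: 1893")

regex = re.compile(r"(\n\w.*?:)")

rawdata = re.split(regex, mydata)
preamble = rawdata[0]   # text before the first field name
rawdata = rawdata[1:]   # alternating field names and values

def dictmaker(mylist):
    '''Convert a flat [key, value, key, value, ...] list to a dict.'''
    return dict(zip(mylist[0::2], mylist[1::2]))

# Trim the leading "\n" and trailing ":" from each key,
# and strip stray whitespace from each value.
cleaned = {k[1:-1]: v.strip() for k, v in dictmaker(rawdata).items()}

print(preamble)             # ALSTON & BIRD LLP One Atlantic Center
print(cleaned["PRACTICE"])  # Administrative Law, Antitrust
```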
After that, well, you’re on your own… Oh, and for “real” approaches to text-processing in Python, I strongly recommend David Mertz’s book “Text Processing in Python.” While it’s available online, I actually bought a copy. It’s helpful with all sorts of things, including a fairly painless introduction to state machines in Python.
Questions or comments? Let me know.