Choosing some random records from a CSV report in Python

Here’s an example of a quick-n-dirty data extraction performed in Python. I have a report that is approximately 4000 lines of information. In each line, field 5 contains a student name and field 11 contains a student email address (counting fields from zero). Student emails and names may be repeated in the original file. The goal was to produce a list of 30 randomly chosen students from the original file, with nobody listed twice, in this format:

name <emailaddress>
name2 <emailaddress2>
...
name30 <emailaddress30>

Here’s one way to accomplish this in Python.

This is very basic, and I didn’t even bother writing it so that it could be run in “discovery” mode (figuring out which fields contain what data) and “operation” mode. If you want a numbered list of each “field” in the file, uncomment the line that says get_fields() and comment out the remainder of the program. Anyway, I’m throwing this up as an example of very quick coding done to solve a particular problem. Think of it as a replacement for sh or awk coding.
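
For what it’s worth, the two modes could be selected with a command-line flag instead of commenting code in and out. What follows is only a sketch, written for Python 3, and not part of the original script: the names list_fields, parse_args and main, and the flags --fields, --name-field and --email-field, are all made up for illustration. Operation mode here just prints every row; the random selection is left out to keep it short.

```python
import argparse
import csv
import io


def list_fields(reader):
    """Discovery mode: return (index, value) pairs for the next row."""
    return list(enumerate(next(reader)))


def parse_args(argv=None):
    """Parse the (hypothetical) command-line options."""
    parser = argparse.ArgumentParser(description="Extract students from a CSV report")
    parser.add_argument("csvfile", help="path to the CSV report")
    parser.add_argument("--fields", action="store_true",
                        help="discovery mode: print numbered fields and exit")
    parser.add_argument("--name-field", type=int, default=5)
    parser.add_argument("--email-field", type=int, default=11)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    with open(args.csvfile, newline="") as handle:
        reader = csv.reader(handle)
        if args.fields:
            # Discovery mode: show what each column holds, then stop.
            for index, value in list_fields(reader):
                print(index, value)
        else:
            # Operation mode: skip the header, then process the data rows.
            next(reader)
            for row in reader:
                print("%s <%s>" % (row[args.name_field], row[args.email_field]))


# Quick check of discovery mode against an in-memory sample (made-up data).
sample = io.StringIO("id,name,email\n1,Ann Lee,ann@example.edu\n")
print(list_fields(csv.reader(sample)))
# prints [(0, 'id'), (1, 'name'), (2, 'email')]
```

With something like this, main(["report.csv", "--fields"]) would list the columns, and leaving off --fields would run the extraction.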

#!/usr/bin/python

import csv
import random

# Name is field 5
# email is field 11

FILE = "/Users/bchapman/downloads/EU_SR_LAW_CLASS_ROLLS_W_ID.csv"

# create a csv reader object from the file
mycsv = csv.reader(open(FILE))


def get_fields(csvfile=mycsv):
    """Run this to find out what fields are in the csv object"""
    # read one row and print each field with its index
    line = next(csvfile)
    for counter, item in enumerate(line):
        print("%d %s" % (counter, item))


def get_emails(csvfile=mycsv, n=5, e=11):
    """Parse csv file and return unique list of dicts containing fields we want"""
    students = []
    name_field = n
    email_field = e
    for row in csvfile:
        # assumes that we've already thrown away the header line
        student = {}
        email, name = row[email_field], row[name_field]
        # strip commas out of the name
        name = name.replace(",", "")
        student["email"] = email
        student["name"] = name
        # Make sure we're only adding unique students
        # BTW, this is another example of Python's awesomeness
        if student not in students:
            students.append(student)
    return students


# Commented out because we already know what fields we want
#get_fields()

# Now we discard header row
next(mycsv)

biglist = get_emails()
# Let's pick 30 at random
for i in range(30):
    chosen = random.choice(biglist)
    # format as a list that can be added into an email
    print("%s <%s>" % (chosen["name"], chosen["email"]))
    # remove each chosen name from the big list so we don't
    # get them twice.
    biglist.remove(chosen)
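
The pick-and-remove loop above works, but the standard library’s random.sample does the same “choose without repeats” job in one call. Here’s a rough sketch of that variation for Python 3, not the code I actually ran: unique_students and pick_students are names invented for this example, students are stored as (name, email) tuples so a set can track which ones have been seen, and the sample data is made up.

```python
import csv
import io
import random


def unique_students(rows, name_field=5, email_field=11):
    """Return unique (name, email) tuples in first-seen order."""
    seen = set()
    students = []
    for row in rows:
        student = (row[name_field].replace(",", ""), row[email_field])
        if student not in seen:
            seen.add(student)
            students.append(student)
    return students


def pick_students(students, count=30, rng=random):
    """Choose up to `count` distinct students without replacement."""
    # min() keeps this from failing when there are fewer students than requested
    return rng.sample(students, min(count, len(students)))


# Made-up sample data; name and email sit in columns 0 and 1 here.
sample = io.StringIO(
    "name,email\n"
    '"Lee, Ann",ann@example.edu\n'
    "Bo Diaz,bo@example.edu\n"
    '"Lee, Ann",ann@example.edu\n'
)
reader = csv.reader(sample)
next(reader)  # discard the header row
students = unique_students(reader, name_field=0, email_field=1)
for name, email in pick_students(students, count=2):
    print("%s <%s>" % (name, email))
```

Checking membership in a set avoids rescanning the whole list for every row, though for a report of about 4000 lines the difference hardly matters. The min() guard also covers the case where the file holds fewer than 30 unique students, which would make random.choice raise an IndexError once the list ran empty.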
