Quickie columnar data munger

Often we’re testing things with “real” data, e.g., applicant data or similar. Because of privacy concerns, it’s useful to obscure the data in some fashion before working with it. Here’s a very basic way to do that using Python.

The goal is to take a tab-delimited file that has stuff like this:

Ben Chapman 155 Arkansas

and return something like this:

Zjq Haixhexz 155 Qzwdrxx

using a Python script.
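
To make the transformation concrete, here is a small Python 3 sketch of what ROT13 (the encoding the script below uses) does to that sample row. The sample output above is only illustrative; real ROT13 output looks different. The four-column split of the row and the choice of columns 0, 1 and 3 are my assumptions, not something the sample data guarantees.

```python
import codecs


def rot13(text):
    # "rot_13" is a text-to-text codec in the Python 3 standard library
    return codecs.encode(text, "rot_13")


# Assumed layout: first name, last name, score, state
row = ["Ben", "Chapman", "155", "Arkansas"]
columns = [0, 1, 3]  # columns to obscure; the first column is column 0

munged = [rot13(value) if i in columns else value
          for i, value in enumerate(row)]
print("\t".join(munged))  # Ora, Punczna, 155, Nexnafnf (tab-separated)
```

ROT13 only touches the letters A-Z and a-z, so digits such as 155 would pass through unchanged even if that column were selected.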

Here’s a quickie version of one such script. I’ve added in some very basic file-handling and column selection options, so perhaps this would be helpful to other beginning Python programmers.

#!/usr/bin/env python
##############################
# Quickie python script to read tab-delimited data
# and export munged version of
# same for using as input to demo data analysis program.
#
# Uses ROT13 to 'encode' the data.
# Substitute a better encoding function if you'd like
# Maybe something with random.randrange(52) mapped to UCase/LCase letters or similar.
##############################

import sys
import csv
import string

# Better ROT13 from James Bennett of Django fame:
# http://www.evanfosmark.com/2009/01/rot13-in-python-30/
# That post is about Python 3.0; under Python 2 we use string.maketrans instead.

rot13_trans = string.maketrans(
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
    'NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm')

def rot13(text):
    return text.translate(rot13_trans)

def check_args(arglist):
    '''Test arg list and provide usage information.'''
    if len(arglist[1:]) != 3:
        print "Munge selected columns of input file and return output file."
        print "Columns must be separated by commas (no spaces)."
        print "The first column of data is column 0."
        print
        print 'Usage: %s 1,2,3 INFILE OUTFILE' % arglist[0]
        print
        # exit with a non-zero status so a calling shell can detect the usage error
        sys.exit(1)
    columns = arglist[1].split(',')
    return columns, arglist[2], arglist[3]

columns, infile, outfile = check_args(sys.argv)
allrows = open(infile, "rb")
rows_tsv = csv.reader(allrows, dialect="excel-tab")
# if you need to discard header, uncomment
# rows_tsv.next()
# put them in a list
rowlist = list(rows_tsv)
# everything is in memory now, so the input file can be closed
allrows.close()

# Now we have a list of un-obscured rows.
# Start munging after the first row to preserve the column headers.
# Change rowlist[1:] to rowlist if you discarded the header row above.
for row in rowlist[1:]:
    for item in columns:
        # cast to integer
        index = int(item)
        # obscure them
        row[index] = rot13(row[index])

outputfile = open(outfile, "wb")
out_csv = csv.writer(outputfile, dialect="excel-tab")
out_csv.writerows(rowlist)
# close explicitly so everything is flushed to disk
outputfile.close()

print "Done!"
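
The header comment in the script suggests swapping in a different encoding function, perhaps one built on random letters. Here is one rough way that could look, written for Python 3 rather than the Python 2 of the script. The name scramble and the optional rng argument are made up for this sketch.

```python
import random
import string


def scramble(text, rng=random):
    """Replace each ASCII letter with a random letter of the same case."""
    out = []
    for ch in text:
        if ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            # digits, spaces and punctuation pass through untouched
            out.append(ch)
    return "".join(out)


# Passing a seeded generator makes a run repeatable
print(scramble("Ben Chapman", random.Random(42)))
```

Unlike ROT13, this can't be reversed, and the same name will come out differently each time it appears, so it is a poor fit if the demo needs to group or join on the munged columns.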

The indentation is probably off as a result of WordPress. Let me know if you have questions or suggestions about this or if you need a properly formatted version.
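
The script above is Python 2 (print statements, string.maketrans, files opened in binary mode for the csv module). If you're on Python 3, something along these lines should do the same job. Treat it as a sketch: the function names munge_rows and munge_file and the keep_header flag are my own inventions, and I've left out the command-line handling.

```python
import codecs
import csv


def rot13(text):
    """ROT13 via the standard library's text-to-text codec."""
    return codecs.encode(text, "rot_13")


def munge_rows(rows, columns, encode=rot13, keep_header=True):
    """Return a copy of rows with the chosen columns passed through encode."""
    result = []
    for number, row in enumerate(rows):
        row = list(row)
        # leave the first row alone when it holds the column headers
        if not (keep_header and number == 0):
            for index in columns:
                row[index] = encode(row[index])
        result.append(row)
    return result


def munge_file(infile, outfile, columns):
    """Read a tab-delimited file, munge the given columns, write the result."""
    # newline="" is what the csv docs recommend when opening files in Python 3
    with open(infile, newline="") as src:
        rows = list(csv.reader(src, dialect="excel-tab"))
    with open(outfile, "w", newline="") as dst:
        csv.writer(dst, dialect="excel-tab").writerows(munge_rows(rows, columns))
```

Calling munge_file("in.tsv", "out.tsv", [0, 1, 3]) would obscure columns 0, 1 and 3 of every row except the header. Because encode is just an argument, any other text-to-text function can be passed to munge_rows in place of rot13.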
