This is one of those posts that is very slightly too long for Twitter and not really long enough to justify a blog post, but here goes anyway.
We get data from our Student Information System in Excel. Imagine something like the following:
last_name, first_name, email, career
Chapman, Ben, firstname.lastname@example.org, LAW
Now imagine that you want to create a MySQL database table and you want to import this data into the table. I use Pentaho Data Integration for this sort of thing. PDI is a great tool and I’ll hopefully go into more detail about it later. In PDI, you create transformations that transform data from one form to another. Here’s an example of a transform that will move data from Excel to a MySQL table:
So, we’re moving from Excel’s untyped data to outputting data for a MySQL data table. PDI will do this for you; however, it will create tinytext columns that are 255 characters wide for each text field that it sees in the Excel file. Let’s assume that you want to match the table definitions more closely to the actual field widths of the data that you are importing.
Sorry, this is sort of a shaggy dog story – all that build up for this:
Here’s what I did: first, I inserted a new row. In cell A1, I entered =max(len(A3:A2000)). I should have actually checked how many rows there are. Obviously, if you’re calculating the maximum of the length of something, you need to include all rows.
Now for the magic. This is Excel 101, but I never use them: array formulas. Highlight the formula, and press Ctl-Shift-Enter. This will insert the braces that you can see in the formula bar above. This will apply the formula calculation to each cell, giving you a very compact way to calculate the maximum length of each cell. Now that you have that maximum, you can create your database field widths with some assurance that you will correctly define the field width. In my case, I often define any field that appears to be less than 10 characters wide as a VARCHAR(10). For wider fields, such as “Descr” above, I would probably define that as a VARCHAR(40).
And that’s that. Just remember Ctl-Shift-Enter for array formulas. To dig deeper into the power of array formulas to summarize data in Excel, take a look at this article from Microsoft. By the way, there are lots of other ways to calculate the maximum length of data in a particular field. There are some very useful tools out there that will do basic data analysis for Excel or CSV files. This was just a convenient way to calculate this information. Other thoughts? Other tools or tips? Let me know in the comments.