Parsing Non-Standard Data In Excel – A Workaround
This post is a little different from our usual content, but I thought it might be helpful to folks sharing a similar problem when trying to format data in Excel. Sometimes I find it useful myself to read how people solve tricky little Excel issues, so I offer this up in that spirit.
I play with a lot of data. Tons. Every once and awhile I come across a data set that isn’t quite exactly in the format I’d like it to be. When that happens, it often takes the form of a long list of text entries in an Excel file, with individual pieces of data crammed together into single cells. That is, data stuck in individual cells that really ought be parsed out into several separate individual cells.
I recently had a situation where I was working with a data set that included about 8,000 company names, with each company name contained in its own individual cell. Unfortunately, the data was a little garbled inasmuch as each company name also had an annoying, irrelvant number attached to the end of it. For example, the records looked something like this:
“Bushwood Clubhouse 132”; “Fletchco 5”; “Bill and Ted’s Excellent Enterprises 85”
Normally, Excel’s “Text-To-Columns” feature (found under the “Data” tab) works nicely for these type of problems. In a click of a button, that function will automatically parse a data list based on the location of spaces, commas, or pre-defined character lengths. However, in this case that didn’t work perfectly because:
(a) the company names I had in the data list were sometimes one word and sometimes multiple words (i.e. inconsistent spacing), which precluded any simple automated breakup based on spaces; and
(b) the (useless) numbers appearing at the end of each company name ranged from one digit to three digits – also inconsistent – precluding any automated, fixed truncation based on that criterion.
Essentially I needed some logic – more of a “smart” parsing so to speak – as opposed to one that applied a single uniform parsing rule. With IBM’S WATSON unfortunately unavailable to me, I needed a workaround. I used a combination of the IFERROR logic function and the RIGHT, LEFT, LEN string functions, and it worked.
A few disclaimers: I claim no style points at all here; in fact, I’m sure there is probably a more elegant approach. Also, this solution is designed to specifically solve my particular problem with this particular data set; however, I’m hoping that the general approach might help inspire some new ideas for someone out there who may be struggling with a similar (albeit not identical) “non-conforming” data set.
First, assume my list of non-conforming data is in column M of my spreadsheet, starting with cell M1. (So, my company names with varying length and with garbage numbers attached to the end: e.g. “Smith Enterprises 112” are all listed down column M, from M1 to M8000.)
Step 1: First, I created 7 columns to the right of my data set in column M. The first 4 columns to the right contain the following variations of an “IFERROR” formula. Specifically:
enter: “=IFERROR(RIGHT(M1,4)/2,1000)” in cell N1
enter “=IFERROR(RIGHT(M1,3)/2,1000)” in cell O1
enter “=IFERROR(RIGHT(M1,2)/2,1000)” in cell P1
enter “=IFERROR(RIGHT(M1,1)/2,1000)” in cell Q1
Basically, I used this as a way to flag the required truncation of the garbage numbers in any individual data point, because I knew in advance that the longest garbage number set in the data would be 3 digits long (a key point for this custom solution.) I’m using the number 1,000 to help me flag and calculate the required length of the truncation.
Step 2: In cell R1, enter: “=SUM(N1:Q1)”
This total (based on the 1,000s that preceded it), will clue the sheet into how long the truncation needs to be. It can tell how long it should be based on the aggregate number of “1,000” value errors that are returned.
Step 3: In cell S1, enter: “=IF(R1<1000,4,IF(R1<2000,3,IF(R1<3000,2,IF(R1<4000,1,0))))”
This is the final step in “figuring out” how many characters to cut off the right side of the text string contained in column M.
Step 4: Finally, in Column T, enter: “=LEFT(M1,LEN(M1)-S1)”
This actually cuts off the garbage data in M1, based on what the sheet “figured out” the garbage length to be, yielding my company name in correct form without the garbage numbers at the end.
Simply copy all those cells in rows down entire length of your data set and it’ll provide an entire column of data (in column T) of correctly parsed data.
As noted, this solution is customized to solve my particular issue with this particular data set. However, I think the overall message is that I learned that with a little creative use of the IFERROR function, Excel’s string analysis functions (LEFT, RIGHT, and LEN), and some human logic, it can be possible to achieve custom parsings when the standby “Text-To-Columns” feature isn’t quite cutting it.
If you have a better, more efficient way to parse non-standard data in Excel, please share! It’s one of those pesky issues that tends to come up quite often when playing with a lot of data.