Regex for parsing CSV data


I have been trying to get this one right for a while now. CSV data keeps popping up time and time again and it is always a pain to have to write code to parse it out. Regex seems likes the perfect answer, but… well… working with Regex is like writing code in APL, suffering amnesia, and then trying to make sense of your code – except with Regex you skip the part about getting amnesia.

At any rate, this is what I have finally come up with:

^(?:([^",]+))?(?=,)|(?<=,)(?:[^",]*)?(?=[,$])|((?<=")[^"(\s*,)][^"]*(?="))|(?<=")(?=")|(?<=,)(?:[^,"])*(?=$)

It seems to handle the follow CSV records in the manner I wish:

abc,"def",,"ghi,jkl",'mno',"",pqr
"abc","def",,"ghi,jkl",'mno',"","pqr"
  abc , "def" , , "ghi , jkl" , 'mno' , "" , "pqr"

Each record should parse out to

  1. [abc]
  2. [def]
  3. []
  4. [ghi,jkl] (or [ghi , jkl] in the 3rd case)
  5. [‘mno’]
  6. []
  7. [pqr]

The tricky part in developing this was to make sure that all the data appears in the Group[1] level of the regex results.  Parsing CSV with Regex is easy to do if you are willing to climb up and down group levels.  But I wanted to be able to parse the record and access the fields like an array, without any additional work.  This does prevent it from working with EVERY possible type of data that could be in a CSV record.  But it does work for most general cases.

Breaking the regex down:

^(?:([^",]+))?(?=,)|(?<=,)(?:[^",]*)?(?=[,$])|((?<=")[^"(\s*,)][^"]*(?="))|(?<=")(?=")|(?<=,)(?:[^,"])*(?=$)

handles the first field in the CSV (except if it it is a quoted field).

^(?:([^",]+))?(?=,)|(?<=,)(?:[^",]*)?(?=[,$])|((?<=")[^"(\s*,)][^"]*(?="))|(?<=")(?=")|(?<=,)(?:[^,"])*(?=$)

captures any fields in the middle of the record (except if it is quoted).

^(?:([^",]+))?(?=,)|(?<=,)(?:[^",]*)?(?=[,$])|((?<=")[^"(\s*,)][^"]*(?="))|(?<=")(?=")|(?<=,)(?:[^,"])*(?=$)

handles all quoted fields while skipping the overlaps that occur when 2 quotes surround a comma and other field data (e.g. def”,,”ghi in record 1 and “”,”pqr” in record 2).

^(?:([^",]+))?(?=,)|(?<=,)(?:[^",]*)?(?=[,$])|((?<=")[^"(\s*,)][^"]*(?="))|(?<=")(?=")|(?<=,)(?:[^,"])*(?=$)

captures empty quotes.

^(?:([^",]+))?(?=,)|(?<=,)(?:[^",]*)?(?=[,$])|((?<=")[^"(\s*,)][^"]*(?="))|(?<=")(?=")|(?<=,)(?:[^,"])*(?=$)

handles an empty last field in the record (unless it is quoted).

This does make the assumption that the CSV is properly formatted (quotes don’t appear in the middle of data) and that single-quotes are not valid data delimiters.

The regex above can be copy-pasted from here:

Regex CsvRegex = new Regex(
"^(?:([^\",]+))?(?=,)|(?<=,)(?:[^\",]*)?(?=[,$])|((?<=\")[^\"(\\s*,)][^\"]*(?=\"))|(?<=\")(?=\")|(?<=,)(?:[^,\"])*(?=$)",
RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
var result = sourcestr
.Select(record => CsvRegex.Matches(sourcestr))
.FirstOrDefault()
.Cast<Match>()
.Select(match => match.Value);

And two excellent sites for developing/testing your regex are My Regex Tester and Regular Expressions 101.

Edit:
My original post did not properly handle empty fields at the start and end of the data.  I have changed the regex to properly handle data such as

,abc,"def",,,"ghi,jkl",'mno',"",pqr,,

Edit:
Another change.  I added the quotation mark to the exclusion in the last phrase.  This allows the expression to handle text where the last column is quoted.  e.g.

,abc,"def",,,"ghi,jkl",'mno',"","pqr"
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s