Regular Expression To Parse CSV

 
Dave Hewy
Ranch Hand
Posts: 93
Hi - I need some help creating a regular expression that will parse a line of comma-separated values. The problem is that some of the values have embedded commas that I want to ignore. Here's an example...

100,to_date('18-Jan-2001','dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'

Look at the value starting with to_date: I don't want its embedded comma to act as a value separator.

Can anyone help with a Regex I can use on String.split to get an array of the values?

Thanks

Dave.
 
Ulf Dittmer
Rancher
Posts: 42968
73
Unless the point is specifically to do this with regexps, why not use a library that reads CSV files, like this one? CSV has a number of edge cases you need to consider, and in the time it takes to implement all of those yourself, you would probably be done if you used ready-made code.
 
Dave Hewy
Ranch Hand
Posts: 93
Yes, I would like to do this with regex if possible, before I investigate other methods.
 
Ilja Preuss
author
Sheriff
Posts: 14112
I'm quite sure that it's not possible.
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
I think Ilja can say that with confidence because the input you provided is not a "regular language" and cannot be parsed by regular expressions. This distinction gets way over my head in language theory, but the shortest and most applicable tip I could find is: "... a language that allows parenthesized expressions, but requires the parentheses to balance, cannot be a regular language, and so the language cannot be generated by a regular grammar ..." from Wikipedia.

I don't think there is a true standard for CSV, but your to_date expression should probably be inside quotes. CSV generators and parsers I've used (e.g. Excel) use the quotes to know they should ignore the comma in the middle.

"to_date('18-Jan-2001','dd-mon-yyyy')"

Does that give you some ideas on how to parse this stuff? Be sure to consider strings with quotes inside them, too!
 
Henry Wong
author
Marshal
Pie
Posts: 21446
84
I agree. Regular expressions aren't able to match expressions where you have to keep track of matching closing braces to unlimited depth. Heck, even if you limit the depth, it can get ridiculously complicated.

For example, if I limit the depth to only one set of "()" pairs, the regex becomes...



This should work for your example string, but it will fail if the "to_date" function contains another function as one of its parameters.
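The regex itself is missing from the post as it appears here. As a stand-in, here is a minimal sketch of one pattern that fits the description: at most one level of "()" pairs, used with find() rather than split(). It is an assumption for illustration, not necessarily the pattern that was posted, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OneLevelFieldFinder {
    // A field is a run of either:
    //   - single characters that are not a comma or a parenthesis, or
    //   - a parenthesized group that contains no further parentheses.
    private static final Pattern FIELD =
            Pattern.compile("(?:[^,()]|\\([^()]*\\))+");

    // Collects every match of FIELD in the line, in order.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String line =
            "100,to_date('18-Jan-2001','dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'";
        fields(line).forEach(System.out::println);
    }
}
```

Note that this sketch silently skips empty fields (two commas in a row), and a nested call such as to_date(f('a','b'),'c') gets cut into the wrong pieces.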

Henry
[ May 18, 2006: Message edited by: Henry Wong ]
 
Ulf Dittmer
Rancher
Posts: 42968
73
Also be aware that, despite their name, CSV files can have semicolons instead of commas, and that strings can include newline characters and still be perfectly valid CSV (thus you can't just process the file line by line). Considering all this, go with a ready-made solution.
 
Dave Hewy
Ranch Hand
Posts: 93
Mmmm - I'm not sure what the link between languages and expressions is?

I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.

Having said that, I'm not a regex expert, hence my original post!

I can quite easily do this with Java, but I have a lot of these to parse and thought regex would probably be faster - but in reality, it probably won't make that much difference.

Thanks anyway for your replies.

Dave.
 
Henry Wong
author
Marshal
Pie
Posts: 21446
84
Can anyone help with a Regex I can use on String.split to get an array of the values?


Oh... the question is for the split() method (and not for the find() method).

For the split() method, it is not possible in the general case. The size of the parameters is not even fixed, so you can't rely on a combination of zero-width negative look-aheads and look-behinds to limit the scope of the commas.
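One narrower case may still be workable with split(). Java look-aheads, unlike look-behinds, are not limited to a bounded length, so if the data never has more than one level of parentheses (and no parentheses inside quoted strings), a negative look-ahead on its own seems to be enough. A minimal sketch under those assumptions, with hypothetical names:

```java
import java.util.regex.Pattern;

class LookaheadSplit {
    // Split on a comma unless it is followed by some non-parenthesis
    // characters and then a ')', which would mean the comma sits inside
    // a (single-level) parenthesized group.
    private static final Pattern TOP_LEVEL_COMMA =
            Pattern.compile(",(?![^()]*\\))");

    static String[] split(String line) {
        // Limit -1 keeps trailing empty fields instead of dropping them.
        return TOP_LEVEL_COMMA.split(line, -1);
    }

    public static void main(String[] args) {
        String line =
            "100,to_date('18-Jan-2001','dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'";
        for (String field : split(line)) {
            System.out.println(field);
        }
    }
}
```

This breaks as soon as calls are nested: f(1,g(2)) is split after the 1, because the next parenthesis the look-ahead meets is an opening one.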

Henry
 
Henry Wong
author
Marshal
Pie
Posts: 21446
84
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
I can quite easily do this with Java, but I have a lot of these to parse and thought regex would probably be faster - but in reality, it probably won't make that much difference.


Well, regular expressions do match most of the time, which makes parsing a breeze. For other cases, regular expressions can be used to match tokens during parsing, making it very easy to write a parser.

Just because the regex engine can't match everything with a single expression doesn't mean you have to write the parser without it.
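To make that concrete, here is a rough sketch of a tiny parser that lets a regex pick out the tokens and keeps the nesting depth in ordinary Java. The token definitions (single-quoted strings, parentheses, commas, everything else) and all names are assumptions based on the example line in this thread, not a general CSV grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizingSplitter {
    // Tokens, tried in this order: a complete single-quoted string,
    // a single structural character (or a stray quote), or a run of
    // anything else.
    private static final Pattern TOKEN =
            Pattern.compile("'[^']*'|[(),']|[^(),']+");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // how many "(" are currently open
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            String tok = m.group();
            if (tok.equals(",") && depth == 0) {
                // A top-level comma ends the current field.
                fields.add(current.toString());
                current.setLength(0);
                continue;
            }
            if (tok.equals("(")) {
                depth++;
            } else if (tok.equals(")") && depth > 0) {
                depth--;
            }
            current.append(tok);
        }
        fields.add(current.toString()); // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        split("100,to_date('18-Jan-2001','dd-mon-yyyy'),-6")
                .forEach(System.out::println);
    }
}
```

Because the depth lives in a counter instead of in the pattern, nested calls such as f(1,g(2,3)) stay in one field, and commas or parentheses inside a single-quoted string are never seen as separate tokens.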

Henry
 
Ulf Dittmer
Rancher
Posts: 42968
73
Mmmm - I'm not sure what the link between languages and expressions is?

I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.


Mathematical expressions can't be parsed purely by using regular expressions, precisely because of the nesting problems described earlier. But regexps can still be helpful in conjunction with other language constructs.

If you're really interested in the theory behind this, you can read up on the Chomsky Hierarchy, and you'll see why regular expressions represent a less powerful language than general mathematical expressions like the one mentioned above.
 
Stefan Wagner
Ranch Hand
Posts: 1923
I don't know what the real-world-problem behind the question is.
Perhaps you can solve it in two steps.

to produce:

100,to_date('18-Jan-2001'#'dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'

and then split that.
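The code for the first step is missing from the post as it appears here. One way the two steps might look is sketched below; the replacement regex, the choice to put the commas back afterwards, and the names are all assumptions, and the sketch only holds if '#' never occurs in the data and parentheses are never nested.

```java
class TwoStepSplit {
    // Step 1: turn every comma that is followed by a ')' (with no other
    // parenthesis in between) into '#'.
    static String maskInnerCommas(String line) {
        return line.replaceAll(",(?=[^()]*\\))", "#");
    }

    // Step 2: split on the remaining commas, then restore the masked ones.
    static String[] split(String line) {
        String[] parts = maskInnerCommas(line).split(",", -1);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace('#', ',');
        }
        return parts;
    }

    public static void main(String[] args) {
        String line =
            "100,to_date('18-Jan-2001','dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'";
        System.out.println(maskInnerCommas(line));
        for (String field : split(line)) {
            System.out.println(field);
        }
    }
}
```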
 
Ilja Preuss
author
Sheriff
Posts: 14112
Originally posted by Dave Hewy:
Mmmm - I'm not sure what the link between languages and expressions is?

I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.


Regular expressions are a way to describe regular languages. Regular expression APIs use those descriptions to parse "sentences" in the described language.


Having said that, I'm not a regex expert, hence my original post!


It's a fascinating topic.
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
I ran up against this with a little macro language. Where I got lucky is that RegEx can easily find the balanced braces for the innermost nested macro. The macro processor replaces that with something else - plain text or maybe more macros. I keep finding and replacing the innermost until there ain't no mo.

You could do that here ... replace the parens and commas with some escape sequence. But just suggesting that makes me feel dirty.
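For what it's worth, the innermost-first idea could be sketched roughly like this. The placeholder characters (control characters U+0001 to U+0003) and the names are assumptions; the sketch presumes those characters never appear in the data and that the parentheses balance.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InnermostFirstSplit {
    // An innermost group: "(" ... ")" with no parentheses in between.
    private static final Pattern INNERMOST = Pattern.compile("\\([^()]*\\)");

    static String[] split(String line) {
        String work = line;
        Matcher m = INNERMOST.matcher(work);
        // Keep escaping the innermost group until no parentheses are left.
        while (m.find()) {
            String escaped = m.group()
                    .replace('(', '\u0001')
                    .replace(')', '\u0002')
                    .replace(',', '\u0003');
            work = work.substring(0, m.start()) + escaped + work.substring(m.end());
            m = INNERMOST.matcher(work); // rescan: an outer group may now be innermost
        }
        // Only top-level commas remain, so a plain split is safe.
        String[] parts = work.split(",", -1);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i]
                    .replace('\u0001', '(')
                    .replace('\u0002', ')')
                    .replace('\u0003', ',');
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String field : split("a,f(1,g(2,3)),b")) {
            System.out.println(field);
        }
    }
}
```

Each pass removes one pair of real parentheses from the working string, so the loop ends once every balanced pair has been escaped.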

BTW: If your data file quotes strings that have commas in them, you can do this with regex. Unescaped quotes must match to a depth of exactly one; nesting is not allowed. Look at the beginning of the first/next field. If it starts with a quote, take everything up to the next unescaped quote followed by a comma or end of line; otherwise take everything up to the next comma or end of line. Rinse, repeat.
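A minimal sketch of that loop, assuming double quotes around fields and a doubled quote ("") as the escape for a quote inside a field. That escape convention is an assumption (it is a common one, but nothing in this thread fixes it), and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedCsvLine {
    // \G anchors each match where the previous one ended.
    // Group 1: contents of a quoted field, where "" stands for one quote.
    // Group 2: an unquoted field (anything up to the next comma).
    // Group 3: the separator, either a comma or empty at end of line.
    private static final Pattern FIELD =
            Pattern.compile("\\G(?:\"((?:[^\"]|\"\")*)\"|([^,]*))(,|$)");

    static List<String> parse(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            String quoted = m.group(1);
            out.add(quoted != null ? quoted.replace("\"\"", "\"") : m.group(2));
            if (m.group(3).isEmpty()) {
                break; // the field ended at end of line, so we are done
            }
        }
        return out;
    }

    public static void main(String[] args) {
        parse("100,\"to_date('18-Jan-2001','dd-mon-yyyy')\",-6")
                .forEach(System.out::println);
    }
}
```

This handles one line at a time, so quoted fields that contain newlines would need the lines joined before parsing.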
[ May 19, 2006: Message edited by: Stan James ]
 