Regular Expression To Parse CSV

 
Dave Hewy
Ranch Hand
Posts: 93
Hi - I need some help creating a regular expression that will parse a line of comma-separated values. The problem is that some of the values have embedded commas that I want to ignore. Here's an example:

100,to_date('18-Jan-2001','dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'

Look at the value starting with to_date: I don't want the embedded comma to act as a value separator.

Can anyone help with a Regex I can use on String.split to get an array of the values?

Thanks

Dave.
 
Ulf Dittmer
Rancher
Posts: 43081
Unless the point is specifically to do this with regexps, why not use a library that reads CSV files, like this one? CSV has a number of edge cases you need to consider, and by the time you have implemented all of those, you would already be done if you had used ready-made code.
 
Dave Hewy
Ranch Hand
Posts: 93
Yes, I would like to do this with regex if possible, before I investigate other methods.
 
Ilja Preuss
author
Posts: 14112
I'm quite sure that it's not possible.
 
Stan James
(instanceof Sidekick)
Posts: 8791
I think Ilja can say that with confidence because the input you provided is not a "regular language" and cannot be parsed by regular expressions. This distinction gets way over my head in language theory, but the shortest and most applicable tip I could find is: "... a language that allows parenthesized expressions, but requires the parentheses to balance, cannot be a regular language, and so the language cannot be generated by a regular grammar ..." from Wikipedia.

I don't think there is a true standard for CSV, but your to_date expression should probably be inside quotes. CSV generators and parsers I've used (e.g. Excel) use the quotes to know they should ignore the comma in the middle.

"to_date('18-Jan-2001','dd-mon-yyyy')"

Does that give you some ideas on how to parse this stuff? Be sure to consider strings with quotes inside them, too!
 
Henry Wong
author
Posts: 23951
I agree. Regular expressions aren't able to match expressions where you have to keep track of matching closing braces to unlimited depth. Heck, even if you limit the depth, it can get ridiculously complicated.

For example, if I limit the depth to only one set of "()" pairs, the regex becomes...



This should work for your example string, but will fail if the "to_date" function contains another function as one of its parameters.

Henry
[ May 18, 2006: Message edited by: Henry Wong ]
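The regex from the post above did not survive in this copy of the thread, so what follows is only a sketch of the same idea and not Henry's original expression. It assumes at most one level of "()" per field, treats a field as a run of non-comma, non-parenthesis characters and unnested "( ... )" groups, and uses find() instead of split(). The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OneLevelFieldFinder {
    // One field: one or more pieces, each either a character that is not a
    // comma or parenthesis, or a "( ... )" group with no parentheses inside.
    private static final Pattern FIELD =
            Pattern.compile("(?:[^,()]|\\([^()]*\\))+");

    public static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {          // each match is one field
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "100,to_date('18-Jan-2001','dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'";
        for (String field : fields(line)) {
            System.out.println(field);
        }
    }
}
```

Note that this sketch silently skips empty fields (two commas in a row) and, as said above, breaks as soon as a function call is nested inside another.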
 
Ulf Dittmer
Rancher
Posts: 43081
Also be aware that, despite their name, CSV files can have semicolons instead of commas, and that strings can include newline characters and still be perfectly valid CSV (thus you can't just process the file line by line). Considering all this, go with a ready-made solution.
 
Dave Hewy
Ranch Hand
Posts: 93
Mmmm - I'm not sure what the link between languages and expressions is?

I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.

Having said that, I'm not a regex expert, hence my original post!

I can quite easily do this with Java, but I have a lot of these to parse and thought regex would probably be faster - but in reality, it probably won't make that much difference.

Thanks anyway for your replies.

Dave.
 
Henry Wong
author
Posts: 23951

Originally posted by Dave Hewy:
Can anyone help with a Regex I can use on String.split to get an array of the values?



Oh... the question is for the split() method (and not for the find() method).

For the split() method, it is not possible. The size of the parameters is not even fixed, so you can't even use a combination of zero-width negative look-aheads and look-behinds, to limit the scope of the commas.

Henry
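For what it's worth, under the much narrower assumption that parentheses are balanced and never nested (and never appear inside quoted strings), a negative look-ahead on its own seems to be enough for split(). Java only requires look-behinds to have a bounded length; a look-ahead may scan arbitrarily far. This is a sketch under those assumptions, with a made-up class name, and it does not handle the general case discussed above.

```java
import java.util.Arrays;

public class LookaheadSplit {
    // Split on a comma unless the text after it reaches a ')' before
    // meeting any other parenthesis, i.e. unless the comma sits inside
    // an unnested "( ... )" group.
    public static String[] split(String line) {
        return line.split(",(?![^()]*\\))");
    }

    public static void main(String[] args) {
        String line = "100,to_date('18-Jan-2001','dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'";
        System.out.println(Arrays.toString(split(line)));
    }
}
```

With nested calls such as f(g(1,2),3) the look-ahead misjudges the last comma and splits inside f(...), so this only helps if the one-level limit really holds for all the data.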
 
Henry Wong
author
Posts: 23951

Originally posted by Dave Hewy:
I can quite easily do this with Java, but I have a lot of these to parse and thought regex would probably be faster - but in reality, it probably won't make that much difference.



Well, regular expressions do match most of the time, which makes parsing a breeze. For other cases, regular expressions can be used to match tokens during parsing, making it very easy to write a parser.

Just because the regex engine couldn't match with a single expression, doesn't mean you have to write a parser without it.

Henry
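To illustrate using a regex only for the tokens, here is a minimal sketch: the pattern chops the line into parentheses, commas, and runs of everything else, and a plain counter tracks the nesting depth. Only commas seen at depth zero end a field, so arbitrary nesting is fine. The class name is hypothetical, and the sketch assumes parentheses never occur inside quoted strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DepthCountingSplitter {
    // A token is a single '(' , ')' or ',' or a run of any other characters.
    private static final Pattern TOKEN = Pattern.compile("[(),]|[^(),]+");

    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;                       // how many "(" are currently open
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            String token = m.group();
            if (token.equals("(")) {
                depth++;
            } else if (token.equals(")")) {
                depth--;
            }
            if (token.equals(",") && depth == 0) {
                fields.add(current.toString());   // top-level comma ends the field
                current.setLength(0);
            } else {
                current.append(token);            // everything else belongs to the field
            }
        }
        fields.add(current.toString());           // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("1,f(g(2,3),4),5"));
    }
}
```

The regex does the easy part (finding tokens) and the counter does the part a regular expression cannot do (remembering how deep you are).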
 
Ulf Dittmer
Rancher
Posts: 43081

Originally posted by Dave Hewy:
Mmmm - I'm not sure what the link between languages and expressions is?

I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.



Mathematical expressions can't be parsed purely by using regular expressions, precisely because of the nesting problems described earlier. But regexps can still be helpful in conjunction with other language constructs.

If you're really interested in the theory behind this, you can read up on the Chomsky Hierarchy, and you'll see why regular expressions represent a less powerful language than general mathematical expressions like the one mentioned above.
 
Ranch Hand
Posts: 1923
I don't know what the real-world problem behind the question is.
Perhaps you can solve it in two steps.

to produce:


100,to_date('18-Jan-2001'#'dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'


and then split that.
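The code for the first step is missing from this copy of the thread, so the following is a guess at the two steps rather than the poster's own code. It assumes one level of parentheses and that '#' never occurs in the data; the class and method names are invented for the example.

```java
import java.util.Arrays;

public class TwoStepSplit {
    // Step 1: turn every comma that sits inside an unnested "( ... )"
    // group into '#'. The look-ahead checks that a ')' follows before
    // any other parenthesis.
    public static String mask(String line) {
        return line.replaceAll(",(?=[^()]*\\))", "#");
    }

    // Step 2: split on the remaining commas, then put the masked commas back.
    public static String[] split(String line) {
        String[] parts = mask(line).split(",");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace('#', ',');
        }
        return parts;
    }

    public static void main(String[] args) {
        String line = "100,to_date('18-Jan-2001','dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'";
        System.out.println(mask(line));
        System.out.println(Arrays.toString(split(line)));
    }
}
```

If '#' can appear in real values, pick a placeholder character that cannot, otherwise the restore step will corrupt those values.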
 
Ilja Preuss
author
Posts: 14112

Originally posted by Dave Hewy:
Mmmm - I'm not sure what the link between languages and expressions is?

I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.



Regular expressions are a way to describe regular languages. Regular expression APIs use those descriptions to parse "sentences" in the described language.


Having said that, I'm not a regex expert, hence my original post!



It's a fascinating topic.
 
Stan James
(instanceof Sidekick)
Posts: 8791
I ran up against this with a little macro language. Where I got lucky is that RegEx can easily find the balanced braces for the innermost nested macro. The macro processor replaces that with something else - plain text or maybe more macros. I keep finding and replacing the innermost until there ain't no mo.

You could do that here ... replace the parens and commas with some escape sequence. But just suggesting that makes me feel dirty.

BTW: If your data file quotes strings that have commas in them, you can do this with regex. Unescaped quotes must match to a depth of exactly one, nesting is not allowed. Look at the beginning of the first/next field. If it starts with a quote, take up to the next unescaped quote & comma or eol, otherwise take up to the next comma or eol. Rinse, repeat.
[ May 19, 2006: Message edited by: Stan James ]
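The field-by-field approach for quoted data could look roughly like this. It is a sketch that assumes double quotes around fields and a doubled quote ("") as the escape for a quote inside a field; the thread does not say which escape convention the data uses, and the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedCsvLine {
    // Group 1: contents of a quoted field ("" stands for one quote).
    // Group 2: an unquoted field (no commas or quotes).
    // Group 3: the terminator, either a comma or the empty match at end of line.
    private static final Pattern FIELD =
            Pattern.compile("(?:\"((?:[^\"]|\"\")*)\"|([^,\"]*))(,|$)");

    public static List<String> parse(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        int pos = 0;
        while (true) {
            m.region(pos, line.length());
            if (!m.lookingAt()) {            // must match right at the current position
                throw new IllegalArgumentException("Malformed field at index " + pos);
            }
            String quoted = m.group(1);
            out.add(quoted != null ? quoted.replace("\"\"", "\"") : m.group(2));
            if (m.group(3).isEmpty()) {      // reached end of line instead of a comma
                break;
            }
            pos = m.end();                   // next field starts after the comma
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("100,\"to_date('18-Jan-2001','dd-mon-yyyy')\",-6"));
    }
}
```

This handles one line at a time, so quoted fields that contain newlines would still need a reader that joins lines first.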
 