
Regular Expression To Parse CSV

+Pie Number of slices to send: Send
Hi - I need some help creating a regular expression that will parse a line of comma-separated values. The problem is that some of the values have embedded commas that I want to ignore. Here's an example...

100,to_date('18-Jan-2001','dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'

Look at the value starting with to_date: I don't want its embedded comma to act as a value separator.

Can anyone help with a Regex I can use on String.split to get an array of the values?

Thanks

Dave.
+Pie Number of slices to send: Send
Unless the point is specifically to do this with regexps, why not use a library that reads CSV files, like this one? CSV has a number of edge cases you need to consider, and by the time you have handled them all yourself, you could have been done long ago using ready-made code.
+Pie Number of slices to send: Send
Yes, I would like to do this with regex if possible, before I investigate other methods.
+Pie Number of slices to send: Send
I'm quite sure that it's not possible.
+Pie Number of slices to send: Send
I think Ilja can say that with confidence because the input you provided is not a "regular language" and cannot be parsed by regular expressions. This distinction gets way over my head in language theory, but the shortest and most applicable tip I could find is: "... a language that allows parenthesized expressions, but requires the parentheses to balance, cannot be a regular language, and so the language cannot be generated by a regular grammar ..." from Wikipedia.

I don't think there is a true standard for CSV, but your to_date expression should probably be inside quotes. CSV generators and parsers I've used (e.g. Excel) use the quotes to know they should ignore the comma in the middle.

"to_date('18-Jan-2001','dd-mon-yyyy')"

Does that give you some ideas on how to parse this stuff? Be sure to consider strings with quotes inside them, too!
+Pie Number of slices to send: Send
I agree. Regular expressions aren't able to match expressions where you have to keep track of matching closing braces to unlimited depth. Heck, even if you limit the depth, it can get ridiculously complicated.

For example, if I limit the depth to only one set of "()" pairs, the regex becomes...



This should work for your example string, but it will fail if the "to_date" function contains another function as one of its parameters.
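
As an illustration only (not necessarily the expression referred to above), here is a minimal sketch of what a depth-one pattern could look like when used with find(). It assumes that parentheses never nest, never appear inside quoted strings, and that empty fields do not matter (this pattern skips them). The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DepthOneFields {
    // A field is a run of characters that are neither commas nor parentheses,
    // mixed with "(...)" groups that contain no further parentheses.
    private static final Pattern FIELD =
            Pattern.compile("(?:[^,()]|\\([^()]*\\))+");

    // Collects every match of FIELD, left to right.
    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}
```

Calling DepthOneFields.fields(...) on the example line should keep to_date('18-Jan-2001','dd-mon-yyyy') together as one field. With a second level of parentheses the inner group no longer matches and the fields come out broken, which is the failure described above.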

Henry
[ May 18, 2006: Message edited by: Henry Wong ]
+Pie Number of slices to send: Send
Also be aware that, despite their name, CSV files can have semicolons instead of commas, and that strings can include newline characters and still be perfectly valid CSV (thus you can't just process the file line by line). Considering all this, go with a ready-made solution.
+Pie Number of slices to send: Send
Mmmm - I'm not sure what the link between languages and expressions is?

I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.

Having said that, I'm not a regex expert, hence my original post!

I can quite easily do this with Java, but I have a lot of these to parse and thought regex would probably be faster - but in reality, it probably won't make that much difference.

Thanks anyway for your replies.

Dave.
+Pie Number of slices to send: Send
 

Can anyone help with a Regex I can use on String.split to get an array of the values?



Oh... the question is for the split() method (and not for the find() method).

For the split() method, it is not possible. The size of the parameters is not even fixed, so you can't even use a combination of zero-width negative look-aheads and look-behinds, to limit the scope of the commas.

Henry
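
A side note on the depth-one case: as far as I know, java.util.regex only requires look-behinds to have a bounded length, while look-aheads may scan arbitrarily far. Under the same restriction as before (at most one level of parentheses, and no parentheses inside quoted strings), a negative look-ahead might therefore be enough for split(). This is a sketch under those assumptions, with hypothetical names, and it still breaks on nested calls.

```java
import java.util.regex.Pattern;

class DepthOneSplit {
    // Split on a comma unless a ')' follows it with no '(' or ')' in between,
    // i.e. unless the comma sits inside a single, non-nested "(...)" group.
    private static final Pattern SEPARATOR =
            Pattern.compile(",(?![^()]*\\))");

    static String[] split(String line) {
        return SEPARATOR.split(line);
    }
}
```

The same pattern string can be passed directly to String.split.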
+Pie Number of slices to send: Send
 

I can quite easily do this with Java, but I have a lot of these to parse and thought regex would probably be faster - but in reality, it probably won't make that much difference.



Well, regular expressions do match most of the time, which makes parsing a breeze. For other cases, regular expressions can be used to match tokens during parsing, making it very easy to write a parser.

Just because the regex engine can't match everything with a single expression doesn't mean you have to write a parser without it.
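
To make that concrete, here is a rough sketch of a tiny parser that uses a regex only to cut the line into tokens and keeps track of parenthesis depth in ordinary Java code. It assumes single-quoted strings without embedded quotes, as in the example line; the class name and token pattern are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedSplitter {
    // Token: a single-quoted string, one structural character,
    // a run of other characters, or a stray quote as a fallback.
    private static final Pattern TOKEN =
            Pattern.compile("'[^']*'|[(),]|[^'(),]+|'");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // number of currently open parentheses
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            String tok = m.group();
            if (tok.equals("(")) {
                depth++;
            } else if (tok.equals(")")) {
                depth--;
            }
            if (tok.equals(",") && depth == 0) {
                // A top-level comma ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(tok);
            }
        }
        fields.add(current.toString()); // last field has no trailing comma
        return fields;
    }
}
```

Because the depth is counted in code rather than in the pattern, nested calls are handled to any depth, and parentheses or commas inside a quoted string are skipped because the whole string is one token.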

Henry
+Pie Number of slices to send: Send
 

Mmmm - I'm not sure what the link between languages and expressions is?

I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.



Mathematical expressions can't be parsed purely by using regular expressions, precisely because of the nesting problems described earlier. But regexps can still be helpful in conjunction with other language constructs.

If you're really interested in the theory behind this, you can read up on the Chomsky Hierarchy, and you'll see why regular expressions represent a less powerful language than general mathematical expressions like the one mentioned above.
+Pie Number of slices to send: Send
I don't know what the real-world problem behind the question is.
Perhaps you can solve it in two steps: first replace the commas inside the parentheses with some other character, such as '#',

to produce:


100,to_date('18-Jan-2001'#'dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'


and then split that.
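
A possible sketch of those two steps, assuming that '#' never occurs in the real data and that parentheses are not nested. The class and method names are hypothetical, and the last loop (putting the commas back) is an extra step not mentioned above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepSplit {
    // A parenthesized group with no parentheses inside it.
    private static final Pattern PARENS = Pattern.compile("\\([^()]*\\)");

    // Step 1: replace every comma inside a "(...)" group with '#'.
    static String maskCommas(String line) {
        Matcher m = PARENS.matcher(line);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String masked = m.group().replace(',', '#');
            m.appendReplacement(sb, Matcher.quoteReplacement(masked));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Step 2: split on the remaining commas, then restore the masked ones.
    static String[] split(String line) {
        String[] fields = maskCommas(line).split(",");
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].replace('#', ',');
        }
        return fields;
    }
}
```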
+Pie Number of slices to send: Send
 

Originally posted by Dave Hewy:
Mmmm - I'm not sure what the link between languages and expressions is?

I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.



Regular expressions are a way to describe regular languages. Regular expression APIs use those descriptions to parse "sentences" in the described language.


Having said that, I'm not a regex expert, hence my original post!



It's a fascinating topic.
+Pie Number of slices to send: Send
I ran up against this with a little macro language. Where I got lucky is that RegEx can easily find the balanced braces for the innermost nested macro. The macro processor replaces that with something else - plain text or maybe more macros. I keep finding and replacing the innermost until there ain't no mo.

You could do that here ... replace the parens and commas with some escape sequence. But just suggesting that makes me feel dirty.
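
For the curious, the innermost-first idea might look roughly like this when applied to the comma problem. It is only a sketch: it assumes the NUL character never appears in the data and that parentheses do not occur inside quoted strings, and all names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InnermostFirst {
    private static final String MARK = "\u0000"; // assumed absent from the data
    // An innermost group: parentheses with no parentheses inside.
    private static final Pattern INNERMOST = Pattern.compile("\\([^()]*\\)");
    private static final Pattern PLACEHOLDER =
            Pattern.compile(MARK + "(\\d+)" + MARK);

    static List<String> split(String line) {
        List<String> saved = new ArrayList<>();
        String work = line;
        Matcher m = INNERMOST.matcher(work);
        // Keep swapping the innermost group for a numbered placeholder
        // until no parentheses are left.
        while (m.find()) {
            saved.add(m.group());
            work = work.substring(0, m.start())
                    + MARK + (saved.size() - 1) + MARK
                    + work.substring(m.end());
            m = INNERMOST.matcher(work);
        }
        // Every remaining comma is a top-level separator.
        List<String> fields = new ArrayList<>();
        for (String field : work.split(",", -1)) {
            fields.add(expand(field, saved));
        }
        return fields;
    }

    // Puts the saved groups back, outermost first.
    private static String expand(String text, List<String> saved) {
        Matcher m = PLACEHOLDER.matcher(text);
        while (m.find()) {
            String original = saved.get(Integer.parseInt(m.group(1)));
            text = text.substring(0, m.start()) + original
                    + text.substring(m.end());
            m = PLACEHOLDER.matcher(text);
        }
        return text;
    }
}
```

Each pass only ever needs to recognize a group with no parentheses inside it, which a regex can do; the loop around it supplies the unlimited depth.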

BTW: If your data file quotes strings that have commas in them, you can do this with regex. Unescaped quotes must match to a depth of exactly one; nesting is not allowed. Look at the beginning of the first/next field. If it starts with a quote, take up to the next unescaped quote followed by a comma or EOL; otherwise take up to the next comma or EOL. Rinse, repeat.
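
A sketch of that field-by-field loop, under assumptions the thread does not spell out: fields are wrapped in double quotes, a quote inside a quoted field is escaped by doubling it, and each record is a single line without a trailing newline. The names are made up, and malformed input simply stops the loop early.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedCsv {
    // \G anchors each match where the previous one ended.
    // Group 1: contents of a quoted field ("" stands for one quote).
    // Group 2: an unquoted field. Group 3: the comma, or empty at end of input.
    private static final Pattern FIELD = Pattern.compile(
            "\\G(?:\"((?:[^\"]|\"\")*)\"|([^,\"]*))(,|\\z)");

    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            String quoted = m.group(1);
            fields.add(quoted != null
                    ? quoted.replace("\"\"", "\"") // undo the doubled quotes
                    : m.group(2));
            if (m.group(3).isEmpty()) {
                break; // matched end of input instead of a comma
            }
        }
        return fields;
    }
}
```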
[ May 19, 2006: Message edited by: Stan James ]
This thread has been viewed 5344 times.