
Regex to parse arguments

 
Pat Farrell
Rancher
Posts: 4678
7
Linux Mac OS X VI Editor
I'm working on parsing a string from an RFC, and I can't get my regex to work. So I've written a small Java program to test. I don't understand the results, so I can't figure out what I'm doing wrong.

The applicable section deals with a "type=" string.

The regex that I'm using is:

The specs are that there can be either a series of type=X separated by semicolons,
type=X;type=Y;type=Z
or you can have a series of arguments,
type=X,Y,Z
where the X values are keywords

It seems to work fine for the "type=X;type=Y" model.
The output doesn't do a proper greedy match with the series of keywords separated by commas, such as:



Thanks
pat
 
Henry Wong
author
Marshal
Pie
Posts: 21437
84
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows

Unfortunately, I think you are confusing how regex groups work. Group 1 is always the first parenthesis, group 2 is always the second parenthesis, and so on.

For example, let's say your pattern is "(hello)*". You can match a long string of 100 hello strings, but in terms of the number of groups, there will only be one group, for the one parenthesis. And its value will be assigned to the last match of the subgroup.
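
A minimal sketch of that point, assuming the standard java.util.regex classes. The class name RepeatedGroupDemo and its describe helper are made up for illustration. It matches "(hello)*" against a chain of hellos and reports groupCount() and group(1).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: one pair of parentheses means one group,
// no matter how many times the group repeats.
final class RepeatedGroupDemo {

    // Returns { groupCount, group(1) } for "(hello)*" matched against
    // the given number of back-to-back "hello" strings.
    static String[] describe(int copies) {
        StringBuilder input = new StringBuilder();
        for (int i = 0; i < copies; i++) {
            input.append("hello");
        }
        Matcher m = Pattern.compile("(hello)*").matcher(input);
        if (!m.matches()) {
            throw new IllegalStateException("pattern should match the whole chain");
        }
        return new String[] { String.valueOf(m.groupCount()), m.group(1) };
    }

    public static void main(String[] args) {
        String[] r = describe(100);
        System.out.println("groupCount = " + r[0]); // 1
        System.out.println("group(1)   = " + r[1]); // hello (the last repetition)
    }
}
```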

Henry
 
Henry Wong
type=CELL,pref,msg:(703) 555-8914
gc: 1 = CELL
gc: 2 = ,msg
gc: 3 = msg
gc: 4 = null
gc: 5 = null
gc: 6 = null
gc: 7 = null


So, the first match is CELL, which is the first paren. The second is ",msg", which is the latest match using the second paren (the earlier match of ",pref" is lost). The third match is "msg", which is the latest match using the third paren (the earlier match of "pref" is lost). And all the rest are null because there were no successful sub-matches with parens 4 through 7.
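
The same behavior can be reproduced with a much smaller pattern. The regex from the original post is not shown in this thread, so the three-group pattern below is only a hypothetical stand-in, and the class and method names are made up for illustration. It is a sketch of the effect, not the original test program.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: a repeated group only keeps its last capture.
final class LastCaptureDemo {

    // Assumed stand-in pattern (NOT the one from the original post):
    // group 1 = first keyword, group 2 = ",keyword", group 3 = keyword.
    static final Pattern TYPE = Pattern.compile("type=(\\w+)(,(\\w+))*");

    // Returns groups 0..groupCount for the first match, or null if none.
    static String[] groups(String input) {
        Matcher m = TYPE.matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("type=CELL,pref,msg:(703) 555-8914");
        for (int i = 0; i < g.length; i++) {
            System.out.println("gc: " + i + " = " + g[i]);
        }
    }
}
```

With this stand-in, group 0 covers "type=CELL,pref,msg", while groups 2 and 3 only report the final ",msg" and "msg"; the ",pref" iteration is overwritten.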

Henry
 
Pat Farrell
Henry Wong wrote: Unfortunately, I think you are confusing how regex groups work. Group 1 is always the first parenthesis, group 2 is always the second parenthesis, and so on.

Wouldn't be the first time. My understanding comes from studying BNF and formal languages 40 years ago; I've not done much serious pattern matching with regex in any language.

Henry Wong wrote: For example, let's say your pattern is "(hello)*". You can match a long string of 100 hello strings, but in terms of the number of groups, there will only be one group, for the one parenthesis. And its value will be assigned to the last match of the subgroup.

Do you not get any indication that you matched "hello" vs "hellohellohello"? Both meet the rule.

Do extra parens help?

So if the term is (foo|baz)*, are you saying that foobazbazfoo is not matched?
My understanding is that it means foo or baz, repeated as many times as you want.

 
Henry Wong
Pat Farrell wrote: So if the term is (foo|baz)*, are you saying that foobazbazfoo is not matched?
My understanding is that it means foo or baz, repeated as many times as you want.


In this case, it does match, but the result is probably not what you are expecting.

Group zero (which hasn't been discussed yet) is the true match of the whole regex, and will be "foobazbazfoo". Group 1 is the first subgroup (whatever the first paren matches). That paren matches 4 times during this match, and the group is assigned the last submatch, which is "foo".
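
A short sketch of the difference between group zero and group 1 for this pattern. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: group 0 is the whole match, group 1 is only the
// last repetition of the parenthesized alternative.
final class GroupZeroDemo {

    // Returns { group(0), group(1) } if the whole input matches, else null.
    static String[] wholeAndLast(String input) {
        Matcher m = Pattern.compile("(foo|baz)*").matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] a = wholeAndLast("foobazbazfoo");
        System.out.println(a[0] + " / " + a[1]); // foobazbazfoo / foo

        String[] b = wholeAndLast("foobaz");
        System.out.println(b[0] + " / " + b[1]); // foobaz / baz
    }
}
```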

Pat Farrell wrote: Do you not get any indication that you matched "hello" vs "hellohellohello"? Both meet the rule.


Well, group zero is different. But you probably mean: how would you deal with each "hello"? In general, the regex is changed so that find() returns the smaller portion, probably just a single "hello", with a lookbehind or lookahead to make sure that it is attached to the neighboring hello. (EDIT: it's probably easier to extract the chain of hellos first, and then use regex again on the chain.)
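
The approach from the EDIT could look roughly like this: one pattern locates the chain, and a second pattern is run with find() over just that chain. This is a sketch using the foo/baz alternatives from earlier in the thread; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: extract the chain first, then walk it with find().
final class ChainTokens {

    // Step 1 pattern: the whole run of foo/baz items (non-capturing group).
    static final Pattern CHAIN = Pattern.compile("(?:foo|baz)+");
    // Step 2 pattern: a single item inside the chain.
    static final Pattern ITEM = Pattern.compile("foo|baz");

    // Returns every item of the first chain found in the text, in order.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher chain = CHAIN.matcher(text);
        if (chain.find()) {
            Matcher item = ITEM.matcher(chain.group());
            while (item.find()) {
                out.add(item.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("xxfoobazbazfoo!!")); // [foo, baz, baz, foo]
    }
}
```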

Henry
 
Pat Farrell
Henry Wong wrote: Group zero (which hasn't been discussed yet) is the true match of the regex

Thanks Henry.
I've been playing around with it, and there seems to be no way to get the individual values of the earlier parts matched by
(foo|baz)*
Getting the last one is easy.

Looks like I'll need to use one regex to identify the substring that matches the final BNF, and then use another to parse/split it into pieces.
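
A rough sketch of that two-step idea, applied to the "type=" forms described at the top of the thread. The RFC grammar is not quoted here, so treating a keyword as \w+ is an assumption, and the class name, pattern, and method are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical two-step parser: regex finds each "type=..." parameter,
// then a plain split breaks the comma-separated keyword list apart.
final class TypeParamParser {

    // Assumption: keywords are runs of word characters (\w+).
    // Group 1 captures the whole list after "type=", e.g. "CELL,pref,msg".
    static final Pattern TYPE_PARAM = Pattern.compile("type=(\\w+(?:,\\w+)*)");

    // Collects keywords from both forms: type=X;type=Y and type=X,Y,Z.
    static List<String> typeValues(String text) {
        List<String> values = new ArrayList<>();
        Matcher m = TYPE_PARAM.matcher(text);
        while (m.find()) {
            for (String keyword : m.group(1).split(",")) {
                values.add(keyword);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(typeValues("type=X;type=Y;type=Z"));              // [X, Y, Z]
        System.out.println(typeValues("type=CELL,pref,msg:(703) 555-8914")); // [CELL, pref, msg]
    }
}
```

Because the repetition lives inside a single capturing group, group 1 holds the full comma list, and nothing is lost to the last-capture-wins rule.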

Where is snobol when we need it?
 