Win a copy of Murach's Python Programming this week in the Jython/Python forum!
  • Post Reply Bookmark Topic Watch Topic
  • New Topic

Need help in regular expression  RSS feed

Maan Suraj
Ranch Hand
Posts: 299
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I need some help in the below regex. I am going through below site for learnign regular expressins. So far it has been going good
but kind of stuck with below example

in above link is "Backtracking Into Capturing Groups"

String is <boo>bold</b> and regex is <(A-Z][A-Z0-9]*)[^>.*?</\1>

I know iam getting it wrong, however acc to my understanding , and the articles i have read on regex so far, i felt the regex should have worked like below......Please correct me

Regex ---Token String
<([A-Z][A-Z0-9]*)[^>].*?</\1> <boo>bold

1) < consumes <

2) [A-Z] in round bracket consumes b

3) [A-Z0-9]* in round bracket consumes oo

therefore , first backreference stores boo

4) ^> doesnot match >

Since the above token has star, so it is ok and we proceed to next token of regex. The position of string remains same

5) > consumes > (which is first one in the string)

6) .*? lazy Regex Engine will skip this token as . is lazy

7) < doesnot match b

so engine backtracks to pt 6, and . consumes b. similarly backtracking occurs over and over and . consumes "bold"

8)< consumes < (which is second one)

9)\1 which i think like mentioned in point 3, it must have value boo

boo doesnot match b

10)So engine will backtrack to point 6 and now . will consume "bold<"

11)< doesnot match \b

so enigne backtracks and i guess . will now consume "bold<\b"

but somehow its getting confusing from here ...Could anyone please help...The site mentioned below explains something else....iam unable to get it.....Thanks for your patience in advance

topic in above link is "Backtracking Into Capturing Groups"

Rob Spoor
Posts: 20893
Chrome Eclipse IDE Java Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Your regex is incorrect, and does not match your description of it. It is missing the ending > of the start tag. You can verify by adding some more capturing groups and then checking the results:
0: <boo>bold (everything)
1: b ([A-Z][A-Z0-9]*)
2: o (non optional [^>])
3: o>bold (everything up to )

A quick fix in the regex: <([A-Z][A-Z0-9]*)[^>]*>.*?</\\1>
The [^>] is made optional by requiring it 0 or more times, and the closing > is added. If I keep the same capturing groups (around [^>]* and around .*?) the output is then this:
0: <boo>bold
1: b (because you are looking for the end tag )
2: oo ([^>]*)
3: bold (.*?)
Maan Suraj
Ranch Hand
Posts: 299
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Sorry iam not getting it...Actually please see the link and the topic i mentioned in my first post. Also, actually iam more interested in knowing how does the Regex Engine works in the above case and not exactly on the output. Actually, in my first post, i have tried to put down my understanding on token by token basis. I know, it may not be completely correct but its not completely wrong either...

I would like to know on the above lines i.e how the regex works........Thanks for all the efforts put by you in explaining, but if someone could explain taking every token into account, then it may be more helpful so that i can zero in on my error in understanding....

In the site i mentioned in my first post, they say... iam unable to get it

Let's take the regex <([A-Z][A-Z0-9]*)[^>]*>.*?</\1> without the word boundary and look inside the regex engine at the point where \1 fails the first time. First, .*? continues to expand until it has reached the end of the string, and </\1> has failed to match each time .*? matched one more character.

Then the regex engine backtracks into the capturing group. [A-Z0-9]* has matched oo, but would just as happily match o or nothing at all. When backtracking, [A-Z0-9]* is forced to give up one character. The regex engine continues, exiting the capturing group a second time. Since [A-Z][A-Z0-9]* has now matched bo, that is what is stored into the capturing group, overwriting boo that was stored before. [^>]* matches the second o in the opening tag. >.*?</ matches >bold<. \1 fails again.

The regex engine does all the same backtracking once more, until [A-Z0-9]* is forced to give up another character, causing it to match nothing, which the star allows. The capturing group now stores just b. [^>]* now matches oo. >.*?</ once again matches >bold<. \1 now succeeds, as does > and an overall match is found. But not the one we wanted.

Thanks in advance!
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!