but I am kind of stuck on the example below.
The topic in the above link is "Backtracking Into Capturing Groups".
The string is <boo>bold</b> and the regex is <([A-Z][A-Z0-9]*)[^>].*?</\1>
I know I am getting it wrong; however, according to my understanding and the articles I have read on regex so far, I felt the regex should have worked as described below. Please correct me.
Regex token --- String
1) < consumes <
2) [A-Z] inside the parentheses consumes b
3) [A-Z0-9]* inside the parentheses consumes oo
Therefore, the first backreference stores boo
4) [^>] does not match >
Since the above token has a star, that is OK and we proceed to the next token of the regex. The position in the string remains the same.
5) > consumes > (the first one in the string)
6) .*? is lazy, so the regex engine skips this token at first
7) < does not match b
So the engine backtracks to point 6, and . consumes b. Backtracking similarly occurs over and over until . has consumed "bold"
8) < consumes < (the second one)
9) \1, which I think, as mentioned in point 3, must have the value boo
boo does not match b
10) So the engine will backtrack to point 6, and now . will consume "bold<"
11) < does not match /b
So the engine backtracks, and I guess . will now consume "bold</b"
But somehow it gets confusing from here. Could anyone please help? The site mentioned below explains something else, and I am unable to follow it. Thanks in advance for your patience.
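One way to check a walkthrough like this against a real engine is to run the regex and look at what the group ends up holding. This is only a minimal sketch in Python: it assumes the full regex as printed in the linked tutorial, and it assumes case-insensitive matching (re.IGNORECASE), since [A-Z] would otherwise never match the lowercase tag name.

```python
import re

# Regex as printed in the tutorial; IGNORECASE is an assumption so that
# [A-Z] can match the lowercase letters in "<boo>".
pattern = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>", re.IGNORECASE)

m = pattern.search("<boo>bold</b>")
print(m.group(0))  # the whole match: <boo>bold</b>
print(m.group(1))  # what the backreference finally holds: b
```

Group 1 ends up as b rather than boo, which suggests the engine does not stop at expanding .*? but also backtracks into the capturing group itself.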
0: <boo>bold</b> (everything)
1: b ([A-Z][A-Z0-9]*)
2: o (non-optional [^>])
3: o>bold (everything up to </b>)
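The group listing above can be reproduced with a short Python sketch. It assumes the regex with a single non-optional [^>], extra capturing groups around [^>] and .*?, and case-insensitive matching; none of those settings are stated explicitly in the post, so treat them as guesses.

```python
import re

# Assumed form of the regex: [^>] is NOT optional here, and groups 2 and 3
# are added only to make the intermediate matches visible.
pattern = re.compile(r"<([A-Z][A-Z0-9]*)([^>])(.*?)</\1>", re.IGNORECASE)

m = pattern.search("<boo>bold</b>")
print(m.group(0))  # <boo>bold</b>
print(m.groups())  # ('b', 'o', 'o>bold')
```

Because [^>] must consume one character, the group first has to give up an o, and it then gives up a second one so that \1 can match the b of the closing tag.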
A quick fix in the regex: <([A-Z][A-Z0-9]*)[^>]*>.*?</\1>
The [^>] is made optional by allowing it zero or more times, and the closing > is added. If I keep the same capturing groups (around [^>]* and around .*?), the output is then this:
1: b (because you are looking for the end tag </b>)
2: oo ([^>]*)
3: bold (.*?)
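The same check for the fixed regex, again as a hedged Python sketch with the extra capturing groups and case-insensitive matching assumed:

```python
import re

# Fixed regex with [^>]* and the closing >; groups 2 and 3 are added
# only for inspection and are not part of the original pattern.
pattern = re.compile(r"<([A-Z][A-Z0-9]*)([^>]*)>(.*?)</\1>", re.IGNORECASE)

m = pattern.search("<boo>bold</b>")
print(m.groups())  # ('b', 'oo', 'bold')
```

Here [^>]* absorbs the oo that the capturing group was forced to give up, so the pattern still matches even though the opening and closing tag names differ.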
I would like to understand, along the lines above, how the regex works. Thanks for all the effort you have put into explaining, but if someone could explain it taking every token into account, that would be more helpful, so that I can zero in on the error in my understanding.
On the site I mentioned in my first post, they say the following, which I am unable to follow:
Let's take the regex <([A-Z][A-Z0-9]*)[^>]*>.*?</\1> without the word boundary and look inside the regex engine at the point where \1 fails the first time. First, .*? continues to expand until it has reached the end of the string, and </\1> has failed to match each time .*? matched one more character.
Then the regex engine backtracks into the capturing group. [A-Z0-9]* has matched oo, but would just as happily match o or nothing at all. When backtracking, [A-Z0-9]* is forced to give up one character. The regex engine continues, exiting the capturing group a second time. Since [A-Z][A-Z0-9]* has now matched bo, that is what is stored into the capturing group, overwriting boo that was stored before. [^>]* matches the second o in the opening tag. >.*?</ matches >bold<. \1 fails again.
The regex engine does all the same backtracking once more, until [A-Z0-9]* is forced to give up another character, causing it to match nothing, which the star allows. The capturing group now stores just b. [^>]* now matches oo. >.*?</ once again matches >bold<. \1 now succeeds, as does > and an overall match is found. But not the one we wanted.
Thanks in advance!
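The quoted explanation can be approximated in Python by pinning the capturing group to each candidate in the order the engine would try them (boo, then bo, then b) and testing whether the rest of the regex can still match. This is only a simulation of the backtracking order, not a trace of the engine itself. The helper name rest_matches is made up for this sketch, and placing \b right after the group is an assumption about what "the word boundary" refers to.

```python
import re

subject = "<boo>bold</b>"
tag = "boo"

def rest_matches(candidate, text):
    """Hypothetical helper: fix group 1 to `candidate`, then try the rest."""
    pinned = "<(" + re.escape(candidate) + r")[^>]*>.*?</\1>"
    return re.match(pinned, text) is not None

# Greedy [A-Z0-9]* gives up one character at a time: boo, bo, b.
for end in range(len(tag), 0, -1):
    candidate = tag[:end]
    print(candidate, rest_matches(candidate, subject))
# boo False
# bo False
# b True

# Assumed variant with a word boundary after the group: the group can no
# longer shrink to "bo" or "b", because \b fails between two word characters.
with_boundary = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>", re.IGNORECASE)
print(with_boundary.search(subject))                 # None
print(with_boundary.search("<b>bold</b>").group(1))  # b
```

With the boundary in place, <boo>bold</b> is rejected while a properly paired <b>bold</b> still matches.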