Henry Wong wrote:
Here is an example that does not do either ...
Pattern : \w*xx
source : abc 123 yxxx yxxxxxxxxxxxxxxxxxxxxx
The first call to the find() method will return "yxxx". This is *not* the rightmost match. This is *not* the largest match. And this doesn't include the beginning of the source.
Wong is right (and wrong is wrong). I would like to elaborate on this with an example to clarify any confusion. This was also my misinterpretation of the greedy quantifiers and their presumed behavior described in the K&B. As Wong stated, the rightmost match doesn't mean that the entire source text is consumed by the engine, then it starts reading backward (from right to left). Not at all. Taking the Wong's example, with a bit of modification so I can fit it on the same line:
Pattern :
\w*xx
source :
abc 123 yxxx yxxxxxxx
The engine reads from the left (index [0]) to see if it can find a match; "a" fits the
"\w" pattern and since greedy quantifier
"*" looks for "zero or more" of these "words," it matches forward. It carries on with consecutive number of matches (index [1] and [2]) until it hits index [3] which is a "white space", and everything that it has matched so far goes kaput. Remember, the reason it has come this far is because it hadn't matched the first "x" needed to advance further. It starts all over at the index [4], "1"; it matches "\w", and because of "*", it consumes all the way to the index [7], in which, once again, it fails because the character at this index ([7] is a "white space". Everything is discarded.
Index [8] looks promising which satisfies "\w" but still no "x"; it moves to index [9] -- woola! First "x" is matched, moving on to the next index, [10]. It's another "x"; great! I think a match has been found -- "yxx", but this is a greedy quantifier. The search does not end here. It continues on to consume another character, because it needs to feed the beast -- it is its nature. Now, we are at index [11] and what do you know, it's another "x", so our engine feasts on it, because taking into account the previous character at index [10] with index [11] (and everything that it has matched so far, index [8] and [9], "yx"), it now has another "xx" (index [10] and [11] this time) lined up.
It keeps going one character further, but at index [12], it hits a road block; it's a "white space" again, so now it backs one character off. That's what they mean by finding the rightmost match, because, for instance in this case, when the engine backs one character to the left [from the right to left], it, once again, has a match ("yxxx"). So it seizes it, and everyone is happy. The same happens for a match that starts from the index [13], all the way to [20], "yxxxxxxx". Notice, in this case, the engine goes all the way to the end because all the characters consumed to the end (index [20]), from index [13], it now satisfies the pattern.
This is exactly why they are called "greedy" quantifiers, because they eat up as many characters as possible, as they advance to the right, until the engine fails the matching, at which point, the engine works backward to again revert to the position where the matched substring was satisfying the pattern.
Also remember, that the engine starts reading from the left, that's why your first match's start() is [8] and end() is [11], which again, is a proof that the engine doesn't start from index [20] and works its way backward to index [13] as it was implied in the original post.
Now, what happens if we substitute the
"*" quantifier with
"?" and tweak the source (notice a space between "123" and "yxxx" is removed)?
Pattern :
\w?xx
source :
abc 123yxxx yxxxxxxx
Same process takes place. From index [0], the engine begins reading one character at a time but doesn't find any "x" until it encounters a white space at index [3] and starts all over from index [4]. It finds the a match for "\w" but the next index, [5], is a character "2" which is not what it was looking for, "x" (remember, "?" looks for "one or zero" match, so it can't afford having any character other than "x" after it satisfies "\w"). It bumps a reset pointer to the next character at index [5], but index [6] is "3", and the same faith follows it. This process continues until a match at index [8] is found to be an "x" (remember, at this point, the match has starting at index [7]). The next character read is another "x", at index [9]. Now, the "\w?xx" has been achieved, so the first match is [7] to [9] (end() actually shows [10] but this index is exclusive while [7] is inclusive).
The match is reset at index [10] which satisfies "\w" but index [11], a white space, throws the match away. Starting at index [12], the matching finds its one or zero "\w" ("y"), so it attempts to read the next character at index [13], which is an "x". Good! The next character is "x", index [14], and once again, pattern has been matched. So the second match is index [12] to [14] "yxx" (again end() shows [15] but it is exclusive).
Is there any other matches? You bet. Starting at index [15], it reads "x" which is good on "\w". it reads indexes [15] and [16] one character at a time. At that point, it has found another match -- [15] to [17] (end() prints [18] but that's exclusive), "xxx". The last ditch of attempt is to read index [18] which satisfies "\w" pattern, and index [19] which is an "x". At this point, the engine decides that this is also a match because "?" quantifier requires zero or one characters, and in order to satisfy the last two characters of the pattern, "xx", it takes indexes [18] and [19] as its match that has no "\w". We're done.
Performing the same matching with a reluctant quantifier would be different, as expected. Taken the first pattern, albeit modifying it to a reluctant quantifier, we discussed with the second example's source:
Pattern :
\w*?xx
source :
abc 123yxxx yxxxxxxx
This time, rather than indexes from [4] to [10] are declared as a match, the engine reads up to, and including, index [9], but since it is not a "greedy" quantifier, it reluctantly accepts whatever pattern it has successfully matched so far, "123yxx". So with reluctant quantifiers, as soon as a match is realized, it stops and marks it as its match. Continuing through, it reads index [10], which satisfies "\w" and attempts to read index [11], a white space, which fails the patter, and it is reset.
Index [12] seems to do us good with "\w", and the engine consumes the next character at index [13], "x". It has now provided a match for our first "x". Reading index [14], another "x", solidify our pattern, and the engine reluctantly accepts "yxx" -- index [12] to [14]. Once again, as soon as the engine finds its match from the source, it stops and find() returns true.
Starting at index [15], an "x", is read which satisfies "\w". The next character at [16] is read, another "x", and the engine declares "x" a match -- [15] to [16] (end() is [17] but that's exclusive). What happened? Wasn't it suppose to read another character, an "x" at index [17], to get one "x" for "\w" and two others for the end of "\w*?xx" pattern? Well, since the quantifier is
"*", it can match one or
zero characters. Moreover, since it is a "reluctant" quantifier, it prefers the shortest answer, which is "zero character plus 'xx'".
Same reasoning can be made for the next match, index [17] and [18]. Then the lonely "x" at index [19] doesn't have a chance to be of any match. The matching game is over.
I hope this helps.