
doubt on group() in Matcher class

 
Ranch Hand
Posts: 34



I have two doubts regarding this program, which looks for matches of zero or more digits:
  • The first match is found at index 0, since there are zero digits there. But why does the matcher return an empty string when the actual match is "a"? Isn't group() supposed to return the match that was found?
  • I'm totally confused by the execution. Can anyone explain how the matching is done?


  •  
    Ranch Hand
    Posts: 105
    The \\d metacharacter matches a digit, and * is a greedy quantifier meaning zero or more, so \\d* matches zero or more digits. That's why you get an empty string wherever there are no digits: zero digits still counts as a match. Use the + quantifier instead if you want at least one digit.
    [ December 06, 2008: Message edited by: Pawan Arora ]
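The difference between the two quantifiers can be sketched as follows. The original program is not shown in the thread, so this is only an illustration: it assumes the source string "ab34ef" mentioned later in the discussion, and the class name StarVersusPlus and the helper matches() are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not the program from the original post.
class StarVersusPlus {
    // Collects every match reported by find() as "startIndex:matchedText".
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Zero or more digits: zero-length matches appear wherever there is no digit.
        System.out.println(matches("\\d*", "ab34ef")); // [0:, 1:, 2:34, 4:, 5:, 6:]
        // One or more digits: only the run of digits matches.
        System.out.println(matches("\\d+", "ab34ef")); // [2:34]
    }
}
```

With +, the empty matches disappear because at least one digit is required at every match position.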
     
    Sheriff
    Posts: 9707
    Well, this is the behavior of greedy quantifiers as I have observed it. I haven't experimented with it much, so I may be wrong.

    When you use * with a character class like \d or \w, it becomes reluctant to find the matching pattern. It will start returning zero-length matches.

    But when you use * with dot (.), it becomes greedy. It tries to match the . against as many characters as it can. So if you search for .*\\d, it starts searching from the right and matches the first digit that it finds...
     
    saipavan vallabhaneni
    Ranch Hand
    Posts: 34
    Thanks Ankit and Pawan,

    Ankit, since greedy quantifiers read the entire source string and then work back from the rightmost position looking for a match, I was wondering why start() printed 0 in the first place. I expected 5 to be printed, because "f" is a perfect match: it has 0 digits in it.
    I am really confused by this.

    Can anyone elaborate on the execution sequence?
     
    Ankit Garg
    Sheriff
    Posts: 9707
    saipavan, you got my point wrong. If you use this

    \\d*

    then it will look for zero or more occurrences of any digit. It will scan the string "ab34ef" (indexes 0 to 5) as follows:



    It will find zero occurrences of a digit at index 0,
    then it will find zero occurrences of a digit at index 1,
    then it will find two occurrences of a digit at index 2,
    then it will find zero occurrences of a digit at index 4,
    then it will find zero occurrences of a digit at index 5,
    then it will find zero occurrences of a digit at index 6.

    I hope this clears your doubt...
     
    saipavan vallabhaneni
    Ranch Hand
    Posts: 34
    Thanks Ankit,

    but isn't it true that a greedy quantifier reads the entire source string once, then works back from the right to find the match, and includes the part of the source to the left of the match in the final match?

    source: yyxxxyxx
    regex: .*xx
    output: yyxxxyxx (the match is found at the end, and the part of the source before it is included in the output, since the entire source ends in xx)
     
    author
    Posts: 23951

    but isn't it true that a greedy quantifier reads the entire source string once, then works back from the right to find the match, and includes the part of the source to the left of the match in the final match?



    Keep in mind that there are two things going on here. First, there is the regex, which includes a greedy quantifier. A greedy quantifier tries to match as much as possible, backing off only if the rest of the regex would otherwise fail to match.

    Second, there is the logic of the find() method. The find() method determines where in the string each match attempt starts. It "finds" matches from the start of the string to the end, applying the regex at each starting position and returning the matches that it finds.

    Henry
    [ December 06, 2008: Message edited by: Henry Wong ]
     
    Henry Wong
    author
    Posts: 23951
    Using the example above (source yyxxxyxx, regex .*xx) and applying the principles from the previous post...



    The find() method will start at index 0 and apply the regex. The regex will greedily match the whole string with the ".*" portion, but must then give back the last two letters so that the "xx" part of the regex can also match.

    On the next call, the find() method will start at the end of the previous match, which is index 8, and apply the regex. The regex will fail to match: the ".*" portion can match (zero characters), but the "xx" portion can't. So the find() method will return false.

    Henry
    [ December 06, 2008: Message edited by: Henry Wong ]
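The two find() calls described above can be reproduced with a short sketch. The class name GreedyDotStar and the helper next() are hypothetical names chosen for this illustration; only the source string and regex come from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the ".*xx" example discussed in the thread.
class GreedyDotStar {
    // Calls find() once and describes the result as "start-end text", or "no match".
    static String next(Matcher m) {
        return m.find() ? m.start() + "-" + m.end() + " " + m.group() : "no match";
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile(".*xx").matcher("yyxxxyxx");
        // First call: ".*" takes everything, then gives back two characters for "xx".
        System.out.println(next(m)); // 0-8 yyxxxyxx
        // Second call: the search resumes at index 8, where "xx" cannot match.
        System.out.println(next(m)); // no match
    }
}
```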
     
    saipavan vallabhaneni
    Ranch Hand
    Posts: 34
    Thanks Henry,
    but in the first code snippet, with source "ab34ef", why does group() return null (when start() returns 0) instead of "a", since "a" has 0 or more digits? group() returns the match that was found, which in my guess is "a" rather than the null that group() returned.
     
    Ankit Garg
    Sheriff
    Posts: 9707
    I would still stick to my words. If you use * with a dot(.), then * will become greedy. But if you put * with a pattern, then * will be reluctant.

    See this example



    Just compile and run this program and you will see what I am trying to say...
     
    saipavan vallabhaneni
    Ranch Hand
    Posts: 34
    Thanks Ankit,
    I now understand a little better how the methods work...
     
    Henry Wong
    author
    Posts: 23951

    Originally posted by saipavan vallabhaneni:
    Thanks Henry,
    but in the first code snippet, with source "ab34ef", why does group() return null (when start() returns 0) instead of "a", since "a" has 0 or more digits? group() returns the match that was found, which in my guess is "a" rather than the null that group() returned.



    First of all, it is *not* returning null. It is returning a zero-length string, which is what was matched. And BTW, how can it return "a"? That doesn't even match!! But here is the complete explanation...

    The find() method will start at index 0 and apply the regex. The character at this location isn't a digit, so the regex can't consume anything, but "zero digits" still satisfies \\d*, so it matches a zero-length string.

    On the next call, the find() method would start at the end of the previous match, which is index 0, but after a zero-length match it advances the index by at least 1, so it starts at index 1. Again the character at this location isn't a digit, so the regex matches a zero-length string.

    On the next call, the find() method would start at the end of the previous match, which is index 1, but again it advances the index by 1, so it starts at index 2. The regex does find digits at this location and greedily matches all of them: it matches "34".

    On the next call, the find() method will start at the end of the previous match, which is index 4, and apply the regex. The character at this location isn't a digit, so the regex matches a zero-length string.

    On the next call, the find() method would start at the end of the previous match, which is index 4, but it advances the index by 1, so it starts at index 5. The character at this location isn't a digit, so the regex matches a zero-length string.

    On the next call, the find() method would start at the end of the previous match, which is index 5, but it advances the index by 1, so it starts at index 6. There is no character at this location, since index 6 is the end of the string, but "zero digits" still matches, as a zero-length string.

    Note that this index is at the end of the string. This is allowed because technically it is possible to have a zero-length string at the end of a string. Weird, but true.

    On the next call, the find() method would start at the end of the previous match, which is index 6, but it advances the index by 1, so it starts at index 7. The regex can't match anything because this location doesn't exist. It can't even match a zero-length string, because index 7 is past the end of the string.

    Henry
    [ December 07, 2008: Message edited by: Henry Wong ]
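The call-by-call walkthrough above can be checked with a small trace. This is a sketch rather than the original poster's program (which is not shown in the thread); the class name FindWalkthrough and the helper trace() are hypothetical. It prints the start and end index of every successful find(), so zero-length matches show up as equal start and end values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class tracing each successful find() call.
class FindWalkthrough {
    // Returns one line per successful find(): "start-end 'matchedText'".
    static String trace(String regex, String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // group() is an empty string, not null, for a zero-length match.
            String text = m.group();
            out.append(m.start()).append('-').append(m.end())
               .append(" '").append(text).append("'\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints six lines: 0-0 '', 1-1 '', 2-4 '34', 4-4 '', 5-5 '', 6-6 ''
        System.out.print(trace("\\d*", "ab34ef"));
    }
}
```

The seventh call to find() returns false, which is what ends the loop.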
     
    Henry Wong
    author
    Posts: 23951

    I would still stick to my words. If you use * with a dot(.), then * will become greedy. But if you put * with a pattern, then * will be reluctant.



    Can you elaborate what you mean by this statement?

    Greedy means to match as much as possible, but back off (match less) if matching more would cause the overall regex to fail. Reluctant means to match as little as possible, but match more if matching less would cause the overall regex to fail. Whether a match is greedy or reluctant depends on the quantifier, not on what is being matched.

    Henry
    [ December 07, 2008: Message edited by: Henry Wong ]
     
    Ranch Hand
    Posts: 952

    but in the first code snippet, with source "ab34ef", why does group() return null (when start() returns 0) instead of "a", since "a" has 0 or more digits? group() returns the match that was found, which in my guess is "a" rather than the null that group() returned.



    Since the regex is "\\d*", the matcher is trying to find digits, and "a" is not a digit.
    If "a" were a digit, then it would certainly have returned "a".
    But the matcher finds zero digits at index 0, so group() returns an empty string rather than "a".
    [ December 07, 2008: Message edited by: Punit Singh ]
     
    Ankit Garg
    Sheriff
    Posts: 9707
    Hi Henry, I backed up what I said with an example. If I search for
    .*\\d
    in
    1bxfdsx3xss5

    then it would match everything through the last 5, as .* would be greedy. But if you search for
    \\d*
    in
    1bxfdsx3xss5

    then it would match as little as possible. So it would give you empty matches at index 1, 2, 3, 4, etc. This is what I was trying to say. As I said earlier, I may be wrong, since I have not experimented with this much...
     
    Henry Wong
    author
    Posts: 23951

    then it would match as little as possible. So it would give you empty matches at index 1, 2, 3, 4, etc. This is what I was trying to say. As I said earlier, I may be wrong, since I have not experimented with this much...



    No... This is not what greedy means. Greedy doesn't mean that it always matches a lot of stuff. Those empty matches at index 1, 2, etc., are greedy matches: the regex is trying to match as much as possible, but there is simply nothing at those positions for it to match.

    I'll give a better example between greedy and reluctant in my next post.

    Henry
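One way to check that greediness belongs to the quantifier rather than to what it is applied to is to compare each pattern with its reluctant form on the same input. The following is only a sketch: the class name QuantifierDecidesGreediness and the helper firstMatch() are hypothetical, and the input is the string from the post being discussed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing greedy and reluctant forms of the same pattern.
class QuantifierDecidesGreediness {
    // Returns the text of the first match, or null if find() fails.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "1bxfdsx3xss5";
        // Greedy \d* takes the digit available at index 0.
        System.out.println("[" + firstMatch("\\d*", input) + "]");   // [1]
        // Reluctant \d*? is satisfied with zero digits at index 0.
        System.out.println("[" + firstMatch("\\d*?", input) + "]");  // []
        // Greedy .* runs to the end, then backs off so \d can match the final 5.
        System.out.println("[" + firstMatch(".*\\d", input) + "]");  // [1bxfdsx3xss5]
        // Reluctant .*? takes nothing, so \d matches the leading 1.
        System.out.println("[" + firstMatch(".*?\\d", input) + "]"); // [1]
    }
}
```

In both pairs, only the trailing ? changes the behavior; \\d* on its own is greedy, just like .* is.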
     
    Henry Wong
    author
    Posts: 23951
    Let's use an example mentioned in this topic...


    source: yyxxxyxx
    regex: .*xx



    This will do a greedy match of any character, and then match "xx" at the end, with a call to find(). The ".*" portion is greedy and hence will try to match everything. However, it must back off two characters, because if it didn't, the "xx" portion of the regex would not match.

    Basically, the ".*" portion of the regex will match "yyxxxy", while the whole regex will match the whole string.

    Let's change the example to use a reluctant quantifier...


    source: yyxxxyxx
    regex: .*?xx



    This will do a reluctant match of any character, and then match "xx", with a call to find(). The ".*?" portion is reluctant and hence will try to match as little as possible, starting with zero characters. However, it must match the two "y" characters, because if it didn't, the "xx" portion of the regex would not match.

    Basically, the ".*?" portion of the regex will match "yy", while the whole regex will match "yyxx", for the find() that starts at index zero. The reluctant portion matches the bare minimum needed to allow the whole regex to match.

    Henry
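The split between the quantified portion and the whole match can be made visible by wrapping that portion in a capturing group. The parentheses are an addition for illustration and do not appear in the patterns above; the class name GreedyVersusReluctant and the helper split() are likewise hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the capturing groups are added only to expose
// what the quantified portion of each regex matched.
class GreedyVersusReluctant {
    // Returns "quantified part | whole match" for the first match, or null if none.
    static String split(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return m.group(1) + " | " + m.group();
    }

    public static void main(String[] args) {
        // Greedy: the group takes as much as it can while still leaving "xx".
        System.out.println(split("(.*)xx", "yyxxxyxx"));  // yyxxxy | yyxxxyxx
        // Reluctant: the group takes only what is needed to reach the first "xx".
        System.out.println(split("(.*?)xx", "yyxxxyxx")); // yy | yyxx
    }
}
```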
     
    Greenhorn
    Posts: 6
    Thank you very much for the detailed explanation. I was getting sick of the zero-length concept, but Henry's posts clarified everything.
     