
Regex confusion: System.out.println(Pattern.matches("(xx)*", "x")); prints "false"

 
Steven Squeers
Ranch Hand
Posts: 62
Hi,
As per the subject, the following code yields the output "false":

System.out.println(Pattern.matches("(xx)*", "x"));

I find this odd and would like to understand why. I would expect (xx)* to mean "match zero or more instances of the string xx", so with zero instances I would expect a match. Could some regex guru please explain what is actually happening here?
Many thanks in advance,
SS
 
Campbell Ritchie
Marshal
Posts: 56534
Maybe that would match half an instance. Please check that the brackets do not form part of the actual text; I think they do not.
 
Richard Tookey
Bartender
Posts: 1166
The "xx" are in a group and the * operates on the group so it will match "" or "xx" or "xxxx" or "xxxxxx" etc i.e. an even number of x and not an odd number.
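The even/odd behaviour described above can be checked with a minimal sketch. The class and method names here are illustrative assumptions, not part of the original post:

```java
import java.util.regex.Pattern;

class EvenXDemo {
    // True when the whole input consists of zero or more repetitions of "xx".
    static boolean matchesPairs(String input) {
        return Pattern.matches("(xx)*", input);
    }

    public static void main(String[] args) {
        String[] inputs = {"", "x", "xx", "xxx", "xxxx"};
        for (String input : inputs) {
            // Inputs with an even number of x print true, odd ones print false.
            System.out.println("\"" + input + "\" -> " + matchesPairs(input));
        }
    }
}
```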
 
Steven Squeers
Ranch Hand
Posts: 62
Richard Tookey wrote:The "xx" are in a group and the * operates on the group so it will match "" or "xx" or "xxxx" or "xxxxxx" etc i.e. an even number of x and not an odd number.


In which case, why does it not match either the beginning or end of "x"? I.e. if you imagine "x" = "" + "x" + "" (or some combination thereof)?

CR, yes, the parentheses are part of the regex syntax, not literals.


Thanks.
 
Richard Tookey
Bartender
Posts: 1166
Steven Squeers wrote:
Richard Tookey wrote:The "xx" are in a group and the * operates on the group so it will match "" or "xx" or "xxxx" or "xxxxxx" etc i.e. an even number of x and not an odd number.


In which case, why does it not match either the beginning or end of "x"? I.e. if you imagine "x" = "" + "x" + "" (or some combination thereof)?


There is an implied ^ at the start of the regex and an implied $ at the end, so the whole string has to match. You need to use Matcher.find() to match just part of the string.
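As a rough sketch of the difference (the class and helper names are assumptions made for illustration), find() does accept the empty match at the start of "x" that the question was asking about, while matches() insists on consuming the whole input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFindDemo {
    static final Pattern PAIRS = Pattern.compile("(xx)*");

    // Whole-input match: the pattern must consume every character.
    static boolean wholeMatch(String input) {
        return PAIRS.matcher(input).matches();
    }

    // Substring search: returns the first match found, or null if there is none.
    static String firstFound(String input) {
        Matcher m = PAIRS.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("x"));                // false
        System.out.println("[" + firstFound("x") + "]");    // [] (an empty match at index 0)
        System.out.println("[" + firstFound("xxx") + "]");  // [xx]
    }
}
```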
 
Steven Squeers
Ranch Hand
Posts: 62
Thanks Richard. I think I'm going to have to just accept that it is the way it is. I still don't understand why.
 
Matthew Brown
Bartender
Posts: 4568
Well, this is the documentation for Pattern.matches(): java.util.regex.Pattern#matches(java.lang.String, java.lang.CharSequence)

It says it behaves the same as java.util.regex.Matcher#matches(), and that says:

Attempts to match the entire region against the pattern.


So it's checking to see if the string exactly matches the pattern, not if it contains the pattern. As Richard said, there's another method find() which looks for matches against substrings.
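To make the "exactly matches" versus "contains" distinction concrete, here is a small sketch. The pattern "xx", the input "axxb" and all names are chosen for illustration only:

```java
import java.util.regex.Pattern;

class WholeVersusContainsDemo {
    // The static convenience method on Pattern.
    static boolean viaPattern(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // The long form that the documentation says behaves the same way.
    static boolean viaMatcher(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // find() succeeds if any part of the input matches.
    static boolean contains(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(viaPattern("xx", "axxb")); // false
        System.out.println(viaMatcher("xx", "axxb")); // false
        System.out.println(contains("xx", "axxb"));   // true
    }
}
```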
 
Richard Tookey
Bartender
Posts: 1166
Steven Squeers wrote:Thanks Richard. I think I'm going to have to just accept that it is the way it is. I still don't understand why.


Don't just accept it; understand it! If you know what ^ and $ do in a regex, then once you know that matches() implies a leading ^ and a trailing $, the result makes perfect sense. So concentrate on what ^ and $ do.
 
Steven Squeers
Ranch Hand
Posts: 62
Richard Tookey wrote:
Steven Squeers wrote:Thanks Richard. I think I'm going to have to just accept that it is the way it is. I still don't understand why.


Don't just accept it; understand it! If you know what ^ and $ do in a regex, then once you know that matches() implies a leading ^ and a trailing $, the result makes perfect sense. So concentrate on what ^ and $ do.


I'm well aware of ^ and $, although I wasn't aware they are implied by matches(). However:



produces output:

false
false
true
true

In other words, * will match ZERO occurrences of the preceding expression when there is a match PRECEDING the preceding expression (e.g. in the case of "az*", the expression preceding the preceding expression is "a"), but it will NOT match zero occurrences of the preceding expression when there is NOT a match of the expression preceding the preceding expression. It also matches zero occurrences of the preceding expression when there is no content in the string to be searched, but that is another use case entirely.

This, to me, seems to be an inconsistency in the behaviour of the metacharacter *, and unfortunately I have yet to have this inconsistency explained to my satisfaction. I can't see what ^ and $ have to do with it. Can you explain why the behaviour of * is different in the code fragments, and/or what ^ has to do with it?

Thanks,
SS
 
Matthew Brown
Bartender
Posts: 4568
But try these:

They'll both return true. It's not the order that matters. The reason your first doesn't work is that if "(xx)*" matches zero occurrences, you've then got an empty string trying to match a single "x". Which it can't. In your second case "z*" will match zero "z"s, but there's still nothing to match the "x". The pattern has to match the entire string. When matching "az*" against "a" it does - the "a" matches the "a" and the "z*" matches the empty string.
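The two calls referred to at the top of this post are missing from the thread as captured. As assumed stand-ins (not the original code), the following sketch puts the starred expression first and lets the rest of the pattern consume the input, which illustrates that order is not what matters:

```java
import java.util.regex.Pattern;

class StarFirstDemo {
    // Thin wrapper so the checks read the same as in the thread.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The starred part matches zero times; the trailing literal consumes the input.
        System.out.println(whole("z*a", "a"));     // true
        System.out.println(whole("(xx)*x", "x"));  // true
        // With nothing left in the pattern to consume "x", the whole-string match fails.
        System.out.println(whole("(xx)*", "x"));   // false
    }
}
```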
 
Steven Squeers
Ranch Hand
Posts: 62
Damn, just lost my edit because I hit backspace when the focus wasn't on the text box...doh!
Anyway, I think I figured it out.
Any regex that consists solely of an expression followed by * implicitly means: match only an empty string, or one or more occurrences of the specified pattern. I think I see what Campbell Ritchie meant now: the expression "(xx)*" can only match "" or "xx" or "xxxx" etc.

Cheers.
 
Steven Squeers
Ranch Hand
Posts: 62
Thanks Matthew, thanks all.
Now I understand.
Cheers
SS
 
Steven Squeers
Ranch Hand
Posts: 62
Richard Tookey wrote:
Steven Squeers wrote:Thanks Richard. I think I'm going to have to just accept that it is the way it is. I still don't understand why.


Don't just accept it; understand it! If you know what ^ and $ do in a regex, then once you know that matches() implies a leading ^ and a trailing $, the result makes perfect sense. So concentrate on what ^ and $ do.


Actually now I realise what you mean and it seems obvious: "(xx)*" --> "^(xx)*$" therefore there's no way that "x" can match that pattern. Thanks.
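One way to see that rewriting, as a sketch with assumed names, is to spell the anchors out and use find(), then compare with matches() on the unanchored pattern:

```java
import java.util.regex.Pattern;

class ExplicitAnchorsDemo {
    static final Pattern ANCHORED = Pattern.compile("^(xx)*$");

    // Anchors written out explicitly, searched with find().
    static boolean anchoredFind(String input) {
        return ANCHORED.matcher(input).find();
    }

    // Unanchored pattern, whole-input match via matches().
    static boolean plainMatches(String input) {
        return Pattern.matches("(xx)*", input);
    }

    public static void main(String[] args) {
        for (String input : new String[] {"", "x", "xx", "xxx"}) {
            System.out.println("\"" + input + "\": find with anchors = "
                    + anchoredFind(input) + ", matches = " + plainMatches(input));
        }
    }
}
```

For these inputs the two approaches agree. The equivalence is only approximate, though: in Java's default mode $ can also match just before a trailing line terminator, whereas matches() requires the entire input to be consumed.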
 