
Regular Expression matching with Java

 
Ranch Hand
Posts: 188
I am trying to use a regular expression in String.matches(), but it doesn't appear to work with the characters '+' or '%'. I am escaping these characters with two backslashes, although I thought I should only use a single backslash to escape these characters (Eclipse won't let me compile with a single backslash, claiming that only \b \t \n \f \r \' \" \\ are valid escape sequences).
For example a price String variable has a value of "+12.55%".
Neither price.matches("\\+") nor price.matches("\\%") will return true, even though both of these should match.
Can anyone see what the error is here? If the trouble is in fact caused by having two backslashes instead of one, how can I get around the restriction in Eclipse?
Thanks in advance for any help.

-James
 
town drunk
( and author)
Sheriff
Posts: 4118
Hi James,
The problem is that String.matches() requires the entire string to match the pattern, not just part of it. Thus: \\+.*\\%
or, to be more exact for your apparent needs:
\\+?\\d+\\.?\\d{0,2}\\%
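A minimal sketch of both points, using the price value from the question. The class name `MatchesDemo` and the helper `wholeMatch` are made up for illustration. It also shows why the double backslash is correct: in Java source the literal "\\+" is the two-character string \+, which the regex engine then reads as an escaped plus sign.

```java
final class MatchesDemo {
    // Hypothetical helper: true only if the regex covers the entire input.
    static boolean wholeMatch(String input, String regex) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        String price = "+12.55%";

        // The source literal "\\+" holds two characters: a backslash and a plus.
        System.out.println("\\+".length());                               // 2

        // Describes only the leading '+', so the whole-string match fails.
        System.out.println(wholeMatch(price, "\\+"));                     // false

        // Both of these describe the string from start to end.
        System.out.println(wholeMatch(price, "\\+.*\\%"));                // true
        System.out.println(wholeMatch(price, "\\+?\\d+\\.?\\d{0,2}\\%")); // true
    }
}
```

The backslash before % is harmless but not required, since % is not a regex metacharacter.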
HTH,
M
[ May 03, 2004: Message edited by: Max Habibi ]
 
James Adams
Ranch Hand
Posts: 188
Thanks for the feedback Max. I was mistakenly under the impression that String.matches() would return true if any match was found, and not just if there is a match of the complete string.
It has been suggested to me that for a match of the regular expression anywhere in the String I should instead use
Pattern.compile("regular expression").matcher("my String").find()
This seems to work as advertised.
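For reference, a small sketch of that suggestion applied to the price string from the first post. The class name `FindDemo` and the helper `containsMatch` are hypothetical; only `Pattern.compile`, `matcher`, and `find` come from the standard library.

```java
import java.util.regex.Pattern;

final class FindDemo {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean containsMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String price = "+12.55%";
        System.out.println(containsMatch(price, "\\+")); // true
        System.out.println(containsMatch(price, "%"));   // true
        // For contrast, matches() still demands the whole string.
        System.out.println(price.matches("\\+"));        // false
    }
}
```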

-James
 
Max Habibi
town drunk
( and author)
Sheriff
Posts: 4118
Originally posted by James Adams:
Thanks for the feedback Max. I was mistakenly under the impression that String.matches() would return true if any match was found, and not just if there is a match of the complete string.
It has been suggested to me that for a match of the regular expression anywhere in the String I should instead use
Pattern.compile("regular expression").matcher("my String").find()
-James

Ugh. I would recommend the previous approach, where you describe the parts of the string leading up to and following the pattern. This could really improve your performance because, depending on how well you know your data, non-matching input can be rejected fairly quickly.
All best,
M
 
Greenhorn
Posts: 7
Using 'find' is the right approach. 'matches' is a kind of silly, misleading shortcut for using the "^" and "$" anchors ("^" = "beginning of string" and "$" = "end of string"). You usually don't want to enumerate everything that could occur before what you are trying to match, and if you do, you can use "^" and "$". It is misleading to say that enumerating everything from the start of the string up to what you are trying to match is more efficient, especially if it could be anything (".*"), in which case it will probably be a tiny bit slower. If there are important parts that need to match before your pattern, then those parts are part of your pattern.
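A rough sketch of that equivalence, assuming a made-up helper named `anchoredFind`. It uses \A and \z instead of "^" and "$" because, by default, "$" in java.util.regex can also match just before a trailing line terminator, which matches() would not accept.

```java
import java.util.regex.Pattern;

final class AnchorDemo {
    // Hypothetical helper: approximates String.matches() by anchoring find().
    // \A and \z mark the start and end of input; the non-capturing group keeps
    // any alternation in the regex inside the anchors.
    static boolean anchoredFind(String input, String regex) {
        return Pattern.compile("\\A(?:" + regex + ")\\z").matcher(input).find();
    }

    public static void main(String[] args) {
        String price = "+12.55%";
        System.out.println(anchoredFind(price, "\\+"));    // false
        System.out.println(anchoredFind(price, "\\+.*%")); // true
        System.out.println(price.matches("\\+.*%"));       // true
    }
}
```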
 
Max Habibi
town drunk
( and author)
Sheriff
Posts: 4118
Originally posted by Kevin Fletcher:
Using 'find' is the right approach. 'matches' is a kind of silly, misleading shortcut for using the "^" and "$" anchors ("^" = "beginning of string" and "$" = "end of string"). You usually don't want to enumerate everything that could occur before what you are trying to match, and if you do, you can use "^" and "$". It is misleading to say that enumerating everything from the start of the string up to what you are trying to match is more efficient.

Actually Kevin, you're mistaken. The most efficient expressions adhere closely to the data: this eliminates noise that might lead the Engine on wild goose chases.
For example, consider the string
"Hello, my name is John Doe"
The most efficient pattern to extract "Doe" from this sentence is along the lines of

trivial, but true. The next most efficient pattern is probably along the lines of

and so on.
Why are these the most efficient paths? For the same reason that you crave specific requirements in your design: they tell the Engine exactly what to do and what it doesn't have to do. When the Engine sees that it needs to find a specific sequence of characters before it attempts any wildcards, it can stop looking as soon as it fails to find those characters. The result is less memory used and fewer CPU cycles spent.
To wit, when the engine considers the candidate String
"Hello, my sister is Tom"
it can stop looking as soon as it finds the first 's' in 'sister', because it was expecting the 'n' in 'name'.
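To make the idea concrete, here is a sketch with an assumed pattern (it is not taken from this thread): a literal prefix followed by one captured word. The names `PrefixDemo`, `SURNAME`, and `surname` are hypothetical. For the second candidate, the attempt is abandoned inside the literal prefix, before the capture group is ever tried.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrefixDemo {
    // Assumed pattern for illustration: literal prefix, then one captured word.
    private static final Pattern SURNAME =
            Pattern.compile("Hello, my name is John (\\w+)");

    // Returns the captured word, or null if the whole input does not match.
    static String surname(String input) {
        Matcher m = SURNAME.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(surname("Hello, my name is John Doe")); // Doe
        System.out.println(surname("Hello, my sister is Tom"));    // null
    }
}
```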
But don't take my word for it: set up a test, and measure the performance in terms of memory and cpu time.
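One crude way to set up such a test is sketched below. Everything here is an assumption for illustration: the class `RegexTiming`, both patterns, and the iteration counts. Wall-clock loops like this are easily distorted by JIT warm-up and garbage collection, so treat the numbers as a rough hint only; a dedicated harness such as JMH would be more trustworthy, and this sketch does not measure memory at all.

```java
import java.util.regex.Pattern;

final class RegexTiming {
    // Accumulates match counts so the timed loop has an observable result.
    static int sink;

    // Runs matches() repeatedly and returns elapsed wall-clock nanoseconds.
    static long timeMatches(Pattern pattern, String input, int iterations) {
        long start = System.nanoTime();
        int hits = 0;
        for (int i = 0; i < iterations; i++) {
            if (pattern.matcher(input).matches()) {
                hits++;
            }
        }
        long elapsed = System.nanoTime() - start;
        sink += hits;
        return elapsed;
    }

    public static void main(String[] args) {
        String candidate = "Hello, my sister is Tom";
        // Assumed patterns: one with a literal prefix, one with a leading wildcard.
        Pattern literalPrefix = Pattern.compile("Hello, my name is John (\\w+)");
        Pattern wildcardPrefix = Pattern.compile(".*John (\\w+)");

        // Warm-up runs, so both patterns are exercised before timing.
        timeMatches(literalPrefix, candidate, 100_000);
        timeMatches(wildcardPrefix, candidate, 100_000);

        System.out.println("literal prefix:  "
                + timeMatches(literalPrefix, candidate, 1_000_000) + " ns");
        System.out.println("wildcard prefix: "
                + timeMatches(wildcardPrefix, candidate, 1_000_000) + " ns");
    }
}
```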
HTH,
M
[ May 06, 2004: Message edited by: Max Habibi ]
 
Kevin Fletcher
Greenhorn
Posts: 7
Well, sure, it is more efficient if you know the exact text from the beginning of the string, but if you know that, why would you be using a regular expression? You should just use substring(). If you only know "kinda" what comes before, then you are going to have to write wildcards anyway and waste time trying to match them. Besides, if you really cared about doing regular expression matching efficiently, you would use a language with real regex support like perl or awk.
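A sketch of the substring alternative, assuming the prefix is known exactly. The class `SubstringDemo`, the constant `PREFIX`, and the helper `afterPrefix` are hypothetical names; the sentence is the one used earlier in the thread.

```java
final class SubstringDemo {
    // Assumed fixed prefix for illustration.
    private static final String PREFIX = "Hello, my name is John ";

    // Returns the text after the known prefix, or null if the prefix is absent.
    static String afterPrefix(String input) {
        return input.startsWith(PREFIX) ? input.substring(PREFIX.length()) : null;
    }

    public static void main(String[] args) {
        System.out.println(afterPrefix("Hello, my name is John Doe")); // Doe
        System.out.println(afterPrefix("Hello, my sister is Tom"));    // null
    }
}
```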
 
Max Habibi
town drunk
( and author)
Sheriff
Posts: 4118
Originally posted by Kevin Fletcher:
Well, sure, it is more efficient if you know the exact text from the beginning of the string, but if you know that, why would you be using a regular expression? You should just use substring().

Not really: consider the very realistic example below


Originally posted by Kevin Fletcher:
If you only know "kinda" what comes before, then you are going to have to write wildcards anyway and waste time trying to match them.

My point is that you want as many 'real' characters in your expressions as you can get.

Originally posted by Kevin Fletcher:
Besides, if you really cared about doing regular expression matching efficiently, you would use a language with real regex support like perl or awk.

This is tricky, but I'll go out on a limb and say that Java has 'real' regular expression support. By and large I prefer it to perl and/or awk. It's clearer and more intuitive to me, and I enjoy using Java's convention for conditional logic over either perl's or awk's.
Can you achieve as much with a single line of obscure code? Probably not. However, that's never been my criterion for goodness.
All best,
M
 
Sheriff
Posts: 7023
Originally posted by Kevin Fletcher:
if you really cared about doing regular expression matching efficiently, you would use a language with real regex support like perl or awk.


Yeah, right.

http://dada.perl.it/shootout/regexmatch.html

If your concern is CPU use, then for regular expression matching perl was barely faster than Java, while two flavors of awk were about four and seven times slower.
[ May 29, 2004: Message edited by: Dirk Schreckmann ]
 
It is sorta covered in the JavaRanch Style Guide.