Win a copy of Functional Reactive Programming this week in the Other Languages forum!
  • Post Reply
  • Bookmark Topic Watch Topic
  • New Topic

Java regular expression optimization - help needed

 
dpkcv cv
Greenhorn
Posts: 5
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi All,
I am new to Java Regular expression. We are using a pattern for matching a string. We are using this for validating a text field and it meets our requirements. But there is a performance issue in the matching.
pattern :([a-zA-Z0-9]+[ ]?(([_\-][a-zA-Z0-9 ])*)?[_\-]?)+

1. Input text should start with a-zA-Z0-9.
2. Space(single) is allowed between words
3. "_" and "-" are allowed but cannot be consecutive.

Our problem is, for certain input strings the CPU time goes high and causes hanging the threads. Also we get exceptions. Can anyone please
help me to optimize the Pattern or suggest a new pattern to solve my issue.

Exception details
============================================
Hung thread details, all the same:
[9/28/11 11:40:07:320 CDT] 00000003 ThreadMonitor W WSVR0605W: Thread "WebContainer : 26" (0000004f) has been active for 709755 mi
lliseconds and may be hung. There is/are 1 thread(s) in total in the server that may be hung.
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3938)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3801)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3801)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4323)
at java.util.regex.Pattern$Prolog.match(Pattern.java:4263)
at java.util.regex.Matcher.match(Matcher.java:1139)
at java.util.regex.Matcher.matches(Matcher.java:514)

Thanks
Deepak
 
Gupta Tarun
Greenhorn
Posts: 22
Hibernate MyEclipse IDE Postgres Database
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
How about doing a negative check on "_ and - cannot be consecutive" i.e. should not contain '__' and '_-' and '--' and '-_'.
 
Winston Gutkowski
Bartender
Pie
Posts: 10527
64
Eclipse IDE Hibernate Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Our problem is, for certain input strings the CPU time goes high and causes hanging the threads. Also we get exceptions.

That's not very helpful. When do you get high CPU? It would also be nice to see the relevant piece of code.

Also, as far as I can see, the pattern will allow something like "a_ - _ _ -". Is that what you want?

My suggestion: Write down the syntax rules in full in English; then write a pattern to match. Regexes are hard enough without trying to reverse-engineer them. It will also help anyone else that has to look at your code to understand your intentions.

I have no idea whether it's part of your problem, but the "+[ ]?(([_\-][a-zA-Z0-9 ])*)?[_\-]?)" looks rather messy to me.

From your own definition, I would say that unless a single "_" or "-" is allowed on its own, delimited by spaces, a word has the pattern:
"[-_]?([a-zA-Z0-9]+[-_]?)+"
By inference, the overall pattern would then be:
"[a-zA-Z0-9]([ ]?[-_]?([a-zA-Z0-9]+[-_]?)+)+"

But you need to test it well. Just as an added point, I tend to avoid the '*' qualifier if I can. I also try to avoid '\', because they drive me nuts in Java strings.

Winston

PS: You might also be able to improve the performance by making your groups non-capturing.
 
Rob Spoor
Sheriff
Pie
Posts: 20667
65
Chrome Eclipse IDE Java Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
If you look at that exception stack trace you can see that there's an infinite loop in there somewhere. Without seeing the input that caused this loop we can't really tell you why this infinite loop is there.
 
Winston Gutkowski
Bartender
Pie
Posts: 10527
64
Eclipse IDE Hibernate Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I think I just worked out your problem, and if I'm right it does have to do with the '*' qualifier (among other things).
([_\-][a-zA-Z0-9 ])*)?[_\-]?
is a 0-length pattern, so:
(([_\-][a-zA-Z0-9 ])*)?[_\-]?)+
says "find me 1 or more of that 0-length pattern I just gave you".

Unsurprising, therefore, that it might have problems.

Winston
 
dpkcv cv
Greenhorn
Posts: 5
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
It would also be nice to see the relevant piece of code.


Thanks for your reply.This is the code used for validating the string.


Also, as far as I can see, the pattern will allow something like "a_ - _ _ -". Is that what you want?


yes this is a valid input.


 
Maneesh Godbole
Saloon Keeper
Posts: 11178
15
Android Eclipse IDE Google Web Toolkit Java Mac Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
dpkcv cv wrote:
Please check your private messages for an important administrative matter.
 
Maneesh Godbole
Saloon Keeper
Posts: 11178
15
Android Eclipse IDE Google Web Toolkit Java Mac Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
In future while posting code, please UseCodeTags. I had added them to your post. As you can the code is much more easier to read and understand.
Also, you can make use of the "quote" button.
 
dpkcv cv
Greenhorn
Posts: 5
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Gupta Tarun wrote:How about doing a negative check on "_ and - cannot be consecutive" i.e. should not contain '__' and '_-' and '--' and '-_'.



1. First word should contain only [0-9a-zA-Z]
2. Second word can contain any of these characters[0-9a-zA-Z],space,"_" and "-".But same characters(space,"_" and "-") cannot be consicutive.For examples "__","--"," ","_-" are invalid.But "- _" is valid.
3.
i) Hello world_-(invalid)
ii) Hello world_ -(valid)
iii) Hello world--(invalid)
iv) Hello world- -(valid)
v) Hello world_ _(valid)
vi) Hello world(invalid) - 2 spaces
 
Rob Spoor
Sheriff
Pie
Posts: 20667
65
Chrome Eclipse IDE Java Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
"dpkcv cv", I strongly urge you to read your private messages as Maneesh has suggested.
 
Winston Gutkowski
Bartender
Pie
Posts: 10527
64
Eclipse IDE Hibernate Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
dpkcv cv wrote:1. First word should contain only [0-9a-zA-Z]
2. Second word can contain any of these characters[0-9a-zA-Z],space,"_" and "-".But same characters(space,"_" and "-") cannot be consicutive.For examples "__","--"," ","_-" are invalid.But "- _" is valid....

Ah. It would have been useful to have these rules at the start.

Regexes are powerful, but they're not omnipotent. It's possible you could create one to cover all your needs, but it's likely to be awfully arcane.
Why not just break down the rules, viz something like:A bit more code than yours, but much clearer than some horrendous regex I reckon.

Winston

Edit: Oops. Keep forgetting that Java matches the entire string by default .
 
  • Post Reply
  • Bookmark Topic Watch Topic
  • New Topic