• Post Reply Bookmark Topic Watch Topic
  • New Topic

Regex experts - Why does this email address pattern matching code never return?  RSS feed

 
Jaikiran Pai
Sheriff
Posts: 10447
227
IntelliJ IDE Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
My skills at regex aren't too great but I have a relatively simple regex which is expected to match email address patterns from a string. The code is pretty straightforward and works fine except when it gets passed a very specific input. While trying to match that input it goes into a never ending loop within the Java regex library. The CPU gets pegged to 100% and even after a few hours of running it never completes. Here's a very simple code which reproduces it:


Here's the thread dump taken when this code never returned after a few minutes:



Multiple thread dumps at regular interval show the same stacktrace. I'm on:



I did find a similar JDK bug reported here http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6882582 but in my case the code never returns back not even with that error.

Does anyone know why this regex pattern with that input would trigger this never ending loop within the JDK classes? Any inputs on how to get this working or do you see something that can be improved in that regex pattern?



 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Jaikiran Pai wrote:My skills at regex aren't too great but I have a relatively simple regex which is expected to match email address patterns from a string. The code is pretty straightforward and works fine except when it gets passed a very specific input.

What input? Just off the top of my head, that '\\w+' looks wrong, but without knowing when it has problems, it's difficult to say.

My advice: Write out, in English, what you expect that regex to do. You can even post it here if you want. There are plenty of regex dweebs here who will tell you if you're wrong.

Winston
 
Jaikiran Pai
Sheriff
Posts: 10447
227
IntelliJ IDE Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Winston Gutkowski wrote:
What input?

The one in the code I posted:



Winston Gutkowski wrote:
My advice: Write out, in English, what you expect that regex to do.

I idea is to match/find a string which could be considered as a email addresses. So:

1) A word (ex: abcd)
2) possibly followed by certain special characters (ex: the underscore character, the dot character)
3) possibly followed by another word again
4) Followed by the @ character
5) Followed by another word (ex: gmail)
6) Followed by the dot character
7) Followed by another word (ex: com)
8) Possibly followed by a combination of dot character and a word (ex: .co.uk)

Putting that all together it would be abcd_xyz@gmail.co.uk (just a random example).

Winston Gutkowski wrote:
There are plenty of regex dweebs here who will tell you if you're wrong ;)

That's what I'm hoping for

 
Ulf Dittmer
Rancher
Posts: 42972
73
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
My guess is that it will terminate eventually, given enough time and memory. Have you tried it with a much shorter input string only composed of underscore characters? This could be a situation where the time required grows exponentially with the length of the input string; see http://ocpsoft.org/regex/how-to-interrupt-a-long-running-infinite-java-regular-expression/
 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Jaikiran Pai wrote:1) A word (ex: abcd)

OK, well the boundary is '\\w', NOT '\\w+'. You also need to define what a "word" is.

2) possibly followed by certain special characters (ex: the underscore character, the dot character)

Too vague. What characters? Precisely. E-mail addresses have a specification, so you need to get those from there.

3) possibly followed by another word again

Again: too vague. Do you specifically mean the characters detailed in Steps 1 and 2?

4) Followed by the @ character

Fair enough. But is an e-mail address allowed to have TWO '@'s?

5) Followed by another word (ex: gmail)

See above.

6) Followed by the dot character

Fine.

7) Followed by another word (ex: com)

Are you sure?

8) Possibly followed by a combination of dot character and a word (ex: .co.uk)

I think you'll find that these last 3 things are covered by the specification for domain names.

Putting that all together it would be abcd_xyz@gmail.co.uk (just a random example).

Right. But, as you're discovering, what you really want are examples that MUSTN'T match.

HIH

Winston
 
Bin Smith
Ranch Hand
Posts: 514
1
Eclipse IDE Java
  • Likes 3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
This is happening because of backtracking of greedy quantifiers in regex.
To avoid backtracking I know two choices : atomic group or possessive quantifiers. Both mean match as many symbols as possible with greedy quantifier and never release(backtrack) symbols.
So I just used atomic group for first greedy plus and problem has gone.
So, regex becomes:

(?>\\w+) means, that it matches as many one of [a-zA-Z0-9_] as possible and never release matched symbol.

This article will be useful to read.
 
Jaikiran Pai
Sheriff
Posts: 10447
227
IntelliJ IDE Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Ulf Dittmer wrote:My guess is that it will terminate eventually, given enough time and memory. Have you tried it with a much shorter input string only composed of underscore characters? This could be a situation where the time required grows exponentially with the length of the input string;

Good point. I hadn't read about the backtracking so wasn't aware of the exponential time possibility. It looks like you are right, since I modified the code to print out the time it takes to do the computation, every time I add one more underscore character in the input:



Here are the timings and the program hasn't yet finished:
 
Jaikiran Pai
Sheriff
Posts: 10447
227
IntelliJ IDE Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Winston Gutkowski wrote:
Jaikiran Pai wrote:1) A word (ex: abcd)

OK, well the boundary is '\\w', NOT '\\w+'.


Winston Gutkowski wrote:
You also need to define what a "word" is.

For this example, as noted in the local part section http://en.wikipedia.org/wiki/Email_address#Local_part I'm considering multiple instances of the characters between a-z or A-Z or the digits 0-9.

Winston Gutkowski wrote:
2) possibly followed by certain special characters (ex: the underscore character, the dot character)

Too vague. What characters? Precisely. E-mail addresses have a specification, so you need to get those from there.

In this example, I'm only considering + - . _ and % which are all allowed by the spec.

Winston Gutkowski wrote:
3) possibly followed by another word again

Again: too vague. Do you specifically mean the characters detailed in Steps 1 and 2?

More precisely, I'm considering multiple instances of the characters between a-z or A-Z or the digits 0-9 here too.

Winston Gutkowski wrote:
5) Followed by another word (ex: gmail)

See above.

In this example, I'm considering multiple instances of the characters between a-z or A-Z or the digits 0-9.

Winston Gutkowski wrote:
7) Followed by another word (ex: com)

Are you sure?

The domain name, allows letters and digits (which is what I vaguely called a word) to be part of the domain. So yes.

 
Jaikiran Pai
Sheriff
Posts: 10447
227
IntelliJ IDE Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Volodymyr Levytskyi wrote:This is happening because of backtracking of greedy quantifiers in regex.
To avoid backtracking I know two choices : atomic group or possessive quantifiers. Both mean match as many symbols as possible with greedy quantifier and never release(backtrack) symbols.
So I just used atomic group for first greedy plus and problem has gone.
So, regex becomes:

(?>\\w+) means, that it matches as many one of [a-zA-Z0-9_] as possible and never release matched symbol.

Thank you for explaining that change, I'll give that a try.

Volodymyr Levytskyi wrote:
This article will be useful to read.


I was actually reading that very article from the one Ulf linked to. I'm going to spend some more time to understand it completely.
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 37496
539
Eclipse IDE Java VI Editor
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Another approach is to simplify the problem. Real humans don't have long strings of underscores in their email. If you see say five or more underscores, you could skip the regexp check completely and just assume it should be invalid/removed from the list.
 
Akaine Harga
Ranch Hand
Posts: 99
Java MyEclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I always used this regexp to validate emails, works just fine: ^[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,})$
 
Ulf Dittmer
Rancher
Posts: 42972
73
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
As I understand it, the problem is not validating emails, it's finding emails in a larger piece of text - a slightly different problem. If the problem is validation, then using something like org.apache.commons.validator.routines.EmailValidator is likely a better approach than trying to roll one's own (which invariably is either too strict or too loose).
 
Jaikiran Pai
Sheriff
Posts: 10447
227
IntelliJ IDE Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Ulf Dittmer wrote:As I understand it, the problem is not validating emails, it's finding emails in a larger piece of test

That's correct.
 
Jaikiran Pai
Sheriff
Posts: 10447
227
IntelliJ IDE Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
The article about regex backtracking helped. While reading that article and co-relating it with the email pattern I had, I realized there was a small but significant oversight in that pattern:



a \w already includes the _ character so there's no need to include it in the second group in that pattern. I changed that pattern to:



and that fixed the backtracking issue which was causing the never ending loop.

 
Don't get me started about those stupid light bulbs.
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!