
Regex

 
Richard Teston
Ranch Hand
Posts: 89
Can any of you give me a regular expression pattern for finding words with a period, like "end." but not "end.."? I tried "\\b[a-zA-Z]+\\.?" and "[a-zA-Z]+\\.?", but the words with two periods are still counted in the matches. Does anybody here have any suggestions, or can anyone give me the right pattern? Thanks.
 
Phil Chuang
Ranch Hand
Posts: 251
You could use something like \\b[A-Za-z]+\\.+ which will get
asdf.
asdf..
asdf...
etc. (literally!)
and then just discard the ones with multiple periods?
[ October 10, 2003: Message edited by: Phil Chuang ]
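The match-then-discard idea above can be sketched in Java roughly as follows. This is only a minimal illustration: the class and method names are made up for the example, and the filter simply drops any match that ends in two periods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SinglePeriodFilter {
    // Phil's pattern as a Java literal: a word followed by one or more periods.
    private static final Pattern WORD_WITH_PERIODS =
            Pattern.compile("\\b[A-Za-z]+\\.+");

    // Hypothetical helper: returns only the matches that end in exactly one period.
    static List<String> wordsWithExactlyOnePeriod(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD_WITH_PERIODS.matcher(text);
        while (m.find()) {
            String match = m.group();
            // The greedy \.+ grabs every trailing period, so a match ending
            // in ".." had more than one period and is discarded.
            if (!match.endsWith("..")) {
                result.add(match);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordsWithExactlyOnePeriod("The quick brown fox. jumps.."));
    }
}
```

With the test string used later in this thread, only "fox." should survive the filter.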
 
Richard Teston
Ranch Hand
Posts: 89
..but logically the pattern "\\b[a-zA-Z]+\\.?\\b" should match all words "with one period" and those words "without a period", because everyone knows that the metacharacter "?" makes the preceding element optional, which means the pattern above may match one period or nothing at all. Does this mean that the Java regular expression engine has a bug? Maybe I have the wrong pattern? Please tell me which it is, because when I tried the pattern above using an underscore "_" instead (i.e. "\\b[a-zA-Z]+_?\\b") the match worked fine. Please enlighten me. Thanks.
 
Adrian Yan
Ranch Hand
Posts: 688
Hmm... you can't rely on a quantifier alone in this case, because it doesn't check the next character after your match ends.
Here is the one I tested (I have to apologize, because I ran this in Tcl):
[a-zA-Z]+([.])([^.])
Hope this helps.
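For anyone trying the pattern above in Java, here is a small sketch of how it behaves with Matcher.find(). Two things are worth noticing: the trailing [^.] consumes a real character, so the match includes whatever follows the period, and it cannot match a word whose period is the last character of the input. The second pattern in the sketch is not from this thread; it is an assumed variant that swaps [^.] for a negative lookahead, which checks the next character without consuming it. The class and helper names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumeVsLookahead {
    // Adrian's pattern, unchanged: word, period, then one non-period character.
    static final String CONSUMING = "[a-zA-Z]+([.])([^.])";
    // Assumed variant (not from the thread): word, period, NOT followed by a period.
    static final String LOOKAHEAD = "[a-zA-Z]+[.](?![.])";

    // Hypothetical helper: collects every substring the regex finds in the text.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox. jumps..";
        System.out.println(findAll(CONSUMING, text));   // match includes the space after "fox."
        System.out.println(findAll(LOOKAHEAD, text));   // match stops at the period
        System.out.println(findAll(CONSUMING, "end.")); // nothing left for [^.] to consume
        System.out.println(findAll(LOOKAHEAD, "end.")); // lookahead succeeds at end of input
    }
}
```

Note that Matcher.matches() requires the whole input to match, so either pattern would report no match against a full sentence when tested that way; find() is the call that scans for substrings.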
 
Richard Teston
Ranch Hand
Posts: 89
Thanks Adrian for the pattern, but I'm sorry, that doesn't work either; in fact it didn't find any word in the test string "The quick brown fox. jumps..". This is weird... I tried the pattern "\\." and it matches all the periods in my string. Does anybody have an idea?
 
Phil Chuang
Ranch Hand
Posts: 251
And you'd think this would do it, but I can't get it to work:
"\\b[A-Za-z]+[.][^.]"
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Well there's something icky like
\b[A-Za-z]+\b(?:\s|$|\.(?:[^.])|$)
(Word folloed by space or end of line or [a . followed something other than ., or endo f line])
You can use negative lookahead to simplify:
\b[A-Za-z]+\b(?!\.\.)\.?
(Word not followed by .. but maybe followed by .)
Or also use a possessive quantifier (available in java.util.regex, but not in most other regex libraries) to make it easier to avoid partial words:
\b[A-Za-z]++(?!\.\.)\.?
(Save as previous, really, but maybe a little faster)
All the above are regexes, not Java literals, so double each \ to make a literal.
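As a rough sketch of how the negative-lookahead and possessive versions above look as Java literals, with each \ doubled as described, something like the following could be used. The class name, field names, and findAll helper are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // \b[A-Za-z]+\b(?!\.\.)\.?  (word not followed by "..", optionally followed by ".")
    static final Pattern LOOKAHEAD =
            Pattern.compile("\\b[A-Za-z]+\\b(?!\\.\\.)\\.?");
    // \b[A-Za-z]++(?!\.\.)\.?  (same idea; the possessive ++ never gives letters back)
    static final Pattern POSSESSIVE =
            Pattern.compile("\\b[A-Za-z]++(?!\\.\\.)\\.?");

    // Hypothetical helper: collects every substring the pattern finds in the text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox. jumps..";
        System.out.println(findAll(LOOKAHEAD, text));
        System.out.println(findAll(POSSESSIVE, text));
    }
}
```

On that test string both patterns should find "The", "quick", "brown" and "fox.", and skip "jumps.." entirely. Without either the second \b or the possessive ++, the engine could backtrack and report the partial word "jump" instead.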
 
Mani Ram
Ranch Hand
Posts: 1140
Originally posted by Jim Yingst:

folloed, endo f line, Save as previous

Jim, are you in a Friday evening hurry?
[ October 10, 2003: Message edited by: Mani Ram ]
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
..but logically the pattern "\\b[a-zA-Z]+\\.?\\b" should match all words "with one period"
No, because after the period it looks for a word boundary \b, and it doesn't (usually) find one, because the word already ended before the '.'.
and those words "without a period" because everyone knows that the metacharacter "?" is optional, which means the pattern above may match one period or nothing at all.
It also matches words with two periods, because it can simply ignore the period (the ? means it's not required to take it) and it's still at the word boundary, so the final \b matches.
Does this mean that the Java regular expression engine has a bug?
Dunno if there are other bugs, but this isn't one - it's a problem with your pattern.
The classic reference for learning about regexes is Mastering Regular Expressions by Jeffrey Friedl. Highly recommended. Also, Max Habibi (bartender here at the ranch) has the upcoming Real World Regular Expressions with Java 1.4, which will be worth checking out, focusing more specifically on Java's java.util.regex package. Also useful: if you use Eclipse, try the RegEx Tester plug-in. Or for more traditional regexes (no possessive quantifiers) you can use the Regex Coach. There are probably others; these are the ones I've tried.
[ October 10, 2003: Message edited by: Jim Yingst ]
 
Richard Teston
Ranch Hand
Posts: 89
Thanks for the explanation, Jim. I've been reading Mastering Regular Expressions by Jeffrey Friedl, but I haven't reached the part of the book about how DFA and NFA engines evaluate a regex. I thought my pattern was right, because you can obviously tell what this pattern --> \\b[a-zA-Z]+\\.?\\b really wants, but the regex engine does not interpret it that way. I really must study how a regex engine evaluates a regular expression pattern, though this of course depends on the engine. Anyway, does any of you know what kind of engine the Java regex package uses? Is it (NFA, DFA, DFA (POSIX)...)?
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
It's NFA. As are most other serious modern regex packages. But the issues with your pattern here have little to do with that, and more to do with what a word boundary is. In a string like "foo. " there's a word boundary between 'o' and '.', but not between '.' and ' '. So the only way your regex matches something is by not matching the "\\.?" part (since it's got a ?, this is OK). It can do this even if there is a '.' next in the target string, or even "..". The matcher tries to match your entire regex, and is willing and able to backtrack from an optional match (?) if it needs to, in order to be able to match the final \b.
[ October 11, 2003: Message edited by: Jim Yingst ]
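The backtracking described above can be observed with a short Java sketch. It runs the original pattern against "foo. end.." and, for comparison, the underscore version against "foo_ end__". The underscore case behaves differently because '_' counts as a word character for \b in java.util.regex, so there is a boundary after a single trailing '_' but none between 'd' and '_', which is probably why the underscore test appeared to work. The class and helper names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical helper: collects every substring the regex finds in the text.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The period is given up each time, because no \b follows a '.' here;
        // the matcher backtracks to the boundary between the last letter and '.'.
        System.out.println(findAll("\\b[a-zA-Z]+\\.?\\b", "foo. end.."));

        // With '_' the boundary sits AFTER the underscore, so "foo_" matches whole,
        // and "end__" cannot match at all: there is no boundary inside "d__".
        System.out.println(findAll("\\b[a-zA-Z]+_?\\b", "foo_ end__"));
    }
}
```

The first call should report "foo" and "end" (both without the period), and the second should report only "foo_".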
 
Richard Teston
Ranch Hand
Posts: 89
Thanks again, Jim, for the enlightenment. I was just curious about the Java regex engine, and you are right, it's not the issue with my pattern.
 