Seemingly simple regex making my head hurt

 
Ranch Hand
Posts: 268
Hey all,

I have a seemingly simple regex problem that has me momentarily stumped. While I'm waiting for the aspirin to kick in, I was hoping one of y'all (that's JavaRanch-speak) might want to take a crack at it.

I'm trying to write a regex that matches a String according to the following rules:

1. Has one or more of the characters: a-zA-Z0-9
2. Can contain a dash, but NOT as the first or last character in the String.

So examples that would match:
abc
1
1ab4
a-bc
ab-c
a-----d

Examples that would NOT match:
-abc
abc-
a--b-11-

(The dashes cannot appear in the first or last position.)

I cannot depend upon the beginning and end of line markers (^ or $) because I'm planning on defining this regex as a constant and using this constant as part of a larger regex.

So, here's what I've got:



The problem: this regex requires that the matching String be at least two characters long. My first thought was to just put a question mark after the last character class, but then it would match Strings like "abc-" which end in a dash. Not acceptable.

Thanks all!
sev
 
town drunk
( and author)
Posts: 4118

String regex = "(?=[^-])[\\p{Alnum}]*-*[\\p{Alnum}]+";


HTH,
M
 
Ranch Hand
Posts: 1365
One of those days, eh? This is a case where playing with lookbehinds/lookaheads and other regex gadgets might be fun, but there's a simple solution that works and should be understandable by anyone familiar with basic regex syntax. I hope UBB code doesn't try to interpret this:

[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]

When you have a regex that handles most cases and misses a few, it's always an option to fill in the missing cases with an entirely separate regex. It gets a little more funky when you have a regex that matches too many cases.
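For anyone who wants to try the alternation above against the examples from the first post, here is a minimal sketch. The class name DashTokenDemo and the helper isToken are made up for illustration; it uses matches(), so the whole string has to fit the pattern.

```java
import java.util.regex.Pattern;

public class DashTokenDemo {
    // The alternation from the post above: either a single alphanumeric,
    // or alphanumerics at both ends with any mix of alphanumerics and
    // dashes in between.
    static final String TOKEN = "[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]";
    static final Pattern TOKEN_PATTERN = Pattern.compile(TOKEN);

    // Hypothetical helper: true only if the ENTIRE string fits the pattern.
    static boolean isToken(String s) {
        return TOKEN_PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Samples taken from the original question.
        String[] samples = {
            "abc", "1", "1ab4", "a-bc", "ab-c", "a-----d",
            "-abc", "abc-", "a--b-11-"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isToken(s));
        }
    }
}
```

One thing to keep in mind if this ends up as a constant inside a larger regex: the | has the lowest precedence, so the constant should be wrapped in a group, e.g. "(?:" + TOKEN + ")", before concatenating it with anything else.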
 
Max Habibi
town drunk
( and author)
Posts: 4118
I prefer David's answer, because it's less complex. It's slightly less efficient, but I'm guessing it would be hard to measure the difference.

The only adjustment I would make, and this is stylistic, is the following



HTH,
M
[ June 08, 2004: Message edited by: Max Habibi ]
 
Wanderer
Posts: 18671
Gotta watch those pesky backslashes, eh Max?

I suspect efficiency here also depends on the input - are successes or failures going to be more common? Are single-character inputs common, or rare?

There's also a difference in behavior between the two solutions. Max's original solution will not allow 123-45-67, while David's will. It's not 100% clear which of these is intended according to the instructions (which require that we allow "a dash", but say nothing about multiple dashes, except there's an example that shows multiple consecutive dashes). Maybe it doesn't matter. But my guess is David's solution is correct. I'd probably formulate it as

"[a-zA-Z0-9]+(\\-+[a-zA-Z0-9]+)*"

or

"[a-zA-Z0-9]++(?:\\-++[a-zA-Z0-9]++)*+"

or

"\\p{Alnum}++(?:\\-++\\p{Alnum}++)*+"

I think the first is probably most understandable to people now, but the latter forms offer improvements I'd like to see more commonly used. The possessive forms ++ and *+ aren't strictly necessary, and may be changed to + and * respectively, but I think in this case they lead to the fastest solution possible by eliminating unnecessary backtracking. That also helps readability, IMO, assuming the reader is familiar with possessive forms.
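To make the behavioral difference concrete, here is a rough sketch that runs Max's original pattern and the \p{Alnum} possessive form against 123-45-67, and then drops the possessive form into a larger regex as a constant, which is what the original question was after. The class name TokenComposeDemo, the helpers, and the token=token format are all invented for the example; they are not from anyone's actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenComposeDemo {
    // Max's original pattern from earlier in the thread.
    static final String MAX = "(?=[^-])[\\p{Alnum}]*-*[\\p{Alnum}]+";

    // The possessive \p{Alnum} form: runs of alphanumerics separated by
    // runs of dashes, with no backtracking into either kind of run.
    static final String TOKEN = "\\p{Alnum}++(?:\\-++\\p{Alnum}++)*+";

    // Hypothetical larger regex: two tokens joined by '='. TOKEN has no
    // top-level alternation, so it can be concatenated as-is; the
    // parentheses here are only for capturing.
    static final Pattern PAIR = Pattern.compile("(" + TOKEN + ")=(" + TOKEN + ")");

    // True only if the whole string fits the regex.
    static boolean fullMatch(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    // Returns {left, right} for a well-formed pair, or null otherwise.
    static String[] splitPair(String s) {
        Matcher m = PAIR.matcher(s);
        return m.matches() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        System.out.println(fullMatch(MAX, "123-45-67"));   // Max's pattern
        System.out.println(fullMatch(TOKEN, "123-45-67")); // possessive form
        String[] pair = splitPair("foo-bar=baz-1");
        System.out.println(pair == null ? "no match" : pair[0] + " / " + pair[1]);
    }
}
```

Max's pattern allows only one run of dashes, which is why it rejects 123-45-67 under matches(), while the possessive form accepts any number of dash runs as long as each one is followed by at least one alphanumeric.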
[ June 08, 2004: Message edited by: Jim Yingst ]
 
Ranch Hand
Posts: 262
How is the regex going to be used? If you're using the matches() method (that is, the value to be matched makes up the entire target string), then either David's or Jim's regex will work. But if you use Matcher#find() to pluck the values out of a longer string, David's regex will stop after matching a single letter or digit.

Also, given a target string like "test -123-456- test", both regexes will ignore the leading and trailing hyphens and return 123-456. If you don't want that to happen, you can use negative lookbehind to prevent it:

Note that the possessive quantifier is not there just for efficiency's sake; if the last thing the regex engine sees is a hyphen, we don't want it to back off and match something shorter, we want it to fail. You could also use negative lookahead for the trailing hyphen:

Here, again, if we weren't using possessive quantifiers, the lookahead would have to do more work:

Again, all this applies only if you're using find() rather than matches().
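Here is one way the find() behavior described above could be sketched. The guarded patterns below are an assumption about how the lookbehind and lookahead might be written, not necessarily the regexes originally posted; the class name HyphenGuardDemo and the helper findAll are likewise invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HyphenGuardDemo {
    // David's alternation: with find(), the first alternative wins, so it
    // matches one character at a time.
    static final String DAVID = "[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]";

    // Jim's possessive form, with no guards.
    static final String PLAIN = "[a-zA-Z0-9]++(?:-++[a-zA-Z0-9]++)*+";

    // Assumed guards: the match may not start right after a hyphen (or in
    // the middle of a token), and may not be followed by a hyphen.
    static final String GUARDED = "(?<![a-zA-Z0-9-])" + PLAIN + "(?!-)";

    // Same guards with ordinary greedy quantifiers, for contrast.
    static final String GUARDED_GREEDY =
        "(?<![a-zA-Z0-9-])[a-zA-Z0-9]+(?:-+[a-zA-Z0-9]+)*(?!-)";

    // Collects every match that find() reports, in order.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String target = "test -123-456- test";
        System.out.println(findAll(DAVID, "abc"));
        System.out.println(findAll(PLAIN, target));
        System.out.println(findAll(GUARDED, target));
        // The greedy version backs off from "abc" to "ab" to satisfy (?!-).
        System.out.println(findAll(GUARDED_GREEDY, "abc-"));
        System.out.println(findAll(GUARDED, "abc-"));
    }
}
```

The last two lines show why the possessive quantifiers matter here: the greedy version gives back the final character so that the lookahead no longer sees a hyphen, and reports a shortened match instead of failing.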
 