
Regex explanation

 
Ranch Hand
Hi all,
I am confused about this regular expression:

(?=^[\w|\s])[\w|\s]

What it does is match the first character of any string.
I am having difficulty understanding the behaviour.
Please help me out.

Thanks,
Sudhanshu
 
Greenhorn
Thanks, Sudhanshu, for an interesting question.

Let us break down your regex:
(?=^[\w|\s])[\w|\s]

1. (x|y) Find any of the alternatives specified within the brackets
2. ^n Matches any string with n at the beginning of it
3. ?=n Matches any string that is followed by a specific string n.
4. \w Find a word character
5. \s Find a whitespace character

Let's start with 2 and apply it to our expression:
^[\w|\s] - This means we need word or a space at the start of the line
Let us apply #3 to this:
?=^[\w|\s] - This means we need to match any string which is followed by word or a space at the start of the line
Applying 1 to this gives:
(?=^[\w|\s]) - Since the brackets contain only one value as defined above, it can be any word or space at the start of the line

Hope this clarifies.




 
Sheriff

Anuj Sharma R wrote: 1. (x|y) Find any of the alternatives specified within the brackets


True, but the regex contains [x|y]. That's a character class, and | has no special meaning inside character classes. In other words, it means one character matching x, | or y.

3. ?=n Matches any string that is followed by a specific string n.


? has more than one meaning: it is a quantifier (zero or one) if it follows another part of a regular expression, or it is part of a lookahead / lookbehind if it directly follows a (. The latter is the case here, so (?=x) means that x must match at the current position, but it will not be part of the match itself.

^[\w|\s] - This means we need word or a space at the start of the line


Or a | itself.

This regex is unnecessarily complex. The regex without the positive lookahead says that the character must be a word character, whitespace or |. The lookahead itself says that from the start of the match, there must be a word character, | or whitespace, but only at the start of the string. The entire regex can be shortened to ^[\w|\s].
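
As a quick check of the points above, here is a minimal sketch that runs the original regex and the shortened one with find(). The class name, the helper method and the sample inputs are assumptions for illustration, not something from the original question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // Returns the first substring of input matched by regex, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String original = "(?=^[\\w|\\s])[\\w|\\s]"; // regex from the question
        String shortened = "^[\\w|\\s]";             // suggested equivalent

        // Sample inputs: a word character, a literal |, a space, and a character outside the class.
        for (String input : new String[] {"abc", "|abc", " abc", "-abc"}) {
            System.out.println("[" + input + "] original=" + firstMatch(original, input)
                    + " shortened=" + firstMatch(shortened, input));
        }
    }
}
```

Both patterns should report the same first match for every input, including the one that starts with a literal |, and no match for the input that starts with -.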
 
Bartender

Anuj Sharma R wrote:
Hope this clarifies.



Sorry, but it does not clarify, because your explanation is wrong. Yes, (x|y) denotes alternation, but [x|y] is being used, and this does not denote alternation but a character set that actually contains a '|'! This means that if used with matches() the regex will match an input consisting of a single word character, whitespace character or '|' character, and if used with find() it will match the first character of any input starting with a word character, whitespace character or '|' character.
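
The matches() versus find() distinction can be sketched as follows. This is only an illustration; the class name, method names and sample inputs are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFind {
    // Regex from the question.
    static final Pattern P = Pattern.compile("(?=^[\\w|\\s])[\\w|\\s]");

    // True only if the pattern matches the entire input.
    static boolean matchesWhole(String input) {
        return P.matcher(input).matches();
    }

    // Returns the first substring matched anywhere in the input, or null.
    static String findFirst(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("|"));  // one character from the class
        System.out.println(matchesWhole("ab")); // two characters, so the whole input cannot match
        System.out.println(findFirst("ab"));    // find() only needs a match at the start
        System.out.println(findFirst("-ab"));   // first character is outside the class
    }
}
```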

I suspect that whoever wrote that regex did not have a good understanding of regular expressions, since the lookahead would seem to be redundant.

edit - too slow.
 
Anuj Sharma R
Greenhorn
Thanks, Rob and Richard, for clarifying this further. I agree this regex is unnecessarily complex.
 
Ranch Hand
The behavior of this regex depends on the MULTILINE mode of java.util.regex.Pattern.

When MULTILINE is used, the caret symbol ^ denotes the start of each line. When MULTILINE is not used, ^ denotes the start of the entire input string passed to the Pattern.matcher() method.

To see this difference in practice, put this code into a main method:
The result is:

Next, try removing Pattern.MULTILINE from the Pattern.compile call and you will see that the result is

because the caret ^ was treated as the start of the entire input.
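
A minimal sketch of the difference described above follows. It is not the poster's original code; the class name, the helper method and the three-line input string are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // Regex from the question.
    static final String REGEX = "(?=^[\\w|\\s])[\\w|\\s]";

    // Collects every match of regex in input, compiled with the given flags (0 means no flags).
    static List<String> findAll(String regex, int flags, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "one\ntwo\nthree"; // hypothetical three-line input

        // With MULTILINE, ^ matches at the start of every line.
        System.out.println(findAll(REGEX, Pattern.MULTILINE, input)); // [o, t, t]

        // Without MULTILINE, ^ matches only at the start of the whole input.
        System.out.println(findAll(REGEX, 0, input)); // [o]
    }
}
```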

To learn more about regex, you may find my tutorial useful.
 
Sudhanshu Mishra
Ranch Hand
Thanks, all, for the replies.
I was actually working on a problem statement, which is: "hide the first two characters of a string of length 4"
Input: abcd
Output : **cd

I came up with a regex
(?=^[\w]{4}$)[\w](?![\w]{0,2}$)

This first checks whether the length is 4, then checks whether a letter is followed by a string of length greater than 2.
But what it actually produces is *bcd.
I guess that the next time it tries the regex, it fails because it is no longer at the start of the string.
But I am not sure what a workaround could be.
Any help is much appreciated.

Thanks in advance
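
One possible regex-only workaround, sketched under assumptions: the masking is done with String.replaceAll, and strings that are not exactly four word characters are left unchanged (the problem statement does not say what should happen to them). The class and method names are hypothetical. The first method reproduces the *bcd result: the lookahead contains ^, so the pattern can only ever match at position 0 and only one character gets replaced. The second method matches both leading characters in a single match instead.

```java
class MaskDemo {
    // The regex from the post: the leading lookahead is anchored with ^,
    // so only the character at position 0 can ever match.
    static String maskOriginal(String s) {
        return s.replaceAll("(?=^[\\w]{4}$)[\\w](?![\\w]{0,2}$)", "*");
    }

    // Alternative: match the first two word characters in one go,
    // but only when exactly two more word characters follow up to the end.
    static String maskFirstTwo(String s) {
        return s.replaceAll("^\\w{2}(?=\\w{2}$)", "**");
    }

    public static void main(String[] args) {
        System.out.println(maskOriginal("abcd"));  // *bcd
        System.out.println(maskFirstTwo("abcd"));  // **cd
        System.out.println(maskFirstTwo("abc"));   // abc (unchanged, length is not 4)
        System.out.println(maskFirstTwo("abcde")); // abcde (unchanged, length is not 4)
    }
}
```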
 
Rob Spoor
Sheriff
Why not simply use String.length and String.substring?
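
A minimal sketch of that suggestion, assuming that strings whose length is not 4 should be returned unchanged (the problem statement does not cover that case). The class and method names are hypothetical.

```java
class SubstringMask {
    // Hides the first two characters of a four-character string.
    static String maskFirstTwo(String s) {
        if (s.length() != 4) {
            return s; // assumption: other lengths are left as they are
        }
        return "**" + s.substring(2);
    }

    public static void main(String[] args) {
        System.out.println(maskFirstTwo("abcd")); // **cd
        System.out.println(maskFirstTwo("abc"));  // abc
    }
}
```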
 
Sudhanshu Mishra
Ranch Hand
Design constraints.
I have to do it by regex only.
 
Rob Spoor
Sheriff
That's a stupid constraint. Using two simple methods from String is easier to write, easier to understand and probably more efficient as well.
 
Bartender

Rob Spoor wrote: That's a stupid constraint.


Agreed. Unless it's for a CS class or exam.

@Sudhanshu: regexes are great, but they're NOT good for everything; and they're generally SLOWER than traditional logic.

As far as your original one is concerned, if you simply want the first character, then "^." will do the trick. And if you want to find the first non-space character, then "^[\s]*." (with appropriate brackets if you want to capture it).

But, I'm in complete agreement with Rob: Do it the way that's easiest for other people to understand. Regexes are arcane enough without adding a bunch of unnecessary logic to them.

Winston
 
Bin Smith
Ranch Hand

And if you want to find the first non-space character, then "^[\s]*."


For non-space characters we use \S but for space characters - \s.
 
Winston Gutkowski
Bartender

Volodymyr Levytskyi wrote: For non-space characters we use \S but for space characters - \s.


Yes, but if you want to find the first non-space character, you first have to eliminate all the leading space ones.

Winston
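
Both points can be combined in one pattern: skip the leading whitespace with \s*, then capture the first non-space character with (\S). The sketch below is only an illustration, and the class and method names are assumptions. Note that "^[\s]*(.)" behaves differently on an input that is all spaces: \s* backtracks and the dot ends up capturing a space, whereas (\S) simply finds no match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstNonSpace {
    // Returns capture group 1 of the first match of regex in input, or null if there is no match.
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // First non-whitespace character of the input, or null if there is none.
    static String firstNonSpace(String input) {
        return capture("^\\s*(\\S)", input);
    }

    public static void main(String[] args) {
        System.out.println(firstNonSpace("  x y"));                 // x
        System.out.println(firstNonSpace("   "));                   // null
        System.out.println("[" + capture("^[\\s]*(.)", "   ") + "]"); // [ ] because the dot captures a space
    }
}
```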