
Regular expression help

 
Luigi Plinge
Ranch Hand
Posts: 441
I'm after a regular expression that will capture words, defined as

- Letters A-Za-z
- including optional single "." at the end
- bounded by spaces or the beginning / end of input

My attempt so far relies on \b, but this doesn't work because \b turns out to treat the "." as a word boundary, so it would (wrongly) capture "y." from the token "y.o" as a word.

I know that \s represents a space, \A is the start of input, and \Z (or possibly \z?) is the end of input. I also tried a pattern anchored with \A and \Z, but that gives an exception.
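The \b behaviour described above can be reproduced with a short sketch. The pattern below is a hypothetical stand-in, since the original attempt is not shown in the thread; the class and method names are likewise only illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical pattern for illustration only; NOT the pattern from the original post.
    static final Pattern WITH_BOUNDARIES = Pattern.compile("\\b[A-Za-z]+\\.?\\b");

    // Collects every match of the pattern in the input, in order.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // \b matches between "." (non-word character) and "o" (word character),
        // so "y." is accepted even though it is not followed by a space.
        System.out.println(findAll(WITH_BOUNDARIES, "y.o")); // prints [y., o]
    }
}
```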
 
Java Cowboy
Posts: 16084
I tried to look up in the API documentation what \b (a word boundary) means exactly, but it looks like neither the API docs nor the tutorial specifies exactly what it means. So, I'd try something else instead that is defined more clearly.

Luigi Plinge wrote:... but that gives an exception.


When you get an exception, please tell us what exception, with the stack trace if possible - the more specific information you give us, the easier it is to help you.

I tried that line out and got:

What did you mean by \A? That's not a valid escape sequence in regular expressions. (The same for \Z at the end).
 
Luigi Plinge
Ranch Hand
Posts: 441
From the documentation of Pattern

Boundary matchers
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input

 
James Sabre
Ranch Hand
Posts: 781
To my mind there is ambiguity in your requirement statement :-

1) Do you want to ignore anything that is not a 'word' ? For example, if the input is "Hello \\\\\\\\\\\\\\\ World" do you just want to extract "Hello" and "World" ?
2) Do you want to include any terminating '.' as part of the result? For example, if the input is "Hello. World" do you want to extract "Hello." and "World" or do you want to extract "Hello" and "World"?
3) Do you want to ignore the first word in your input if it is not prefixed by a space? For example, if the input is "Hello World" do you just want "World" since "Hello" is not prefixed by a space?
4) What do you want the result of input of "Hello.World" to be?

If it is difficult to create a formal specification then a good approach is to define a set of test cases and the result you expect. Make sure you consider the edge conditions such as those above.


 
Luigi Plinge
Ranch Hand
Posts: 441
James -
1) yes
2) yes
3) no
4) ""

So the following are words :
"ab", "ab."

The following are not words :
"ab..", "a.b", ".ab", "a.b.", "a2b."
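The two lists above can double as a small table of test cases, in line with the earlier suggestion to specify by example. This is only a sketch: it checks single tokens against [A-Za-z]+\.? with matches(), and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class WordTokenCases {
    // A word: one or more ASCII letters, optionally followed by a single ".".
    static final Pattern WORD = Pattern.compile("[A-Za-z]+\\.?");

    // matches() requires the whole token to fit the pattern.
    static boolean isWord(String token) {
        return WORD.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] words = {"ab", "ab."};
        String[] nonWords = {"ab..", "a.b", ".ab", "a.b.", "a2b."};
        for (String token : words) {
            System.out.println(token + " -> " + isWord(token));    // each prints true
        }
        for (String token : nonWords) {
            System.out.println(token + " -> " + isWord(token));    // each prints false
        }
    }
}
```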

I got it working by defining 4 separate patterns, although surely there is a better way?

Actually I think it would be a lot easier just to use the split(" ") method on the input String, and match each substring to "[A-Za-z]+\.?". But I'd be interested to hear if it's possible to form a regular expression that includes the split, and why my 2nd attempt in the OP is not valid.
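The split-then-match idea might look roughly like this. It is a sketch under the assumption that tokens are separated by plain spaces only; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SplitAndMatch {
    // Splits on single spaces and keeps only the tokens that are words.
    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        for (String token : input.split(" ")) {
            // String.matches() tests the whole token, so no anchors are needed.
            if (token.matches("[A-Za-z]+\\.?")) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello. World"));      // prints [Hello., World]
        System.out.println(words("Hello.World"));       // prints []
        System.out.println(words("ab ab.. a.b ab."));   // prints [ab, ab.]
    }
}
```

Consecutive spaces produce empty tokens, which simply fail the match and are dropped.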
 
James Sabre
Ranch Hand
Posts: 781
I think the following covers all your use cases though I can still think of use cases that this will probably not cover :-


You will need to improve the specification for me to spend any time refining this.

The complexity of the requirement means you will need to create a very very very good JUnit test harness.
 
Luigi Plinge
Ranch Hand
Posts: 441
Thanks for that, James. It's not worth spending a lot of time as it isn't for anything important. Just a couple of questions:

1) What's the difference between the "<=" in group 0 and the "=" in group 2?
2) Why is the first group separate while the third group is nested in the second?
3) What are the cases you mention that it wouldn't cover?
 
James Sabre
Ranch Hand
Posts: 781

Luigi Plinge wrote:Thanks for that, James. It's not worth spending a lot of time as it isn't for anything important. Just a couple of questions:

1) What's the difference between the "<=" in group 0 and the "=" in group 2?


You need to read the Javadoc for Pattern; in particular you need the entries on 'look ahead' and 'look behind'.


2) Why is the first group separate while the third group is nested in the second?



Since neither is a capturing group, both the 'look behind' term and the 'look ahead' term can sit either inside or outside the capturing group, BUT in this case the capturing group is not needed anyway. My final version of the regex drops it, so one extracts group() rather than group(1).
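For readers following along: in Pattern syntax, (?<=X) is a positive look-behind and (?=X) is a positive look-ahead, and neither consumes input. The regex from this post is not shown in the thread, so the pattern below is an assumption in the same spirit and not necessarily the one posted; the class and method names are illustrative too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundWords {
    // Assumed pattern (not taken from the thread):
    //   (?<=\s|^)  look-behind: preceded by whitespace or the start of input
    //   (?=\s|$)   look-ahead:  followed by whitespace or the end of input
    static final Pattern WORD = Pattern.compile("(?<=\\s|^)[A-Za-z]+\\.?(?=\\s|$)");

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            // No capturing group is needed: group() is the whole match.
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("Hello. World"));                   // prints [Hello., World]
        System.out.println(extract("Hello.World"));                    // prints []
        System.out.println(extract("ab ab.. a.b .ab a.b. a2b. ab."));  // prints [ab, ab.]
    }
}
```

Because the lookarounds are zero-width, the surrounding spaces are never part of the match, which is what lets a single space serve as the right boundary of one word and the left boundary of the next.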


3) What are the cases you mention that it wouldn't cover?



Can't remember now. At my age I have trouble remembering my own name.
 
Luigi Plinge
Ranch Hand
Posts: 441
Many thanks for your help. Regular expressions suddenly don't seem so difficult.
 