
regex positive lookbehind

 
Ranch Hand
Posts: 88
Netbeans IDE Linux Windows
I don't know what is going on here. I have an application with this code to count the number of "the" in a text file. I bring in the whole text file as a String ArrayList and check for a space or beginning of a line to identify a "the" or "The".

Here is the code:



I feed it this String:


The skunk sat on the stump.
I just hit a mother load!
Some mothers sell handmade quilts and others sell chandeliers.
Go down to the beach and build a sand castle!
You said the man in the field was Anderson didn't you?
How did you like the play?
You can be a theologian if you study hard.
Theocracy is a word.
Have you seen Thelma and Louise?
I am the great and powerful The!
How hard could the ball be thrown?
What is the time for all men to come to realize that they need a good woman?
Is Raytheon the aircraft company of the future?


It finds 16 instances of "the" and finds it in the middle of words (which it shouldn't) and on the 6th line, it completely misses a "the".

I attached a screenshot of the words it finds highlighted in yellow. I don't know why it behaves like this.


[Attachment: words.png, "Highlighted Words"]
 
Bartender
Posts: 5167
11
Netbeans IDE Opera Java
That's not what I get, using your regex. Of course, words like Theologian, they, etc. are also matched; to prevent that you need a look-ahead for a space or end-of-input/line.

Prints:
 
Mack Wilmot
Ranch Hand
Posts: 88
Netbeans IDE Linux Windows
Darryl, what version of JDK are you using? I am using v7u10 64bit.

Thanks!

EDIT: Well never mind I just used your code and it works... maybe it has something to do with me running it in a new thread or something...
 
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu

Mack Wilmot wrote: Darryl, what version of JDK are you using? I am using v7u10 64bit.


The fact is, it shouldn't matter.

I suspect, however, that the main problem is that you're overthinking this: you don't need all those complex look-behinds; just a boundary matcher, viz:
String regex = "\\b([Tt]he)\\b";

Winston
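A minimal sketch of the boundary-matcher approach, for anyone following along. The class name `TheCounter`, the method `countThe`, and the shortened sample text are made up for illustration and are not the original poster's code. Note that in Java source the regex `\b` has to be written as `"\\b"`, because a single `"\b"` in a string literal is a backspace character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the original poster's code.
final class TheCounter {
    // \b is a word boundary, so "the" inside "mother" or "theologian" is rejected.
    private static final Pattern THE = Pattern.compile("\\b[Tt]he\\b");

    // Counts whole-word occurrences of "the" or "The" in the given text.
    static int countThe(String text) {
        Matcher m = THE.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // A few lines borrowed from the sample text in the first post.
        String text = String.join("\n",
                "The skunk sat on the stump.",
                "I just hit a mother load!",
                "You can be a theologian if you study hard.",
                "I am the great and powerful The!");
        System.out.println(countThe(text)); // prints 4
    }
}
```

Because the boundary is zero-width, the match also succeeds at the start of the input and before punctuation such as `!`.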
 
Mack Wilmot
Ranch Hand
Posts: 88
Netbeans IDE Linux Windows

Winston Gutkowski wrote:
The fact is, it shouldn't matter.

I suspect, however, that the main problem is that you're overthinking this: you don't need all those complex look-behinds; just a boundary matcher, viz:
String regex = "\\b([Tt]he)\\b";

Winston



Well, it still should be matching all the "the" and not skipping 2 in the middle of the string. Also, your regex gives me similar flawed results. If I just try something simple like matching "\\s[Tt]he\\s" it works and doesn't miss anything. Very perplexing.

Thanks!
 
Bartender
Posts: 3323
86

Well, it still should be matching all the "the" and not skipping 2 in the middle of the string.


The regex works for me, which two is it skipping?

If I just try something simple like matching "\\s[Tt]he\\s" it works


Are you sure? It doesn't work for me: it correctly fails to match the very first "The" and the one followed by a '!'.
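To make the difference concrete, here is a rough comparison of three patterns on two lines of the sample text. The class and constant names are invented for this sketch, and the look-around pattern is only a guess at the "space or beginning of a line" idea from the first post, since the original regex is not shown in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison harness; none of these names come from the thread.
final class PatternComparison {
    // Requires real whitespace characters on both sides (and consumes them).
    static final String WHITESPACE_BOTH_SIDES = "\\s[Tt]he\\s";
    // Assumed look-around variant: line start or whitespace before, whitespace or line end after.
    static final String LOOKAROUND_WHITESPACE = "(?m)(?<=^|\\s)[Tt]he(?=\\s|$)";
    // Zero-width word boundaries on both sides.
    static final String WORD_BOUNDARY = "\\b[Tt]he\\b";

    // Counts how many times the regex matches in the text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "The skunk sat on the stump.\nI am the great and powerful The!";
        System.out.println(count(WHITESPACE_BOTH_SIDES, text)); // 2: misses the leading "The" and "The!"
        System.out.println(count(LOOKAROUND_WHITESPACE, text)); // 3: still misses "The!"
        System.out.println(count(WORD_BOUNDARY, text));         // 4
    }
}
```

The whitespace-based patterns treat punctuation as part of the word, which is why only the boundary matcher finds the final "The!".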
 
Mack Wilmot
Ranch Hand
Posts: 88
Netbeans IDE Linux Windows
I found the problem. I was using a separate find for the highlight, and it was of course finding the next "the" regardless of where it was. The regex itself works and matches every "the" no matter where it is. I have not worked on my coding skills in a very long time, and I was working on old code and forgot what the code I had written before did. lol

EDIT: Thanks Tony!
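For readers who hit the same symptom, one way to avoid it is to highlight using the offsets reported by the `Matcher` itself rather than running a second plain-text search. The sketch below assumes the highlighter was doing something like `String.indexOf("the")`; the actual highlighting code was never posted, so `HighlightRanges` and `matchRanges` are purely illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the poster's real highlighting code is not shown in the thread.
final class HighlightRanges {
    private static final Pattern THE = Pattern.compile("\\b[Tt]he\\b");

    // Returns {start, end} offsets (end exclusive) for every whole-word match.
    static List<int[]> matchRanges(String text) {
        List<int[]> ranges = new ArrayList<>();
        Matcher m = THE.matcher(text);
        while (m.find()) {
            ranges.add(new int[] { m.start(), m.end() });
        }
        return ranges;
    }

    public static void main(String[] args) {
        String text = "I just hit a mother load near the beach.";
        // A plain substring search lands inside "mother".
        System.out.println("indexOf: " + text.indexOf("the")); // prints indexOf: 15
        // The regex offsets point at the standalone word.
        for (int[] r : matchRanges(text)) {
            System.out.println(r[0] + "-" + r[1]); // prints 30-33
        }
    }
}
```

The returned ranges can be passed straight to whatever highlighter is in use, so the count and the highlighted words cannot drift apart.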
 