regex positive lookbehind

 
Mack Wilmot
Ranch Hand
Posts: 88
Linux Netbeans IDE Windows
I don't know what is going on here. I have an application with this code to count the occurrences of "the" in a text file. I read the whole text file into an ArrayList of Strings and check for a space or the beginning of a line to identify a "the" or "The".

Here is the code:



I feed it this String:


The skunk sat on the stump.
I just hit a mother load!
Some mothers sell handmade quilts and others sell chandeliers.
Go down to the beach and build a sand castle!
You said the man in the field was Anderson didn't you?
How did you like the play?
You can be a theologian if you study hard.
Theocracy is a word.
Have you seen Thelma and Louise?
I am the great and powerful The!
How hard could the ball be thrown?
What is the time for all men to come to realize that they need a good woman?
Is Raytheon the aircraft company of the future?


It finds 16 instances of "the", including some in the middle of words (which it shouldn't), and on the 6th line it completely misses a "the".

I attached a screenshot of the words it finds highlighted in yellow. I don't know why it behaves like this.


[Attachment: words.png, captioned "Highlighted Words"]
 
Darryl Burke
Bartender
Posts: 5148
11
Java Netbeans IDE Opera
That's not what I get using your regex. Of course, words like theologian, they, etc. are also matched; to prevent that you need a look-ahead for a space or end-of-input/line.
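A minimal sketch of the look-ahead suggestion. The pattern originally posted is not preserved in this thread, so `BEHIND_ONLY` is only an assumed stand-in (a look-behind for line start or whitespace), and the class and constant names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundCount {
    // Hypothetical stand-in for the original pattern: look-behind only.
    static final Pattern BEHIND_ONLY =
            Pattern.compile("(?<=^|\\s)[Tt]he", Pattern.MULTILINE);

    // Same look-behind plus a look-ahead for whitespace or end of line.
    static final Pattern BEHIND_AND_AHEAD =
            Pattern.compile("(?<=^|\\s)[Tt]he(?=\\s|$)", Pattern.MULTILINE);

    // Counts non-overlapping matches of the pattern in the text.
    static int count(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "The skunk sat on the stump.\n"
                + "Theocracy is a word.\n"
                + "Have you seen Thelma and Louise?\n"
                + "I am the great and powerful The!";
        System.out.println(count(BEHIND_ONLY, text));      // prints 6
        System.out.println(count(BEHIND_AND_AHEAD, text)); // prints 3
    }
}
```

With the look-ahead, "Theocracy" and "Thelma" drop out, but so does "The!", because '!' is neither whitespace nor end of line. Punctuation would have to be added to the look-ahead to catch that case.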

Prints:
 
Mack Wilmot
Ranch Hand
Posts: 88
Linux Netbeans IDE Windows
Darryl, what version of JDK are you using? I am using v7u10 64bit.

Thanks!

EDIT: Well never mind I just used your code and it works... maybe it has something to do with me running it in a new thread or something...
 
Winston Gutkowski
Bartender
Pie
Posts: 10527
64
Eclipse IDE Hibernate Ubuntu
Mack Wilmot wrote: Darryl, what version of JDK are you using? I am using v7u10 64bit.

The fact is, it shouldn't matter.

I suspect, however, that the main problem is that you're overthinking this: you don't need all those complex look-behinds; just a boundary matcher, viz:
String regex = "\\b([Tt]he)\\b";
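A minimal sketch of counting with the boundary matcher, written with doubled backslashes because a single `\b` inside a Java string literal is a backspace character. The class name and the shortened sample text are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryCount {
    // Whole-word "the" or "The"; \\b in the literal becomes \b in the regex.
    static final Pattern THE = Pattern.compile("\\b[Tt]he\\b");

    // Counts whole-word occurrences in the text.
    static int count(String text) {
        Matcher m = THE.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "The skunk sat on the stump.\n"
                + "You can be a theologian if you study hard.\n"
                + "I am the great and powerful The!\n"
                + "Is Raytheon the aircraft company of the future?";
        System.out.println(count(text)); // prints 6
    }
}
```

The boundary matcher skips "theologian" and "Raytheon" and still catches "The!" and the "The" at the very start, with no look-behind or look-ahead needed.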

Winston
 
Mack Wilmot
Ranch Hand
Posts: 88
Linux Netbeans IDE Windows
Winston Gutkowski wrote:
The fact is, it shouldn't matter.

I suspect, however, that the main problem is that you're overthinking this: you don't need all those complex look-behinds; just a boundary matcher, viz:
String regex = "\\b([Tt]he)\\b";

Winston


Well, it still should be matching all the "the" and not skipping 2 in the middle of the string. Also, your regex gives me similarly flawed results. If I just try something simple like matching "\\s[Tt]he\\s" it works and doesn't miss anything. Very perplexing.

Thanks!
 
Tony Docherty
Bartender
Posts: 2989
59
Well, it still should be matching all the "the" and not skipping 2 in the middle of the string.

The regex works for me. Which two is it skipping?

If I just try something simple like matching "\\s[Tt]he\\s" it works

Are you sure? It doesn't work for me: it correctly fails to match the very first "The" and the one followed by a '!'.
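A small sketch that checks this against two lines of the sample text. The class name is hypothetical; the pattern is the one quoted above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpaceDelimitedCount {
    // Requires a whitespace character on BOTH sides of the word.
    static final Pattern SPACED = Pattern.compile("\\s[Tt]he\\s");

    // Counts matches of the whitespace-delimited pattern.
    static int count(String text) {
        Matcher m = SPACED.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Each line contains two whole-word occurrences, but only one is matched.
        System.out.println(count("The skunk sat on the stump."));      // prints 1
        System.out.println(count("I am the great and powerful The!")); // prints 1
    }
}
```

The leading "The" has no whitespace before it and "The!" has none after it, so both are skipped.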
 
Mack Wilmot
Ranch Hand
Posts: 88
Linux Netbeans IDE Windows
I found the problem. I was using a separate find for the highlighting, and it was of course finding the next "the" regardless of where it was. The regex itself works for matching every "the" no matter where it is. I have not worked on my coding skills in a very long time, was working on old code, and forgot what the code I had written before did. lol
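For anyone hitting the same thing, here is a rough sketch of the difference. The highlighting code itself is not shown in this thread, so `naiveStarts` is only an assumed reconstruction of the bug (a plain substring search run separately from the regex), and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HighlightSpans {
    static final Pattern THE = Pattern.compile("\\b[Tt]he\\b");

    // Spans taken from the matcher itself: {start, end} offsets of each match.
    static List<int[]> matcherSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = THE.matcher(text);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
        }
        return spans;
    }

    // Assumed shape of the bug: for each regex hit, highlight the NEXT
    // plain substring "the", wherever it happens to be.
    static List<Integer> naiveStarts(String text, int howMany) {
        List<Integer> starts = new ArrayList<>();
        int from = 0;
        for (int i = 0; i < howMany; i++) {
            int at = text.indexOf("the", from);
            if (at < 0) {
                break;
            }
            starts.add(at);
            from = at + 3;
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "I just hit a mother load!\nHow did you like the play?";
        List<int[]> spans = matcherSpans(text);
        System.out.println(spans.size());                        // prints 1
        System.out.println(spans.get(0)[0]);                     // prints 43
        System.out.println(naiveStarts(text, spans.size()));     // prints [15]
    }
}
```

The regex finds exactly one match (the word on the second line), but the separate substring search highlights the "the" inside "mother" instead. Highlighting from `m.start()` and `m.end()` keeps the two in sync.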

EDIT: Thanks Tony!
 