
Regex: Keep Single dashes Negative Lookahead

 
Bill Hogsett
Greenhorn
Posts: 9
I am breaking a text file into words for processing, but I want words that contain a single dash to be treated as a single word.

For example "Oh-wow" should be one word and "But--not--this" should be three words. The double dashes could be at the start, middle or end of a word.

I think I need to use negative look ahead, but am not sure of that and am not sure how to do it.

My current pattern is:



But it does not work.

My normal test file is the Gutenberg Project's Moby Dick.txt.

Any suggestions?

Thanks.

Bill Hogsett
 
Stephan van Hulst
Bartender
Posts: 6327
78
I haven't tested this, and I don't use regexes that often, but maybe this will give you an idea:
This pattern essentially says: Match anything that starts with at least one letter, followed by zero or more groups that start with a dash and at least one letter.
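The pattern itself did not survive in this copy of the thread. A minimal sketch of what the description above could look like, assuming ASCII letters only; the class and method names are made up for illustration and are not from the original post. Note that no negative lookahead is needed: a dash is only consumed when a letter follows it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post.
final class DashedWords {
    // One or more letters, then zero or more groups of "one dash + one or more letters".
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+(?:-[A-Za-z]+)*");

    // Collects every match of WORD, in order of appearance.
    static List<String> find(String text) {
        List<String> words = new ArrayList<String>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Oh-wow, But, not, this]
        System.out.println(find("Oh-wow But--not--this"));
    }
}
```

As written, this would break a contraction such as "don't" into "don" and "t", because the apostrophe is not part of the pattern.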
 
Winston Gutkowski
Bartender
Pie
Posts: 10527
64
Eclipse IDE Hibernate Ubuntu
Bill Hogsett wrote:Any suggestions?

Yes. Don't try to do it all with regexes (or at least not all at once).

I agree with Harsha that String.split() is probably what you want initially, although I think I'd probably go with
sentence.split("\\s+")
myself.

That splits your text into whitespace-delimited "words". Once you have those, then decide what a word really is.

You might even want to return the words as a List, so that you can split up existing ones if need be. For example:
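The example that followed was lost in this copy of the thread. A rough sketch of the two-step idea, assuming the only refinement wanted is to break whitespace-delimited chunks at runs of two or more dashes; the names are hypothetical and punctuation is deliberately left alone here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the code from the original post.
final class WhitespaceFirst {
    static List<String> words(String sentence) {
        List<String> result = new ArrayList<String>();
        // Step 1: whitespace-delimited "words".
        for (String chunk : sentence.trim().split("\\s+")) {
            // Step 2: decide what a word really is; here, split at double dashes only.
            for (String part : chunk.split("-{2,}")) {
                if (!part.isEmpty()) {
                    result.add(part);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Oh-wow, But, not, this]
        System.out.println(words("Oh-wow But--not--this"));
    }
}
```

Because the result is a List, further rules (stripping punctuation, handling apostrophes) can be added as extra passes over each chunk without touching the first split.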
Winston
 
Harsha Smith
Ranch Hand
Posts: 287
That was my 100th post. Hope OP and others find it useful.
 
Winston Gutkowski
Bartender
Pie
Posts: 10527
64
Eclipse IDE Hibernate Ubuntu
Harsha Smith wrote:That was my 100th post.

Congrats.

Winston
 
Bill Hogsett
Greenhorn
Posts: 9
Thanks Harsha! 100 posts. That is great. Having communities like this really helps.

Here is what I am using now:



and then I call:




}

Using your \\s for the original split didn't strip out punctuation.

Is there a better way to do the first split?

Thanks.

Bill
 
Bill Hogsett
Greenhorn
Posts: 9
Harsha, I now have one (I hope) problem.

In the following from Moby Dick, I end up with a word "-Westers".

" So that Monsoons, Pampas, Nor'-Westers,
Harmattans, Trades; any wind but the Levanter and Simoon, might
blow Moby Dick into the devious zig-zag world-circle of the Pequod's
circumnavigating wake."

And here I get "-wester":

"Here comes another with a sou'-wester and a bombazine cloak."

While westers and wester are not common words, I would like to treat them as words and get rid of the leading dash. But since I may be handling large documents I don't want to slow the split down much.

Bill
 
Winston Gutkowski
Bartender
Pie
Posts: 10527
64
Eclipse IDE Hibernate Ubuntu
Bill Hogsett wrote:While westers and wester are not common words, I would like to treat them as words and get rid of the leading dash. But since I may be handling large documents I don't want to slow the split down much.

I honestly wouldn't worry about it. What is Moby Dick: 100,000 words? Any loop will process that in a split-second. It's far more likely that your delay will be with I/O.

Winston
 
Harsha Smith
Ranch Hand
Posts: 287
single regex to answer all your questions
 
Campbell Ritchie
Sheriff
Pie
Posts: 50241
79
And that even keeps “zig-zag” as one word.
 
Bill Hogsett
Greenhorn
Posts: 9
Harsha Smith wrote:single regex to answer all your questions


Thanks Harsha, that got me closer, but missed a few characters (e.g., ' "_). I am now using:

"([\\[_\"()*#.,?!:;]|\\s|'\\-|\\-\\-|'|'\\-\\-)"

My program reports two words that I cannot understand. They are:

-and 2
-when 1

The numbers are the number of usages in Moby Dick. Looking at the document and searching for --and|when I don't see any pattern that would get those results. Melville used "--" preceded by a character (e.g., '-- :-- ;-- !--) but none of them seem to show me a pattern to use or to get those results.

Any suggestions? (I can live with what I have, but ...)

One last question. Your code did not output anything when I ran it in NetBeans. I had to do String[] wordarr = words.split(regex) before the loop and then use wordarr in the loop. Does that make sense to you? I haven't tested outside of NetBeans. Running 1.6 in NetBeans.

Thanks again.

Bill
 
Winston Gutkowski
Bartender
Pie
Posts: 10527
64
Eclipse IDE Hibernate Ubuntu
Bill Hogsett wrote:Thanks Harsha, that got me closer, but missed a few characters (e.g., ' "_). I am now using:

"([\\[_\"()*#.,?!:;]|\\s|'\\-|\\-\\-|'|'\\-\\-)"

My program reports two words that I cannot understand. They are:

-and 2
-when 1

I'll say it one more time, just in case you missed it earlier: trying to do all this with a single regex is likely to be:
(a) time-consuming
(b) error-prone
(c) a source of code (or at least an expression) that is hard for anyone else to decipher and/or change if they need to.
and I say this as a 15-year Unix System Administrator, so I love regex.

If you do as was suggested earlier and break the problem down into 2 parts:
1. Get your whitespace-delimited words.
2. Check each word for a valid pattern.
I suspect you'll have a far more flexible solution.

Just one of the things you would then be able to do is to print out the actual word (or words) that contains "-and", along with some indication of where it was found.
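As a sketch of that kind of check, here is the delimiter regex quoted above run against a made-up sample string. It suggests one possible source of "-and", though I have not verified it against the actual Moby Dick text: Java tries alternatives left to right, so on `'--and` the alternative `'\\-` wins before `'\\-\\-` is ever tried, leaving `-and` behind. The reordered variant and the class name are my own assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic, not code from the thread.
final class DelimiterOrder {
    // The delimiter regex exactly as quoted in the post.
    static final String ORIGINAL = "([\\[_\"()*#.,?!:;]|\\s|'\\-|\\-\\-|'|'\\-\\-)";
    // Assumed fix: list the longest alternative ('--) before the shorter ones ('- and ').
    static final String REORDERED = "([\\[_\"()*#.,?!:;]|\\s|'\\-\\-|'\\-|\\-\\-|')";

    static List<String> split(String text, String regex) {
        return Arrays.asList(text.split(regex));
    }

    public static void main(String[] args) {
        String sample = "said'--and then"; // made-up sample, not a Moby Dick quote
        System.out.println(split(sample, ORIGINAL));  // prints [said, -and, then]
        System.out.println(split(sample, REORDERED)); // prints [said, and, then]
    }
}
```

Printing each suspicious token together with the line it came from, as suggested, would confirm or rule this out for the real text.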

Winston
 
Stephan van Hulst
Bartender
Posts: 6327
78
I agree with Winston about using single regexes, but here is how you could do it using a scanner (yes, using a single regex, sorry):
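The code from this post was not preserved. A minimal sketch of the Scanner approach, assuming the delimiter that shows up in the error message quoted further down the thread, with two adjustments of mine: `\\p{Alpha}` (ASCII letters) replaces `\\p{IsAlphabetic}`, which is only understood from Java 7 onward, and the alternation is wrapped in a repeated group so that adjacent delimiters such as `!--` do not yield empty tokens. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical reconstruction, not the original code.
final class ScannerWords {
    // A delimiter is a run of: two or more dashes, or any character
    // that is not a letter, an apostrophe or a dash.
    static final String DELIMITER = "(?:-{2,}|[^\\p{Alpha}'-])+";

    static List<String> words(String text) {
        List<String> result = new ArrayList<String>();
        Scanner scanner = new Scanner(text).useDelimiter(DELIMITER);
        while (scanner.hasNext()) {
            result.add(scanner.next());
        }
        scanner.close();
        return result;
    }

    public static void main(String[] args) {
        // prints [killed, a, big, whale, Moby, Dick]
        System.out.println(words("killed!--a big whale--:Moby Dick"));
    }
}
```

Single dashes and apostrophes stay inside tokens, so "zig-zag" and "don't" survive intact, but leading or trailing apostrophes are kept too and need a separate clean-up step.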
 
Winston Gutkowski
Bartender
Pie
Posts: 10527
64
Eclipse IDE Hibernate Ubuntu
Stephan van Hulst wrote:...but here is how you could do it using a scanner (yes, using a single regex, sorry)

Hey, no worries about a single regex, providing it's not too arcane. I quite like yours actually.

However, another thought struck me: so far most posts have been concentrating on "getting the delimiter pattern right". If you simply eliminate the whitespace, you could instead concentrate on getting the "word pattern" right. I have no idea whether it's any easier, but it appeals, simply because you're looking for something that's 'correct', rather than trying to eliminate something that's incorrect.

Winston
 
Harsha Smith
Ranch Hand
Posts: 287
Can you specify all the requirements and explain in detail, with examples, how you want the words to be split? One of us will definitely provide you with a very good regex pattern based on the spec.

Please include a big sample text.

 
Harsha Smith
Ranch Hand
Posts: 287
Tell us if this helps
[edit]Add newlines to make post easier to read. Please avoid long lines in code tags.[/edit]
 
Winston Gutkowski
Bartender
Pie
Posts: 10527
64
Eclipse IDE Hibernate Ubuntu
Harsha Smith wrote:Can you specify all the requirements and explain in detail, with examples, how you want the words to be split?

Another wrinkle for you (assuming this is English): possessives can sometimes end with an apostrophe, eg "The farmers' fields".

Winston
 
Bill Hogsett
Greenhorn
Posts: 9
Stephan van Hulst wrote:I agree with Winston about using single regexes, but here is how you could do it using a scanner (yes, using a single regex, sorry):


Thanks, but I cannot get this to run. The error is:

Exception in thread "main" java.util.regex.PatternSyntaxException: Unknown character property name {Alphabetic} near index 23
-{2,}|[^\p{IsAlphabetic}'-]+

Bill
 
Bill Hogsett
Greenhorn
Posts: 9
Harsha Smith wrote:Can you specify all the requirements and explain in detail, with examples, how you want the words to be split? One of us will definitely provide you with a very good regex pattern based on the spec.

Please include a big sample text.



Specification? I don't need no specification! In the words of a U.S. Supreme Court Justice on another topic, "I know it when I see it."

Seriously, I want to parse a text file and return words that English speakers would normally identify as words. So here are some examples:

"afterwards, he smoked." as three words with no punctuation. So remove punctuation.
"don't" and other contractions are maintained. (I do not think this is handled currently.)
"Oh-my-gosh" as one word.
"He--looking away--said stop" as five words.
"killed!--a big whale--:Moby Dick" as six words with no punctuation.
"Nor'--Wester": not sure here. Certainly "Wester" as a word, but let's go with "Nor" and not "Nor'" as a word.

You asked for a big test file. I can't figure out uploading here. Both .txt and .zip files are rejected. So, get Moby Dick here: Moby Dick

Thanks to everyone who has made suggestions. I have not overlooked the suggestion to simplify the regex and do this in steps.

Bill

 
Stephan van Hulst
Bartender
Posts: 6327
78
See, what English speakers would normally identify as words, that doesn't really compute, unless you incorporate a dictionary and some pretty complex code.

The code I gave you should handle most of your cases, except for words ending with an apostrophe. You will have to discard the apostrophe after you have scanned a token.

It's a pity the IsAlphabetic class doesn't work. Try with \\p{Alpha} instead.
 
Bill Hogsett
Greenhorn
Posts: 9
Stephan van Hulst wrote:See, what English speakers would normally identify as words, that doesn't really compute, unless you incorporate a dictionary and some pretty complex code.

The code I gave you should handle most of your cases, except for words ending with an apostrophe. You will have to discard the apostrophe after you have scanned a token.

It's a pity the IsAlphabetic class doesn't work. Try with \\p{Alpha} instead.


\\p{Alpha} works. Your pattern handles everything except apostrophes (both at the beginning and end of words). It nicely handles contractions. I can live with the apostrophe at the end. Can you suggest how to get the apostrophe from the beginning of words?

Thanks.

Bill

ps. My first uses suggest that using scanner for each test is slower than using scanner for each line and then using split with a compiled regex pattern.
 
Harsha Smith
Ranch Hand
Posts: 287
My suggestion is do the basic splitting using a simple regex. Then remove punctuation as shown in my code.

And Bill, don't be angry with us.


 
Stephan van Hulst
Bartender
Posts: 6327
78
Don't worry, I don't think he is :P

Bill, you can easily remove the apostrophes with simple code. Just check if the char at index 0 is an apostrophe, and if it is, take the substring at index 1. I'm sure you can handle the case where there's an apostrophe at the end too.
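Something along these lines would do it; a small sketch, assuming only the plain ASCII apostrophe (') needs stripping and that repeated leading or trailing apostrophes should all go. The names are hypothetical.

```java
// Hypothetical helper illustrating the index-0 / last-index check.
final class Apostrophes {
    // Removes apostrophes from both ends of a token; inner ones (as in "don't") are kept.
    static String strip(String word) {
        int start = 0;
        int end = word.length();
        while (start < end && word.charAt(start) == '\'') {
            start++;
        }
        while (end > start && word.charAt(end - 1) == '\'') {
            end--;
        }
        return word.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(strip("'em"));      // prints em
        System.out.println(strip("farmers'")); // prints farmers
        System.out.println(strip("don't"));    // prints don't
    }
}
```

A token made only of apostrophes comes back as an empty string, so the caller may want to skip empty results.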
 
Bill Hogsett
Greenhorn
Posts: 9
Stephan van Hulst wrote:Don't worry, I don't think he is :P

Bill, you can easily remove the apostrophes with simple code. Just check if the char at index 0 is an apostrophe, and if it is, take the substring at index 1. I'm sure you can handle the case where there's an apostrophe at the end too.


Thanks to everyone for the help. I have what I need and certainly can handle the apostrophe myself.

Harsha, I am not angry with you or anyone here. The forum has provided superlative assistance, code and advice to me.

I consider this closed, but will follow any future posts.

Bill
 
Harsha Smith
Ranch Hand
Posts: 287
Bill, please do come up with more challenging issues often. Won't you come to see us tomorrow? Have a nice day!
 
Winston Gutkowski
Bartender
Pie
Posts: 10527
64
Eclipse IDE Hibernate Ubuntu
Stephan van Hulst wrote:Bill, you can easily remove the apostrophes with simple code...

And don't forget that those stupid MS 'smart quotes' aren't apostrophes, even though they look like 'em (there's a good word with an apostrophe in front for you). I suspect Stephan's regex'll handle them though.

And then there's always stuff like fo'c'sle (actually, more properly: fo'c's'le)...

Winston
 