Bill Hogsett wrote:Any suggestions?
"Leadership is nature's way of removing morons from the productive flow" - Dogbert
Articles by Winston can be found here
Bill Hogsett wrote:While westers and wester are not common words, I would like to treat them as words and get rid of the leading dash. But since I may be handling large documents I don't want to slow the split down much.
Harsha Smith wrote:single regex to answer all your questions
Bill Hogsett wrote:Thanks Harsha, that got me closer, but missed a few characters (e.g., ' "_). I am now using:
"([\\[_\"()*#.,?!:;]|\\s|'\\-|\\-\\-|'|'\\-\\-)"
My program reports two words that I cannot explain. They are:
-and 2
-when 1
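One possible explanation for entries like -and, offered as a guess because the input text is not shown: Java tries regex alternatives from left to right, so in the posted pattern '\\- wins before '\\-\\- ever gets a chance. If the text contains an apostrophe followed by two dashes, the split consumes only the apostrophe and the first dash, and the second dash stays attached to the next word. The sketch below reproduces that on a made-up sample string. The class name, the helper method, and the reordered pattern are hypothetical illustrations, not code from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    // The pattern exactly as posted in the thread.
    static final Pattern POSTED =
        Pattern.compile("([\\[_\"()*#.,?!:;]|\\s|'\\-|\\-\\-|'|'\\-\\-)");

    // Hypothetical variant: an optional apostrophe followed by a run of two
    // or more dashes is tried before the shorter apostrophe alternatives.
    static final Pattern REORDERED =
        Pattern.compile("([\\[_\"()*#.,?!:;]|\\s|'?\\-{2,}|'\\-|')");

    // Splits the text with the given pattern and drops empty strings.
    static List<String> words(Pattern pattern, String text) {
        List<String> out = new ArrayList<>();
        for (String piece : pattern.split(text)) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "go'--and stop"; // made-up input, not Bill's text
        System.out.println(words(POSTED, sample));    // [go, -and, stop]
        System.out.println(words(REORDERED, sample)); // [go, and, stop]
    }
}
```

Neither pattern treats a single dash as a delimiter, so a lone dash directly in front of a word would still come through as part of that word.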
Stephan van Hulst wrote:...but here is how you could do it using a scanner (yes, using a single regex, sorry)
Stephan van Hulst wrote:I agree with Winston about using single regexes, but here is how you could do it using a scanner (yes, using a single regex, sorry):
Harsha Smith wrote:Can you specify all the requirements and explain in detail, with examples, how you want the words to be split? One of us will definitely provide you with a very good regex pattern based on the spec.
Please include a large sample text.
Stephan van Hulst wrote:See, what English speakers would normally identify as words doesn't really compute, unless you incorporate a dictionary and some pretty complex code.
The code I gave you should handle most of your cases, except for words ending with an apostrophe. You will have to discard the apostrophe after you have scanned a token.
It's a pity the IsAlphabetic class doesn't work. Try with \\p{Alpha} instead.
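Stephan's original code is not reproduced here, so the following is only a rough sketch of what a Scanner-based approach along these lines might look like. It assumes a token is any run of letters and apostrophes, and that everything else (whitespace, dashes, punctuation) is a delimiter. The class and method names are hypothetical. Note that \\p{Alpha} in Java matches ASCII letters only unless the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerWordsDemo {
    // Assumption: one or more characters that are neither an ASCII letter
    // nor an apostrophe form a single delimiter.
    static final String DELIMITER = "[^\\p{Alpha}']+";

    // Collects every token the scanner finds in the text.
    static List<String> scan(String text) {
        List<String> out = new ArrayList<>();
        Scanner scanner = new Scanner(text).useDelimiter(DELIMITER);
        while (scanner.hasNext()) {
            out.add(scanner.next());
        }
        scanner.close();
        return out;
    }

    public static void main(String[] args) {
        // Prints [Don't, stop, the, dogs', bones]
        System.out.println(scan("Don't stop--the dogs' bones!"));
    }
}
```

As described above, a word ending with an apostrophe (dogs' in the sample) keeps it, so that has to be discarded after scanning. This sketch also splits hyphenated words into their parts, which may or may not be what you want.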
Stephan van Hulst wrote:Don't worry, I don't think he is :P
Bill, you can easily remove the apostrophes with simple code. Just check if the char at index 0 is an apostrophe, and if it is, take the substring at index 1. I'm sure you can handle the case where there's an apostrophe at the end too.
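A minimal sketch of that check, covering both the leading and the trailing apostrophe. The class and method names are made up for illustration, and it removes at most one apostrophe from each end.

```java
class ApostropheTrimDemo {
    // Removes one leading and one trailing apostrophe, if present.
    // Apostrophes inside the word (as in "don't") are left alone.
    static String trimApostrophes(String token) {
        int start = 0;
        int end = token.length();
        if (start < end && token.charAt(start) == '\'') {
            start++;
        }
        if (start < end && token.charAt(end - 1) == '\'') {
            end--;
        }
        return token.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(trimApostrophes("'tis"));   // tis
        System.out.println(trimApostrophes("dogs'"));  // dogs
        System.out.println(trimApostrophes("don't"));  // don't
    }
}
```

A token that consists of nothing but an apostrophe comes back as an empty string, so the caller should skip empty results before counting.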
Stephan van Hulst wrote:Bill, you can easily remove the apostrophes with simple code...