Regular Expression Help

 
T Vinod Kumar
Greenhorn
Posts: 7
Hi,

For a search application we are developing, I need to read sentences from a file. Sentences are delimited by a full stop. The readLine() method in BufferedReader reads an entire paragraph. To split it into sentences, I use the public String[] split(String regex) method of the String class as follows.

String[] sentencesInParagraph = paragraph.split("[.]");

The problem is that there are words like B.C. in the paragraph, and "B" and "C" get read as separate sentences. Is it possible to construct a regular expression that splits the paragraph on the full stop in general, but excludes specific abbreviations like Ph.D., Mr., etc.?

Thanks in Advance,
Vinod.
 
John de Michele
Rancher
Posts: 600
Vinod:

Welcome to Java Ranch!

Just for clarification, are you saying that the sentences in your text run with no newlines or carriage returns?

John.
 
jittu goud
Ranch Hand
Posts: 46
Can you try the following instead of [.]?

Line terminators

A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

* A newline (line feed) character ('\n'),
* A carriage-return character followed immediately by a newline character ("\r\n"),
* A standalone carriage-return character ('\r'),

source http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
 
jittu goud
Ranch Hand
Posts: 46
Or, since a full stop would normally be followed by white space, you can do this:


String[] sentencesInParagraph = paragraph.split("[.][\\s]");
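
To see the difference between the two patterns, here is a minimal sketch. The class name, method names, and sample paragraph are made up for illustration; only String.split is taken from the posts above.

```java
import java.util.Arrays;

final class SplitComparison {
    // Splits on every full stop, as in the original question.
    static String[] splitOnDot(String paragraph) {
        return paragraph.split("[.]");
    }

    // Splits only on a full stop that is followed by one whitespace character.
    static String[] splitOnDotSpace(String paragraph) {
        return paragraph.split("[.]\\s");
    }

    public static void main(String[] args) {
        // Hypothetical sample text.
        String paragraph = "Rome fell. Caesar died in 44 B.C. and J. Goud wrote about it.";

        // "B.C." is torn apart here: "B" ends one piece and "C" is a piece of its own.
        System.out.println(Arrays.toString(splitOnDot(paragraph)));

        // "B.C" stays together here, but the text is still cut after "B.C." and after "J.".
        System.out.println(Arrays.toString(splitOnDotSpace(paragraph)));
    }
}
```

Note that split() consumes the delimiter, so the full stops that were matched do not appear in the resulting pieces.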
 
Campbell Ritchie
Marshal
Posts: 59700
That will only work until somebody calls you J. Goud . . .
 
T Vinod Kumar
Greenhorn
Posts: 7

John de Michele wrote: Vinod:

Welcome to Java Ranch!

Just for clarification, are you saying that the sentences in your text run with no newlines or carriage returns?

John.



Yes, John. And sorry for the delay in responding.
 
T Vinod Kumar
Greenhorn
Posts: 7

jittu goud wrote: Can you try the following instead of [.]?

Line terminators

A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

* A newline (line feed) character ('\n'),
* A carriage-return character followed immediately by a newline character ("\r\n"),
* A standalone carriage-return character ('\r'),

source http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html



Sorry, jittu... I have no control over the format of the input documents. It's actually the output produced by a tool called Nutch, which is used to crawl web pages.
 
T Vinod Kumar
Greenhorn
Posts: 7

jittu goud wrote: Or, since a full stop would normally be followed by white space, you can do this:


String[] sentencesInParagraph = paragraph.split("[.][\\s]");




Thanks, jittu, I overlooked that. It's definitely an improvement over the code that I had. That should handle most scenarios. I will get back to you if I run into problems.

Vinod
 
T Vinod Kumar
Greenhorn
Posts: 7

Campbell Ritchie wrote: That will only work until somebody calls you J. Goud . . .



Yes, it doesn't work in all scenarios. But it's a start, I think.

Vinod
 
Rob Spoor
Sheriff
Posts: 21421
There is no foolproof algorithm for this. Even human beings can misread a sentence, ending early or late because they miss a period or misinterpret a period as something that is not the sentence end.
 
T Vinod Kumar
Greenhorn
Posts: 7

Rob Prime wrote: There is no foolproof algorithm for this. Even human beings can misread a sentence, ending early or late because they miss a period or misinterpret a period as something that is not the sentence end.



Yes, Rob. But I am wondering if it might be possible to keep the known "spoilsports" like Ph.D., B.C., Mr., Dr., etc. in a list and have the split method ignore the full stops in them while splitting.
 
Rob Spoor
Sheriff
Posts: 21421
Sure you can. You may split them initially, but you can paste them back together.

For instance, if the last "sentence" ends with "<>Ph." (with <> being anything other than numbers and letters), and the next sentence equals "D.", then you should paste these two sentences and the next one into one larger sentence.
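
A rough sketch of that split-then-paste idea follows. Everything in it is an assumption for illustration: the class name, the short abbreviation list, and the choice to split with a lookbehind so that each fragment keeps its full stop, which makes "ends with Ph.D." easy to test. Here the fragments are only cut where whitespace follows a full stop, so "Ph.D." arrives as one token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SentenceMerger {
    // Hypothetical, incomplete list of abbreviations, each with its trailing full stop.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Ph.D.", "B.C.", "Mr.", "Dr."));

    static List<String> splitSentences(String paragraph) {
        // First pass: cut at whitespace that follows a full stop; the full stop stays
        // attached to the fragment because the lookbehind does not consume it.
        String[] fragments = paragraph.trim().split("(?<=[.])\\s+");

        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String fragment : fragments) {
            if (fragment.isEmpty()) {
                continue;
            }
            if (current.length() > 0) {
                // Paste back together; the original whitespace becomes one space.
                current.append(' ');
            }
            current.append(fragment);
            if (!endsWithAbbreviation(fragment)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            // The text ended on an abbreviation; keep what was collected.
            sentences.add(current.toString());
        }
        return sentences;
    }

    // True when the last whitespace-delimited word of the fragment is a known abbreviation.
    private static boolean endsWithAbbreviation(String fragment) {
        String[] words = fragment.split("\\s+");
        return ABBREVIATIONS.contains(words[words.length - 1]);
    }
}
```

With the hypothetical input "Dr. Smith got his Ph.D. in 1990. He was born in 44 B.C. or so. The end." this returns three sentences. It still goes wrong when a sentence really does end with one of the abbreviations, because the next sentence is then pasted onto it.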
 
T Vinod Kumar
Greenhorn
Posts: 7

Rob Prime wrote: Sure you can. You may split them initially, but you can paste them back together.

For instance, if the last "sentence" ends with "<>Ph." (with <> being anything other than numbers and letters), and the next sentence equals "D.", then you should paste these two sentences and the next one into one larger sentence.



Hmm, but I will need to check so many conditions. My code will be FULL of if statements.
 
Henry Wong
author
Sheriff
Posts: 23566

Hmm, but I will need to check so many conditions. My code will be FULL of if statements.




Well... one option is to place all those possibilities in an array, and check all those conditions in a loop. It is still running lots of "if" statements -- but in source code, there is only one.

Henry
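
A minimal sketch of the array-and-loop idea, assuming each fragment still carries its trailing full stop. The class name, method names, and the abbreviation list are hypothetical; the boundary check borrows the earlier suggestion that the character before the abbreviation should not be a letter or a number.

```java
final class AbbreviationCheck {
    // Hypothetical, deliberately short list; extend it as needed.
    private static final String[] ABBREVIATIONS = {"Ph.D.", "B.C.", "Mr.", "Dr."};

    // Returns true when the fragment ends with one of the known abbreviations.
    static boolean endsWithAbbreviation(String fragment) {
        for (String abbreviation : ABBREVIATIONS) {
            // The only "if" in the source; it runs once per array entry.
            if (fragment.endsWith(abbreviation) && startsAtBoundary(fragment, abbreviation)) {
                return true;
            }
        }
        return false;
    }

    // The abbreviation must start the fragment or follow a character
    // that is neither a letter nor a digit.
    private static boolean startsAtBoundary(String fragment, String abbreviation) {
        int start = fragment.length() - abbreviation.length();
        return start == 0 || !Character.isLetterOrDigit(fragment.charAt(start - 1));
    }
}
```

The comparison is case sensitive, so "dr." would not match "Dr." unless it is added to the array as well.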
 
Henry Wong
author
Sheriff
Posts: 23566

Hmmmm..... As an alternative, how about adding a requirement that a sentence have more than one word? So, you can have the regex only look for sentences that have a period, and more than one word.

Of course, if you do that, you may have to change your code to use the find() method, as the split() method doesn't seem to work with zero-width lookahead/lookbehind in all cases.

Henry
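
One way the "more than one word" rule could look with find() is sketched below. The pattern is an assumption, not a tested recipe for real text: a sentence is taken to be at least two whitespace-separated words ending in a full stop that is followed by whitespace or the end of the input. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultiWordSentences {
    // One word, then at least one more word (matched lazily), then a full stop
    // that is followed by whitespace or the end of the input.
    private static final Pattern SENTENCE =
            Pattern.compile("\\S+(?:\\s+\\S+)+?[.](?=\\s|$)");

    static List<String> findSentences(String paragraph) {
        List<String> sentences = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(paragraph);
        while (matcher.find()) {
            sentences.add(matcher.group());
        }
        // Trailing text without a final full stop is not returned.
        return sentences;
    }
}
```

For the hypothetical input "Dr. Goud wrote this. J. Smith read it." this yields two sentences, because "Dr." and "J." are single words and cannot end a match on their own. An abbreviation in the middle of a sentence, as in "and J. Goud", still ends the match early, so this rule would probably need to be combined with an abbreviation list.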
 