
Regular Expression Help

 
T Vinod Kumar
Greenhorn
Posts: 7
Hi,

For a search application we are developing, I need to read sentences from a file. Sentences are delimited by full stops. The readLine() method in BufferedReader reads an entire paragraph. To split it into sentences, I use the public String[] split(String regex) method of the String class as follows.

String[] sentencesInParagraph = paragraph.split("[.]");

The problem is that there are words like B.C. in the paragraph, and "B" and "C" get read as separate sentences. Is it possible to construct a regular expression that splits the paragraph on the full stop in general, but excludes specific abbreviations like Ph.D., Mr., etc.?

Thanks in Advance,
Vinod.
 
John de Michele
Rancher
Posts: 600
Vinod:

Welcome to Java Ranch!

Just for clarification, are you saying that the sentences in your text run with no newlines or carriage returns?

John.
 
jittu goud
Ranch Hand
Posts: 46
Can you try the following instead of [.]?

Line terminators

A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

* A newline (line feed) character ('\n'),
* A carriage-return character followed immediately by a newline character ("\r\n"),
* A standalone carriage-return character ('\r'),

source http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
 
jittu goud
Ranch Hand
Posts: 46
Or, since a full stop is normally followed by whitespace, you can do this:


String[] sentencesInParagraph = paragraph.split("[.][\\s]");
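A minimal sketch of this idea, for illustration only. The class name PeriodWhitespaceSplit and the sample paragraph are made up here, and the lookbehind (?<=[.]) is an addition to the pattern above: it keeps the full stop attached to its sentence, whereas splitting on "[.][\\s]" consumes it.

```java
import java.util.Arrays;

final class PeriodWhitespaceSplit {
    // Splits on whitespace that directly follows a full stop.
    // The lookbehind (?<=[.]) is zero-width, so only the whitespace is
    // consumed and every sentence keeps its closing full stop.
    static String[] split(String paragraph) {
        return paragraph.trim().split("(?<=[.])\\s+");
    }

    public static void main(String[] args) {
        // Hypothetical sample text.
        String paragraph = "Rome was founded in 753 B.C. according to legend. It grew quickly.";
        // prints [Rome was founded in 753 B.C., according to legend., It grew quickly.]
        System.out.println(Arrays.toString(split(paragraph)));
    }
}
```

"B.C." is no longer torn into "B" and "C", but an abbreviation that is followed by a space still ends a sentence under this rule.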
 
Campbell Ritchie
Marshal
Posts: 56223
That will only work until somebody calls you J. Goud . . .
 
T Vinod Kumar
Greenhorn
Posts: 7
John de Michele wrote:Vinod:

Welcome to Java Ranch!

Just for clarification, are you saying that the sentences in your text run with no newlines or carriage returns?

John.


Yes, John. And sorry for the delay in responding.
 
T Vinod Kumar
Greenhorn
Posts: 7
jittu goud wrote:can you try the following ...instead of [.]

Line terminators

A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

* A newline (line feed) character ('\n'),
* A carriage-return character followed immediately by a newline character ("\r\n"),
* A standalone carriage-return character ('\r'),

source http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html


Sorry, jittu. I have no control over the format of the input documents. It's actually the output produced by a tool called Nutch, which is used to crawl web pages.
 
T Vinod Kumar
Greenhorn
Posts: 7
jittu goud wrote:or normally a fullstop would be followed by white space you can do this


String[] sentencesInParagraph = paragraph.split("[.][\\s]");



Thanks, jittu, I overlooked that. It's definitely an improvement over the code that I had. That should handle most scenarios. I will get back to you if I run into problems.

Vinod
 
T Vinod Kumar
Greenhorn
Posts: 7
Campbell Ritchie wrote:That will only work until somebody calls you J. Goud . . .


Yes, it doesn't work in all scenarios. But it's a start, I think.

Vinod
 
Rob Spoor
Sheriff
Posts: 21117
There is no foolproof algorithm for this. Even human beings can misread a sentence, ending early or late because they miss a period or misinterpret a period as something that is not the sentence end.
 
T Vinod Kumar
Greenhorn
Posts: 7
Rob Prime wrote:There is no foolproof algorithm for this. Even human beings can misread a sentence, ending early or late because they miss a period or misinterpret a period as something that is not the sentence end.


Yes, Rob. But I am wondering if it might be possible to keep known "spoilsports" like Ph.D., B.C., Mr., Dr., etc. in a list and have the split method ignore the full stops in them while splitting.
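One way such a list might be expressed inside the pattern itself is a negative lookbehind. This is only a sketch under stated assumptions: the class name LookbehindSplit, the four abbreviations, and the sample paragraph are hypothetical, and a real list would be much longer.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class LookbehindSplit {
    // Matches whitespace that follows a full stop, unless that full stop
    // closes one of the listed abbreviations (illustrative list only).
    private static final Pattern SENTENCE_GAP =
            Pattern.compile("(?<=[.])(?<!\\b(?:Mr|Dr|Ph[.]D|B[.]C)[.])\\s+");

    static String[] split(String paragraph) {
        return SENTENCE_GAP.split(paragraph.trim());
    }

    public static void main(String[] args) {
        // Hypothetical sample text.
        String paragraph = "Dr. Smith arrived in 44 B.C. with Mr. Jones. They talked.";
        // prints [Dr. Smith arrived in 44 B.C. with Mr. Jones., They talked.]
        System.out.println(Arrays.toString(split(paragraph)));
    }
}
```

The trade-off is that a sentence which really does end with a listed abbreviation gets joined to the sentence that follows it.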
 
Rob Spoor
Sheriff
Posts: 21117
Sure you can. You may split them initially, but you can paste them back together.

For instance, if the last "sentence" ends with "<>Ph." (with <> being anything other than numbers and letters), and the next sentence equals "D.", then you should paste these two sentences and the next one into one larger sentence.
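The "Ph." / "D." rule described above could be sketched roughly as follows. The class name PasteBackDemo and the sample text in the comment are hypothetical, and the split uses the lookbehind (?<=[.]) so that each piece keeps its full stop, which is an assumption added here (split("[.]") would drop the full stops).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class PasteBackDemo {
    // "Ph." at the very start of a piece, or preceded by a character that
    // is neither a letter nor a digit.
    private static final Pattern ENDS_WITH_PH =
            Pattern.compile("(?s)(?:.*[^\\p{Alnum}])?Ph[.]");

    static List<String> sentences(String paragraph) {
        // Naive first pass: cut after every full stop, keeping the full stop.
        String[] pieces = paragraph.split("(?<=[.])");
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < pieces.length) {
            StringBuilder current = new StringBuilder(pieces[i++]);
            // Second pass: "...Ph." followed by the piece "D." is one word,
            // so paste "D." and the continuation back onto the sentence.
            while (i < pieces.length
                    && ENDS_WITH_PH.matcher(current).matches()
                    && pieces[i].equals("D.")) {
                current.append(pieces[i++]);
                if (i < pieces.length) {
                    current.append(pieces[i++]);
                }
            }
            result.add(current.toString().trim());
        }
        return result;
    }
}
```

For the input "He earned a Ph.D. in physics. Then he taught." this yields two sentences instead of four pieces. It handles only this one abbreviation, and it always pastes the following piece, so a sentence that ends in "Ph.D." is merged with the next one.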
 
T Vinod Kumar
Greenhorn
Posts: 7
Rob Prime wrote:Sure you can. You may split them initially, but you can paste them back together.

For instance, if the last "sentence" ends with "<>Ph." (with <> being anything other than numbers and letters), and the next sentence equals "D.", then you should paste these two sentences and the next one into one larger sentence.


Hmm, but I will need to check so many conditions. My code will be full of if statements.
 
Henry Wong
author
Sheriff
Posts: 23292
T Vinod Kumar wrote:hmmm.. but i will need to check so many conditions.. my code will b FULL of if statements..



Well... one option is to place all those possibilities in an array, and check all those conditions in a loop. It is still running lots of "if" statements -- but in source code, there is only one.
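A rough sketch of the array-and-loop idea, not a definitive implementation. The class name AbbreviationAwareSplitter, the contents of ABBREVIATIONS, and the choice to split on whitespace after a full stop before re-joining are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class AbbreviationAwareSplitter {
    // Hypothetical, deliberately short list; a real one would be longer.
    private static final String[] ABBREVIATIONS = {"Mr.", "Dr.", "Ph.D.", "B.C."};

    // The single "if" in the source; the loop runs it once per abbreviation.
    static boolean endsWithAbbreviation(String text) {
        for (String abbreviation : ABBREVIATIONS) {
            int start = text.length() - abbreviation.length();
            if (text.endsWith(abbreviation)
                    && (start == 0 || !Character.isLetterOrDigit(text.charAt(start - 1)))) {
                return true;
            }
        }
        return false;
    }

    static List<String> sentences(String paragraph) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Candidate boundaries: whitespace that follows a full stop.
        for (String piece : paragraph.trim().split("(?<=[.])\\s+")) {
            if (piece.isEmpty()) {
                continue;
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(piece);
            // Close the sentence only if it does not end with an abbreviation.
            if (!endsWithAbbreviation(current.toString())) {
                result.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }
}
```

With the input "Dr. Smith met Mr. Jones in 44 B.C. near Rome. They talked." this returns two sentences. Adding an abbreviation means adding one array entry rather than another if statement, although a sentence that truly ends with a listed abbreviation is still glued to the next one.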

Henry
 
Henry Wong
author
Sheriff
Posts: 23292

Hmm. As an alternative, how about adding a requirement that a sentence have more than one word? The regex would then only look for sentences that end with a period and contain more than one word.

Of course, if you do that, you may have to change your code to use the find() method, as the split() method doesn't seem to work with zero-width lookahead/lookbehind in all cases.
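One possible reading of the "more than one word" requirement, using Matcher.find(), is sketched below. The class name MultiWordSentenceFinder and the exact pattern are assumptions made for illustration, not a tested recipe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultiWordSentenceFinder {
    // A first word, then lazily one or more further words, stopping at the
    // first point that is preceded by a full stop and followed by
    // whitespace or the end of the input.
    private static final Pattern SENTENCE =
            Pattern.compile("\\S+(?:\\s+\\S+)+?(?<=[.])(?=\\s|$)");

    static List<String> sentences(String paragraph) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(paragraph);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }
}
```

For "J. Goud wrote this. It works." this finds "J. Goud wrote this." and "It works.", because "J." alone is a single word and cannot be a sentence. It does not help when the text before an abbreviation already has two or more words ("I met Mr. Jones." still breaks after "Mr."), and a one-word sentence at the very end of the input is not matched at all.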

Henry
 