
Extracting sentences from a text file

 
Ayan Biswas
Ranch Hand
Posts: 104
I need to write a program that extracts sentences from a text file. If I use '.' as the delimiter and split the text on it, every letter of an acronym becomes a sentence of its own! How can I solve this problem?
 
Henry Wong
author
Marshal
Posts: 21405
Ayan Biswas wrote: I need to write a program that extracts sentences from a text file. If I use '.' as the delimiter and split the text on it, every letter of an acronym becomes a sentence of its own! How can I solve this problem?

One option is to further qualify your definition of what is a sentence. For example, if a sentence must be longer than one word, or longer than two letters, wouldn't that take care of your false positives from acronyms?

Henry
 
Ayan Biswas
Ranch Hand
Posts: 104
Henry Wong wrote: One option is to further qualify your definition of what is a sentence. For example, if a sentence must be longer than one word, or longer than two letters, wouldn't that take care of your false positives from acronyms?

Here is the problem if I follow that suggestion.
Suppose the sentence looks like this: "<some text> U.S.A<some text>". The problem will persist in that case.
 
Ayan Biswas
Ranch Hand
Posts: 104
"<some text> U" will be the first sentence. "S" will be the next sentence (which I can append to "U", because its word count is 1), and "A<some text>" will be the last sentence. So the problem persists in the last sentence.
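To make the failure concrete, here is a minimal sketch of that approach: split on every '.', then append any one-word fragment to the previous one. The class name, the method name and the sample text are my own assumptions for illustration, not code from this thread.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveSplitDemo {

    // Splits on every '.', then appends each one-word fragment to the fragment before it.
    static List<String> splitAndMerge(String text) {
        List<String> result = new ArrayList<>();
        for (String piece : text.split("\\.")) {
            String fragment = piece.trim();
            if (fragment.isEmpty()) {
                continue;
            }
            boolean oneWord = fragment.split("\\s+").length == 1;
            if (oneWord && !result.isEmpty()) {
                // Glue the fragment back on, restoring the dot that split() consumed.
                int last = result.size() - 1;
                result.set(last, result.get(last) + "." + fragment);
            } else {
                result.add(fragment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample text with an acronym in the middle of a sentence.
        for (String s : splitAndMerge("He moved to the U.S.A last year. He likes it.")) {
            System.out.println("[" + s + "]");
        }
        // [He moved to the U.S]
        // [A last year]
        // [He likes it]
    }
}
```

The "S" is merged back, but "A last year" has more than one word, so it still ends up as a separate sentence.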
 
Rob Spoor
Sheriff
Posts: 20608
Your definition of sentence end is not correct. A sentence doesn't necessarily end in a dot (or question mark, exclamation mark, etc). You could treat a dot, question mark or exclamation mark as the end of a sentence, but only if it is followed by whitespace (space, line break, tab, etc) or nothing at all (end of the String). This is the approach that Javadoc also uses.
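A minimal sketch of that rule, assuming a regular expression is acceptable: split at any run of whitespace that directly follows '.', '?' or '!'. The class and method names are hypothetical, and this is only an illustration of the rule, not Javadoc's actual code.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceRuleSplitter {

    // A sentence ends at '.', '?' or '!' only when whitespace (or the end of the text) follows.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        // The lookbehind keeps the terminator attached to the sentence it ends.
        for (String s : text.trim().split("(?<=[.?!])\\s+")) {
            if (!s.isEmpty()) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample text; the dots inside "U.S.A" are not followed by whitespace.
        for (String s : sentences("He moved to the U.S.A last year. Does he like it? Yes!")) {
            System.out.println("[" + s + "]");
        }
        // [He moved to the U.S.A last year.]
        // [Does he like it?]
        // [Yes!]
    }
}
```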

That's still flawed however, as the sentence would end with U.S.A. even if there's something after it. Javadoc also has this problem; I've seen several Javadoc comments in the summary list end with "i.e.". We need to redefine what a sentence end is. You can expand the previous definition to require that the next word starts with an uppercase letter. However, that will still be incorrect if an acronym is followed by a name or anything else that starts with an uppercase letter. It becomes evident that full sentence recognition is still not trivial (or even possible?) to do from code.
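A rough sketch of the expanded definition, again assuming a regular expression: a terminator ends a sentence only if it is followed by whitespace and the next character is an uppercase letter. The names and the sample text are hypothetical. The last two lines of output show the remaining false split when a name follows an acronym.

```java
import java.util.ArrayList;
import java.util.List;

class UppercaseRuleSplitter {

    // Split after '.', '?' or '!' only when whitespace and then an uppercase letter follow.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        for (String s : text.trim().split("(?<=[.?!])\\s+(?=\\p{Lu})")) {
            if (!s.isEmpty()) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Use short names, i.e. avoid long ones. "
                + "He moved to the U.S.A. last year. "
                + "In the U.S.A. Bob found work.";
        for (String s : sentences(text)) {
            System.out.println("[" + s + "]");
        }
        // [Use short names, i.e. avoid long ones.]
        // [He moved to the U.S.A. last year.]
        // [In the U.S.A.]      <- wrong: the name "Bob" follows the acronym
        // [Bob found work.]
    }
}
```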
 
Henry Wong
author
Marshal
Posts: 21405
This is why my response was to further qualify your definition of what is a sentence -- and the rest of the response was just examples.

Only the OP knows the exact definition of what is a sentence, and hence only the OP is able to qualify it correctly. Now, of course, if the definition is the one used in any generic text, then it is very difficult, if not impossible.

Henry
 
Ayan Biswas
Ranch Hand
Posts: 104
Thanks for all the replies.
 