
Text Summarization API

 
pradeep u nair
Greenhorn
Posts: 25
Hi friends,
I am developing a Java application in which I need to extract the text content from web pages and then summarize it based on a keyword given by the user. I have already extracted the text content from the web pages, but I still need to summarize it based on the given keyword. Are there any Java tools that can help me solve this problem, or can someone send me some code that breaks the text into short pieces?
Thanking you in advance,
Pradeep
 
Jan Groth
Ranch Hand
Posts: 456
There is no easy way to achieve this. It sounds like you need a search engine, which indexes the text for you.

BTW: if it is not a must, you can save yourself the detour of extracting the text from the web page...

Try Lucene: lucene.apache.org


regards,
jan
 
Ulf Dittmer
Rancher
Posts: 42972
I'm not aware of a text summarization API in Java. Lucene lets you index and search text, but it does not address summarization. I'm also not sure what you mean by "summarize it based on a keyword" - do you want to extract those parts of the text that deal with that particular keyword?
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
You need to parse the text into units that make sense to humans: phrases, sentences and paragraphs. Next, score those units according to the presence of the keyword(s). Then select the best of the units that are "hits" according to typical writing principles and the size of the summary you are aiming for.

What do I mean by writing principles? Think about how you yourself scan text.
For example, you expect the first sentence of a paragraph to be meaningful in terms of the content of that paragraph. You also expect a good chance that the last paragraph of an article summarizes the article.
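
To make the parse, score and select steps concrete, here is a rough sketch in Java. Treat it as one possible reading of the idea, not a finished summarizer. The class and method names (KeywordSummarizer, splitSentences, countHits, summarize) are made up for illustration, and the weighting is an arbitrary assumption: two points per keyword hit, plus one bonus point if the sentence opens a paragraph. Sentence splitting relies on java.text.BreakIterator from the JDK, and paragraphs are assumed to be separated by blank lines.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class KeywordSummarizer {

    /** Splits one paragraph into trimmed sentences using the JDK's BreakIterator. */
    static List<String> splitSentences(String paragraph) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(paragraph);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = paragraph.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    /** Counts whole-word, case-insensitive occurrences of the keyword. */
    static int countHits(String sentence, String keyword) {
        String wanted = keyword.toLowerCase(Locale.ENGLISH);
        int hits = 0;
        for (String word : sentence.toLowerCase(Locale.ENGLISH).split("[^\\p{L}\\p{Nd}]+")) {
            if (word.equals(wanted)) {
                hits++;
            }
        }
        return hits;
    }

    /** Returns up to maxSentences keyword "hits", best scored, in document order. */
    static List<String> summarize(String text, String keyword, int maxSentences) {
        List<String> sentences = new ArrayList<>();
        List<Integer> scores = new ArrayList<>();
        // Assumption: paragraphs are separated by blank lines.
        for (String paragraph : text.split("\\n\\s*\\n")) {
            boolean firstInParagraph = true;
            for (String sentence : splitSentences(paragraph)) {
                int hits = countHits(sentence, keyword);
                // Assumed weighting: 2 per hit, +1 if the hit opens a paragraph.
                scores.add(hits == 0 ? 0 : 2 * hits + (firstInParagraph ? 1 : 0));
                sentences.add(sentence);
                firstInParagraph = false;
            }
        }
        // Keep only sentences that mention the keyword at all.
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            if (scores.get(i) > 0) {
                candidates.add(i);
            }
        }
        // Highest score first; ties go to the earlier sentence.
        candidates.sort(Comparator.comparingInt((Integer i) -> scores.get(i))
                .reversed()
                .thenComparingInt(i -> i));
        int keep = Math.max(0, Math.min(maxSentences, candidates.size()));
        List<Integer> best = new ArrayList<>(candidates.subList(0, keep));
        // Restore document order so the summary reads naturally.
        best.sort(Comparator.naturalOrder());
        List<String> summary = new ArrayList<>();
        for (int i : best) {
            summary.add(sentences.get(i));
        }
        return summary;
    }

    public static void main(String[] args) {
        String text = "Lucene can index text. Search is fast with Lucene.\n\n"
                + "Cats are nice. A Lucene index helps Lucene users.";
        summarize(text, "lucene", 2).forEach(System.out::println);
    }
}
```

Putting the selected sentences back into document order is a design choice; without it the summary jumps around the text. A bonus for sentences in the last paragraph could be added to the score in the same way.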

In the prehistoric era of computers (showing my age now) there was an indexing technique called KWIC - Key Word In Context. It created a listing with the n words preceding a key word plus the n words following it. This put a burden on the reader to recognize a significant context versus a trivial one.
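
A KWIC-style listing is simple enough to sketch in a few lines of Java. This is only an illustration of the idea as described above, not a reconstruction of any historical program. The names Kwic and kwic are hypothetical, words are assumed to be separated by whitespace, and the brackets around the key word are my own choice of marker.

```java
import java.util.ArrayList;
import java.util.List;

final class Kwic {

    /** Returns one line per keyword occurrence: n words before, [keyword], n words after. */
    static List<String> kwic(String text, String keyword, int n) {
        // Assumption: words are whitespace-separated tokens.
        String[] words = text.trim().split("\\s+");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            // Strip leading and trailing punctuation before comparing.
            String bare = words[i].replaceAll("^[^\\p{L}\\p{Nd}]+|[^\\p{L}\\p{Nd}]+$", "");
            if (!bare.equalsIgnoreCase(keyword)) {
                continue;
            }
            // Clamp the context window to the bounds of the text.
            int from = Math.max(0, i - n);
            int to = Math.min(words.length, i + n + 1);
            StringBuilder line = new StringBuilder();
            for (int j = from; j < to; j++) {
                if (j > from) {
                    line.append(' ');
                }
                line.append(j == i ? "[" + words[j] + "]" : words[j]);
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog. The fox sleeps.";
        kwic(text, "fox", 2).forEach(System.out::println);
    }
}
```

Note that the window ignores sentence boundaries, so a context line can straddle two sentences. That is part of the burden on the reader mentioned above.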

This is a topic of continued interest to me, so let us know what you come up with.
Bill
 