
text parsing

 
Chris Montgomery
Ranch Hand
Posts: 141
I want to write a little "engine" which parses the text and filters out the words I don't want (like "the", "or", "and", etc.).

Ultimately, it will become a search engine and I've heard of Lucene, but my gut tells me this is overkill (definitely open to hearing opinions on this).

For now, I'm more concerned with how to efficiently obtain the useful parts of the entered text while filtering out the unwanted text.

I found one example using Google, but not much else (Java Practices: Parse text).

Is this worth using?
Should I use one class over the other to parse my details?


Thanks!
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
Lucene is relatively easy to use. I use it with another library called highlighter that formats the results. I'm a little less than thrilled with their algorithms for multi-word searches but it's certainly better than I could ever do.
 
Chris Montgomery
Ranch Hand
Posts: 141
My original post was quite vague in terms of what I would be searching (sorry)...

I will be searching data stored in my database (MySQL).

Let's say I have a table called books. In this table I have a column for bookTitle and bookDescription. My goal would be to find all books that contain the words "car", "tires" and "doors". Lucene looks to be for text found in directories/files - true?

I'm comfortable retrieving the data once it's in the db.

My biggest concern is how to parse the text when the description is originally entered. I'm looking for some efficient options that I can test. The descriptions could potentially be long, and they may be submitted quite frequently.
[ October 23, 2005: Message edited by: Chris Montgomery ]
 
Layne Lund
Ranch Hand
Posts: 3061
You could do this with a SQL statement:

I don't know how efficient this is, but you can quickly build a SQL statement with as many search words as you like. You can even make more complex searches using AND as well as OR.
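
A minimal sketch of how such a statement could be built in Java, assuming substring matching with LIKE against the bookDescription column of the books table mentioned earlier in the thread. The class and method names (LikeQueryBuilder, buildQuery, buildParams) are hypothetical, and this is only one possible shape for the query, not necessarily the statement meant above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds a parameterized query with one LIKE clause per word.
class LikeQueryBuilder {

    // joiner must be "AND" or "OR"; anything else is rejected so that no
    // untrusted text is ever concatenated into the SQL string.
    static String buildQuery(List<String> words, String joiner) {
        if (words.isEmpty()) {
            throw new IllegalArgumentException("at least one search word is required");
        }
        if (!joiner.equals("AND") && !joiner.equals("OR")) {
            throw new IllegalArgumentException("joiner must be AND or OR");
        }
        StringBuilder sql = new StringBuilder("SELECT * FROM books WHERE ");
        for (int i = 0; i < words.size(); i++) {
            if (i > 0) {
                sql.append(' ').append(joiner).append(' ');
            }
            // One placeholder per word; the value is bound later.
            sql.append("bookDescription LIKE ?");
        }
        return sql.toString();
    }

    // Values to bind to the placeholders, in order, e.g. "%car%".
    static List<String> buildParams(List<String> words) {
        List<String> params = new ArrayList<>();
        for (String word : words) {
            params.add("%" + word + "%");
        }
        return params;
    }
}
```

The resulting string is meant to be passed to Connection.prepareStatement, with each value from buildParams bound through PreparedStatement.setString. Note that a pattern with a leading % generally cannot use an ordinary index on the column, so this approach may scan every row.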

Layne
 
Chris Montgomery
Ranch Hand
Posts: 141
Actually, the approach you provided doesn't make use of your indexes (FYI). It could be quite painful if run on tables with large row counts.

As I indicated in my previous post, I'm comfortable with how to gather that data once it's in the db.

My focus is on effectively and efficiently gathering the inputted text.

Someone is going to submit a text description that could potentially be 5000 characters in length (or more). I want to sift through the text and take only the words I want and discard the rest.

What class is most recommended for parsing and sifting through large amounts of text?
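
One simple option to test, sketched with nothing but the standard library: lower-case the text, split it on anything that is not a letter or digit, and drop tokens found in a HashSet of unwanted words. The class name StopWordFilter and its keywords method are hypothetical, and the split pattern assumes plain ASCII English text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: keeps the words of a description that are not stop words.
class StopWordFilter {
    private final Set<String> stopWords;

    // Stop words are expected in lower case, e.g. "the", "or", "and".
    StopWordFilter(String... words) {
        stopWords = new HashSet<>(Arrays.asList(words));
    }

    // Returns the remaining words in their original order, lower-cased.
    List<String> keywords(String text) {
        List<String> kept = new ArrayList<>();
        // Assumption: ASCII text; any run of other characters is a separator.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            // split can yield an empty first token when the text starts with a separator.
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

For example, new StopWordFilter("the", "or", "and").keywords("The car and the tires, or doors!") returns [car, tires, doors]. Each HashSet lookup is constant time on average, so the cost grows roughly linearly with the length of the description.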
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
Lucene indexes Document objects. You can make a document from a String of content pretty easily. The search results will include the name & path of the document which can include a PKey to go find it in the db again. Here's how I do it in one program:


BTW: To build your own index or your own list of words to ignore, look into the Ternary Search Tree. The applet in this article takes a while to load but runs lookups against a largish dictionary like lightning.
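
For reference, a minimal sketch of a ternary search tree that could hold a list of words to ignore. This is a generic textbook version written for illustration, not the code from the article; the class name TernarySearchTree and its members are hypothetical.

```java
// Hypothetical minimal ternary search tree storing a set of words.
class TernarySearchTree {

    private static final class Node {
        final char ch;
        Node lo, eq, hi;      // smaller char, next char of the word, larger char
        boolean endOfWord;    // true if a stored word ends at this node
        Node(char ch) { this.ch = ch; }
    }

    private Node root;

    void insert(String word) {
        if (word == null || word.isEmpty()) {
            return; // empty words are not stored
        }
        root = insert(root, word, 0);
    }

    private Node insert(Node node, String word, int i) {
        char c = word.charAt(i);
        if (node == null) {
            node = new Node(c);
        }
        if (c < node.ch) {
            node.lo = insert(node.lo, word, i);
        } else if (c > node.ch) {
            node.hi = insert(node.hi, word, i);
        } else if (i < word.length() - 1) {
            node.eq = insert(node.eq, word, i + 1);
        } else {
            node.endOfWord = true;
        }
        return node;
    }

    boolean contains(String word) {
        if (word == null || word.isEmpty()) {
            return false;
        }
        Node node = root;
        int i = 0;
        while (node != null) {
            char c = word.charAt(i);
            if (c < node.ch) {
                node = node.lo;
            } else if (c > node.ch) {
                node = node.hi;
            } else if (i == word.length() - 1) {
                return node.endOfWord; // matched every character
            } else {
                node = node.eq;
                i++;
            }
        }
        return false;
    }
}
```

After inserting "the", "or" and "and", contains("the") is true while contains("th") and contains("then") are false. Inserting words in sorted order tends to produce a lopsided tree, so shuffling the list before loading it is a common precaution.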
[ October 24, 2005: Message edited by: Stan James ]
 