
Need some help on searching keywords from doc and xls

 
Ishan Pandya
Hello all,

I am going to develop a piece of software based on searching keywords in Word document files (doc, docx) and Excel files (xls, xlsx); these files contain information about people. I have found Apache Solr, which is used for keyword searching.

The situation is that the user will enter some criteria such as name, phone number, etc., and the software should return the list of files containing those keywords.

Is there anything else that can be used for the same situation?

thanks
 
Ulf Dittmer
Solr is a standalone server - is that what you want? If not, a combination of Apache Lucene (for indexing and searching) and Apache Tika (for helping Lucene index Office files) might be the way to go.
 
Ishan Pandya
Thank you for responding so quickly.

Actually, I have to integrate it into a Java web app, and I have to search on more than one criterion (maybe up to 50 criteria) across more than a thousand documents.

We have thought of implementing it in two ways:

1) Saving the data from the Office files in a database and then searching on that.
2) Using Apache Lucene and Tika.

Which do you think is faster and more efficient?

Thanks.
 
Ulf Dittmer
It's hard to make a recommendation without knowing the specifics, but I don't see any benefit in involving a DB, especially as you'd still need to index and search. So I'd recommend using Lucene and Tika. Thousands of documents is a small number that Lucene can easily handle.
 
Ishan Pandya
Thank you so much for the advice. I have downloaded the JAR for Apache Tika and integrated it into my Eclipse project. I didn't find a proper example of parsing and searching a Word document,
and I also don't understand what purpose Lucene is used for.
 
Ulf Dittmer
If you search for "Lucene introduction" you're bound to find much useful material.

The Lucene download also comes with code examples that demonstrate the mechanics of indexing and searching, described right here. I would expect it to take a couple of hours of reading and experimenting with those in order to gain a basic understanding of how Lucene operates.

For Tika, some starting code can be found at http://tika.apache.org/1.4/parser_guide.html
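
As a concrete illustration, here is a minimal sketch of that starting code using the simple Tika facade class rather than the AutoDetectParser shown in the guide. The file name people.docx is only a placeholder, and tika-core plus tika-parsers need to be on the classpath:

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaParseDemo {
    public static void main(String[] args) throws IOException, TikaException {
        // The Tika facade auto-detects the file type (doc, docx, xls, xlsx, ...)
        // and extracts its textual content as a plain String.
        Tika tika = new Tika();
        String text = tika.parseToString(new File("people.docx"));
        System.out.println(text);
    }
}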
 
Ishan Pandya
Hi again, thanks for the link and the help.
So far I have learnt that Apache Tika is used to parse the files, and Apache Lucene is used to index the files and do the searching.
Am I going in the right direction?

I parsed a Word file through Tika and read something about Lucene indexing and searching.

But does Apache Tika give the parsed data as a file?
 
Ulf Dittmer
No, it produces strings that you would feed into Lucene's indexing API.
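
In code, that hand-off could look roughly like the following sketch. It assumes a recent Lucene release (5.x or newer; older 4.x versions use slightly different constructors), and the file name and the field names "path" and "contents" are just example names:

import java.io.File;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.tika.Tika;

public class IndexOneFile {
    public static void main(String[] args) throws Exception {
        File file = new File("people.docx");           // placeholder Office file
        String text = new Tika().parseToString(file);  // Tika turns it into a String

        Directory indexDir = FSDirectory.open(Paths.get("lucene-index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(indexDir, config)) {
            Document doc = new Document();
            // Store the path so search results can point back to the original file.
            doc.add(new StringField("path", file.getAbsolutePath(), Field.Store.YES));
            // Index (but don't store) the extracted text for keyword searching.
            doc.add(new TextField("contents", text, Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}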
 
Ishan Pandya
I have indexed two doc files and one xls file in Lucene. Some keywords are found and some are not.
I have also parsed a document through Tika, which returns a String.

Now should I create a text file from that String, index it in Lucene, and then search for keywords?
 
Ulf Dittmer
It is not for anyone here to tell you how to architect your solution. There may be aspects to it that make creating an intermediate file a good approach. It sounds like a rather roundabout way to me, though. What problem do you see with using the Lucene API directly instead of creating an intermediate file - which you would then have to index anyway?
 
Ishan Pandya
Starting from the beginning:
I read the links that you gave for Apache Lucene. Following them, I set up the CLASSPATH for the required JARs.
Then I made a folder called "src" in the Lucene directory and placed two doc files and an Excel file in it.

Then I continued with the tutorial, indexed all three files from the command prompt, and searched for a keyword that appears in all three files.
But Lucene shows a result for that keyword in the Excel file only (not in the doc files).
And some keywords which exist in the doc files are not shown in the search results when I search for them through Lucene.

So is this the right way?
Where should I use Apache Tika in this process? Should I parse the files in the src folder through Tika?
I am not really seeing how I can use the two things (Apache Tika and Apache Lucene) together. I have made individual demos for both.

Please help.

Thank you.
 
Ulf Dittmer
Yes, you would use Tika to index structured file formats like the Office formats. As you have found out, Lucene without Tika can't be used to index those, it can only index text files.
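
Putting the two together for a whole folder of Office files might look like this sketch (same Lucene version assumptions and example field names as above; the folder name src-docs is hypothetical):

import java.io.File;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.tika.Tika;

public class IndexFolder {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("lucene-index")), config)) {
            for (File f : new File("src-docs").listFiles()) {
                String text = tika.parseToString(f);   // works for doc, docx, xls, xlsx, ...
                Document doc = new Document();
                doc.add(new StringField("path", f.getAbsolutePath(), Field.Store.YES));
                doc.add(new TextField("contents", text, Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }
}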
 
Ishan Pandya
Ulf Dittmer wrote:Yes, you would use Tika to index structured file formats like the Office formats. As you have found out, Lucene without Tika can't be used to index those, it can only index text files.


Thank you so much for your help. I will parse the files with Tika and pass the resulting String into Lucene.
Thanks again.
 
Ishan Pandya
One last question.

I have done some exercises on the things discussed above.
The indexing class of Lucene takes a directory path as its argument and then indexes the files available in it.

So with the help of Tika I have parsed the (Office) files and created a separate text file from the String each one returns (one text file for every Office file). Then I will place those text files into the directory which Lucene indexes.

Am I going about this the right way?

thanks
 
Ulf Dittmer
As I said before, I think it's needlessly complicated to create an intermediate file instead of feeding the text directly into the indexing code.
 
Ishan Pandya
You mean passing the String directly into this method's "String article" argument.



right?
 
Ulf Dittmer
If that is the method that uses the Lucene API to add Documents to the index, then - yes.
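
To close the loop, searching the resulting index for the user's criteria could then look like this sketch (again assuming a recent Lucene release and the example field names used above; the query string is only an illustration, and several criteria can be combined with AND/OR in one query):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class SearchIndex {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("lucene-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
            // Multiple criteria (a name, a phone number, ...) combined in one query string.
            Query query = parser.parse("\"John Smith\" AND \"555 1234\"");
            TopDocs hits = searcher.search(query, 50);
            for (ScoreDoc sd : hits.scoreDocs) {
                Document doc = searcher.doc(sd.doc);
                System.out.println(doc.get("path"));   // file that contains the keywords
            }
        }
    }
}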
 