I am going to develop software that searches Word documents (.doc, .docx) and Excel files (.xls, .xlsx) for keywords. (These files contain information about people.) I have found Apache Solr, which is used for keyword searching.
The situation is that the user will enter some criteria such as a name, phone number, etc., and the software should return the list of files that contain those keywords.
Can anything else be used for the same situation?
It's hard to make a recommendation without knowing the specifics, but I don't see any benefit in involving a DB, especially as you'd still need to index and search. So I'd recommend using Lucene and Tika. Thousands of documents is a small number that Lucene can easily handle.
Thank you so much for the advice. I have downloaded the JAR for Apache Tika and integrated it into my Eclipse project. I didn't find a proper example of parsing and searching a Word document.
I also didn't understand what purpose Lucene serves.
If you search for "Lucene introduction" you're bound to find much useful material.
The Lucene download also comes with code examples that demonstrate the mechanics of indexing and searching, described right here. I would expect it to take a couple of hours of reading and experimenting with those in order to gain a basic understanding of how Lucene operates.
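To give a feel for those mechanics, here is a minimal sketch of indexing and searching in the style of the bundled demo. It assumes a Lucene 5+ API (the API changes between major versions), and the field names "filename" and "contents" are my own choice, not anything Lucene prescribes:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("index"));
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Indexing: add one document with a stored filename and a searchable body.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("filename", "people.docx", Field.Store.YES));
            doc.add(new TextField("contents", "John Smith 555-1234", Field.Store.NO));
            writer.addDocument(doc);
        }

        // Searching: parse a query against the "contents" field and print hits.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("contents", analyzer).parse("smith");
            TopDocs hits = searcher.search(query, 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("filename"));
            }
        }
    }
}
```

Note that StandardAnalyzer lowercases terms at index time, which is why the query above searches for "smith" rather than "Smith".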
Hi, and again thanks for the link and the help.
So far I have learnt that Apache Tika is used to parse the files, and Apache Lucene is used to index the files and to search them.
Am I going in the right direction?
I parsed a Word file through Tika and read up on Lucene indexing and searching.
But does Apache Tika give the parsed data in a file format?
It is not for anyone here to tell you how to architect your solution. There may be aspects to it that make creating an intermediate file a good approach. It sounds like a rather roundabout way to me, though. What problem do you see with using the Lucene API directly instead of creating an intermediate file - which you would have to index then anyway?
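For illustration, using the Lucene API directly could look something like the following sketch: Tika's extracted text goes straight into a Lucene field, with no intermediate file. The helper method and field names are hypothetical; the Tika facade class's `parseToString` auto-detects the format, so it handles .doc, .docx, .xls, and .xlsx alike:

```java
import java.nio.file.Path;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.Tika;

public class TikaToLucene {
    // Hypothetical helper: extract text with Tika, then hand it to Lucene directly.
    static void indexFile(IndexWriter writer, Path file) throws Exception {
        Tika tika = new Tika();                          // facade; auto-detects the file type
        String text = tika.parseToString(file.toFile()); // extracted plain text as a String
        Document doc = new Document();
        doc.add(new TextField("filename", file.getFileName().toString(), Field.Store.YES));
        doc.add(new TextField("contents", text, Field.Store.NO));
        writer.addDocument(doc);
    }
}
```

Called once per Office file, this indexes everything in one pass without ever writing the extracted text to disk.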
Starting from the beginning:
I read the links that you gave for Apache Lucene. Following them, I set the CLASSPATH for the required JARs.
Then I made a folder in Lucene called "src" and placed two .doc files and an Excel file in it.
Then I continued with the tutorial and indexed all three files from the command prompt. Then I searched for a keyword that occurs in all three files.
But Lucene shows the result for that keyword in the Excel file only (not in the .doc files).
And some keywords that do exist in the .doc files are not shown in the search results when I search for them through Lucene.
So is this the right way?
Where should I use Apache Tika in this process? Should I parse the files in the src folder through Tika?
I am actually not seeing how to use the two things (Apache Tika and Apache Lucene) together. I have made an individual demo of each.
Ulf Dittmer wrote:Yes, you would use Tika to index structured file formats like the Office formats. As you have found out, Lucene without Tika can't be used to index those, it can only index text files.
Thank you so much for your help. I will parse the files with Tika and pass the resulting String into Lucene.
I have done some exercises on the things discussed above.
Lucene's index demo class takes a directory path as its argument and then indexes the files in it.
So with the help of Tika I have parsed the Office files and created a separate text file for the String each one returns (one text file per Office file). Then I will place those into the directory that Lucene indexes.
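The intermediate-file step described above could be sketched roughly like this (the "src" and "txt" directory names follow the thread; everything else is an assumption, and `Files.writeString` needs Java 11+):

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.tika.Tika;

public class ExtractToText {
    public static void main(String[] args) throws Exception {
        Path srcDir = Paths.get("src"); // the Office files live here
        Path outDir = Paths.get("txt"); // plain-text copies for Lucene's demo indexer
        Files.createDirectories(outDir);
        Tika tika = new Tika();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(srcDir)) {
            for (Path file : files) {
                String text = tika.parseToString(file.toFile());
                // one text file per Office file, e.g. people.docx -> people.docx.txt
                Path out = outDir.resolve(file.getFileName() + ".txt");
                Files.writeString(out, text);
            }
        }
    }
}
```

Bear in mind the earlier point in this thread still applies: the extra files work, but passing Tika's String straight into a Lucene field skips this step entirely.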