hi all,
i am able to convert a pdf in to a text file using pdfbox.
and this is the code that I used, but I am not able to index it
<b>// code for parsing and making index</b>
public Document getDocument(InputStream is)
{
COSDocument cosDoc = null;
try {
PDFParser parser = new PDFParser(is);
parser.parse();
cosDoc = parser.getDocument();
}
catch (IOException e) {
e.printStackTrace();
}
String docText = null;
try {
PDFTextStripper stripper = new PDFTextStripper();
docText = stripper.getText(new PDDocument(cosDoc));
}
catch (IOException e) {
e.printStackTrace();
}
Document
doc = new Document();
if (docText != null) {
doc.add(new Field("body", docText, Field.Store.YES,
Field.Index.TOKENIZED));
}
return doc;
}
public static void main(String[] args) throws Exception {
TestPDFParser handler = new TestPDFParser();
Document doc = handler.getDocument(new FileInputStream(new File("D:\\lucenePdf\\DRra0026.pdf")));
System.out.println(doc);
<b>//Following code is for making index</b>
IndexWriter f_writer = new IndexWriter("D:\\lucenePdf", new StandardAnalyzer(), true);
f_writer.addDocument(doc);
}
}
<b> //code for searching a particular string..</b>
public static void main(String[] args) throws Exception {
String indexDir = "D:\\lucenePdf";
String q = "RA0083";
Directory fsDir = FSDirectory.getDirectory(indexDir);
IndexSearcher is = new IndexSearcher(fsDir);
Query query = new QueryParser("body", new StandardAnalyzer()).parse(q);
Hits hits = is.search(query);
System.out.println("Found " + hits.length() + " documents that matched query '" + q + "':");
for (int i = 0; i < hits.length(); i++) {
Document doc = hits.doc(i);
}
}
<b>When I run the above code...I get folowing output as a result of running indexer class</b>
Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT : RA0083
000099062000062000000021000000100220468148001102006PAYOUT : RA0083
000099062000063000000021000000100330468153601102006PAYOUT : RA0083
000099062000064700000021000000100440468155401102006PAYOUT : RA0083
000099062000065700000021000000100550468156201102006PAYOUT : RA0083
<b>and following files are generated in the specified path..</b>
segments.gen
write.lock
segments_4
<b>but when I run the search class it gives the result as:</b>
<b>Found 0 documents that matched query 'RA0083':</b>
It seems as the index is not getting created..
Please help me with some of your inputs,it will be very helpfull for me.