getting problem while indexing pdf files with pdfbox with lucene

 
Ranch Hand
Posts: 89
Hi all,

I am able to convert a PDF into text using PDFBox.
This is the code I used, but I am not able to index the result.

// Code for parsing the PDF and building the index

public class TestPDFParser {

    public Document getDocument(InputStream is) {
        COSDocument cosDoc = null;
        try {
            PDFParser parser = new PDFParser(is);
            parser.parse();
            cosDoc = parser.getDocument();
        } catch (IOException e) {
            e.printStackTrace();
        }
        String docText = null;
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            docText = stripper.getText(new PDDocument(cosDoc));
        } catch (IOException e) {
            e.printStackTrace();
        }
        Document doc = new Document();
        if (docText != null) {
            doc.add(new Field("body", docText, Field.Store.YES, Field.Index.TOKENIZED));
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        TestPDFParser handler = new TestPDFParser();

        Document doc = handler.getDocument(new FileInputStream(new File("D:\\lucenePdf\\DRra0026.pdf")));

        System.out.println(doc);

        // The following code builds the index

        IndexWriter f_writer = new IndexWriter("D:\\lucenePdf", new StandardAnalyzer(), true);

        f_writer.addDocument(doc);

    }
}
// Code for searching for a particular string

public static void main(String[] args) throws Exception {
    String indexDir = "D:\\lucenePdf";
    String q = "RA0083";

    Directory fsDir = FSDirectory.getDirectory(indexDir);
    IndexSearcher is = new IndexSearcher(fsDir);

    Query query = new QueryParser("body", new StandardAnalyzer()).parse(q);

    Hits hits = is.search(query);
    System.out.println("Found " + hits.length() + " documents that matched query '" + q + "':");
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
    }
}


When I run the above code, I get the following output from the indexer class:

Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT : RA0083
000099062000062000000021000000100220468148001102006PAYOUT : RA0083
000099062000063000000021000000100330468153601102006PAYOUT : RA0083
000099062000064700000021000000100440468155401102006PAYOUT : RA0083
000099062000065700000021000000100550468156201102006PAYOUT : RA0083

and the following files are generated in the specified path:

segments.gen
write.lock
segments_4


But when I run the search class, the result is:

Found 0 documents that matched query 'RA0083':


It seems the index is not being created.
Please help me with your input; it would be very helpful.
 
Greenhorn
Posts: 24
I would first download Luke from the Lucene website. You can inspect your index with it and check whether the string you are searching for was actually indexed in that form.

Also, PDDocument may be a bit easier to work with than the COS classes.
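
Short of opening the index in Luke, a cruder first check is to look at the file listing posted in the question. The sketch below is plain Java with no Lucene calls; the class name `IndexDirCheck` and its two helper methods are made up for illustration. It assumes the index uses the compound file format, so a finished segment shows up as a `.cfs` file, and it treats a `write.lock` that is still present after the indexer has exited as a hint (not proof) that the writer may not have been shut down cleanly.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class IndexDirCheck {

    /** True if any file name in the listing looks like a compound segment file (.cfs). */
    public static boolean hasCompoundSegment(List<String> fileNames) {
        for (String name : fileNames) {
            if (name.endsWith(".cfs")) {
                return true;
            }
        }
        return false;
    }

    /** True if a write.lock file appears in the listing. */
    public static boolean hasWriteLock(List<String> fileNames) {
        return fileNames.contains("write.lock");
    }

    public static void main(String[] args) {
        // Default path is the one from the question; pass another path as the first argument.
        File dir = new File(args.length > 0 ? args[0] : "D:\\lucenePdf");
        String[] names = dir.list(); // null if the path is not a readable directory
        List<String> files = (names == null)
                ? Collections.<String>emptyList()
                : Arrays.asList(names);

        System.out.println("files: " + files);
        System.out.println("compound segment present: " + hasCompoundSegment(files));
        System.out.println("write.lock present: " + hasWriteLock(files));
    }
}
```

Run against the listing in the question (segments.gen, write.lock, segments_4), this reports no compound segment and a lock file still present, which points at the indexing side rather than the query.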
 
neetika sharma
Ranch Hand
Posts: 89
Thanks, Kail, for the suggestion to use PDDocument.

I am now getting the correct results.
The problem was that I forgot to close the writer, so the index file (.cfs) was not being generated.

Thanks
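
For anyone who lands here with the same symptom, the effect of a missing close() can be reproduced without Lucene. The sketch below is only an analogy using java.io: a buffered writer holds text in memory until it is closed, much as the poster's `f_writer` held the document until `close()` was called. The class and method names are hypothetical. It uses try-with-resources (Java 7+); with the older Lucene API shown in the question, a plain try/finally that calls `f_writer.close()` after `addDocument(doc)` would play the same role.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CloseWriterDemo {

    /** Writes text but never closes the writer; returns the file size on disk. */
    public static long writeWithoutClose(String text) {
        try {
            Path file = Files.createTempFile("no-close", ".txt");
            file.toFile().deleteOnExit();
            // Deliberately leaked to mirror the bug: no close(), no flush().
            BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
            writer.write(text);      // short text stays in the in-memory buffer
            return Files.size(file); // measured while the text is still buffered
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes text and closes the writer; returns the file size on disk. */
    public static long writeWithClose(String text) {
        try {
            Path file = Files.createTempFile("with-close", ".txt");
            file.toFile().deleteOnExit();
            try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                writer.write(text);
            } // close() runs here and flushes the buffer to disk
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String sample = "PAYOUT : RA0083"; // sample text taken from the output in the question
        System.out.println("without close: " + writeWithoutClose(sample) + " bytes");
        System.out.println("with close:    " + writeWithClose(sample) + " bytes");
    }
}
```

The first call finds an empty file and the second finds the full text, which is the same pattern as an index directory that contains segments files but no document data.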
 