I have 5000 PDFs, and an identifier (let's say an account number) is embedded on page 2 of each PDF. I need to read the account number from page 2 of every PDF efficiently, so I used the iText API to scan page 2 and extract it. With 5000 files the performance is very bad: it took more than 5 hours to read the account numbers from all of them, presumably because of the cost of opening, scanning, and reading the identifier from each of the 5000 files.
I need ideas that could help me resolve this performance issue.
Well the obvious solution would be to put the account number somewhere else where you CAN easily read it.
An 'index' of sorts, so that given an account number, you can identify the correct PDF without having to open and read it every time.
You will of course have to read it at least once :-)
Maybe a database table. Maybe rename the PDF file with the account number in some way.
You end up doing a lot of work one time to set up the index, but thereafter it should be efficient to locate the documents by account number at least.
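To make the index idea concrete, here is a minimal sketch in plain Java that saves an account-number-to-file mapping as a simple text file and loads it back. The class name `AccountIndex`, the comma-separated file layout, and the method names are all my own assumptions for illustration; a database table would do the same job, and the sketch assumes account numbers contain no commas.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: persists "accountNumber,pdfPath" lines so each PDF
// only has to be opened once, when the index is first built.
class AccountIndex {

    // Writes one "accountNumber,pdfPath" line per entry.
    static void save(Map<String, Path> index, Path indexFile) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Path> entry : index.entrySet()) {
            lines.add(entry.getKey() + "," + entry.getValue());
        }
        try {
            Files.write(indexFile, lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file back into a map; lines without a comma are skipped.
    static Map<String, Path> load(Path indexFile) {
        Map<String, Path> index = new LinkedHashMap<>();
        try {
            for (String line : Files.readAllLines(indexFile)) {
                int comma = line.indexOf(',');
                if (comma > 0) {
                    index.put(line.substring(0, comma),
                              Paths.get(line.substring(comma + 1)));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return index;
    }
}
```

After the one-time scan fills the map and calls `save`, later lookups only need `load(indexFile).get(accountNumber)` and never touch the PDFs.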
If the identifier is always found at predictable coordinates on page 2, you can try extracting only a portion of the text from a small rectangular region around those coordinates.
I don't really know if it'll help with performance, but I guess you can try a quick prototype and measure timing.
Something like this:
The rectangle coordinates can be calculated by taking the page size (such as 8.5"x11") and converting it into points, the unit PDF uses (by default 1" = 72 points, so a Letter page is 612 x 792).
The page size is usually available in the PDF itself - just open it with any PDF viewer and check the document properties. In PDF's default coordinate system 0,0 is the lower left corner of the page, but some extraction APIs measure from the upper left instead, so confirm it experimentally with your library.
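The arithmetic for the rectangle could look roughly like this. It is only a sketch of the unit conversion, not library code: the class `PdfRegion` and its methods are hypothetical names, and it assumes 72 units per inch and a lower-left origin for the result, with the box measured in inches from the top-left corner the way you would with a ruler on a printout.

```java
// Hypothetical helper for turning ruler measurements into PDF coordinates.
class PdfRegion {

    // Assumption: default PDF user space, 72 units per inch.
    static final double POINTS_PER_INCH = 72.0;

    static double inchesToPoints(double inches) {
        return inches * POINTS_PER_INCH;
    }

    // Takes a box measured in inches from the page's top-left corner and
    // returns {llx, lly, urx, ury} in points with a lower-left origin.
    static double[] fromTopLeftInches(double pageHeightIn, double leftIn,
                                      double topIn, double widthIn,
                                      double heightIn) {
        double llx = inchesToPoints(leftIn);
        double ury = inchesToPoints(pageHeightIn - topIn);
        double urx = llx + inchesToPoints(widthIn);
        double lly = ury - inchesToPoints(heightIn);
        return new double[] {llx, lly, urx, ury};
    }
}
```

For example, on an 11" tall page, a box 1" from the left, 1" from the top, 2" wide and 0.5" high comes out as (72, 684) to (216, 720). If your extraction API measures from the upper left, skip the flip and use the top-left values directly.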
PDFBox also has a similar and somewhat simpler rectangular extraction API (PDFTextStripperByArea), and in fact I prefer PDFBox to iText for this kind of extraction. You can prototype that too and see if it helps.
As for general performance tips, try to multithread so that every CPU core is processing a PDF concurrently.
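A rough sketch of the multithreading idea, using only the standard `java.util.concurrent` classes. The class `ParallelExtractor` and the `extractor` parameter are placeholders I made up; the function you pass in would wrap whatever iText or PDFBox call reads the account number from one file, and it must be safe to call from several threads at once.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs one extraction task per PDF on a fixed pool.
class ParallelExtractor {

    static Map<Path, String> extractAll(List<Path> pdfs,
                                        Function<Path, String> extractor) {
        // One worker thread per available CPU core.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Path pdf : pdfs) {
                Callable<String> task = () -> extractor.apply(pdf);
                futures.add(pool.submit(task));
            }
            // Collect results in the same order the files were submitted.
            Map<Path, String> results = new LinkedHashMap<>();
            for (int i = 0; i < pdfs.size(); i++) {
                results.put(pdfs.get(i), futures.get(i).get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while extracting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Extraction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

If the files sit on a slow disk or network share, the cores may end up waiting on I/O, so it is worth timing a few pool sizes rather than assuming one thread per core is best.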
posted 4 years ago
Karthik, yes, the identifier position is constant. Let me try PDFBox. If the performance is better we can use it; otherwise indexing is the last option.
Once again, thank you both for the replies.
SCJP , SCWCD , TOGAF 9 Part1 & Part2