Read PDF content on Specific page

 
Greenhorn
Posts: 27
Hello All,

I have 5000 PDFs, and an identifier (let's say an account number) is embedded on page 2 of each PDF. I have to read the account number from page 2 of every PDF efficiently, so I used the iText API to scan page 2 and extract the account number. Since there are 5000 PDF files, the performance is very bad: it took more than 5 hours to read the account numbers from all of them, obviously because of the operational cost of opening and scanning each PDF to read the identifier.

I need help with ideas that could resolve this performance issue.

Thanks in advance
 
Bartender
Posts: 1845
Well, the obvious solution would be to put the account number somewhere else where you CAN easily read it.
An 'index' of sorts, so that given an account number, you can identify the correct PDF without having to open and read it every time.
You will of course have to read it at least once :-)

Maybe a database table. Or maybe rename each PDF file so that its name includes the account number.

You end up doing a lot of work one time to set up the index, but thereafter it should be efficient to locate the documents by account number at least.
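
A minimal sketch of the index idea, using only the JDK. It scans the folder once, stores account number -> file in a tab-separated text file, and reloads that file for later lookups. The names `AccountIndex` and `extractAccountNumber` are made up for illustration; the extractor is a placeholder for whatever iText or PDFBox code actually reads page 2, and a database table would work the same way.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Stream;

// Hypothetical helper: builds and persists an account-number -> PDF index.
class AccountIndex {

    // Scans pdfDir once; extractAccountNumber is a stand-in for the real
    // (slow) PDF parsing code and is called exactly once per PDF file.
    static Map<String, Path> build(Path pdfDir, Function<Path, String> extractAccountNumber)
            throws IOException {
        Map<String, Path> index = new TreeMap<>();
        try (Stream<Path> files = Files.list(pdfDir)) {
            files.filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                 .forEach(p -> index.put(extractAccountNumber.apply(p), p));
        }
        return index;
    }

    // Writes one "accountNumber<TAB>path" line per entry.
    static void save(Map<String, Path> index, Path indexFile) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Path> e : index.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        Files.write(indexFile, lines);
    }

    // Reloads the index without opening any PDF.
    static Map<String, Path> load(Path indexFile) throws IOException {
        Map<String, Path> index = new TreeMap<>();
        for (String line : Files.readAllLines(indexFile)) {
            int tab = line.indexOf('\t');
            if (tab > 0) {
                index.put(line.substring(0, tab), Paths.get(line.substring(tab + 1)));
            }
        }
        return index;
    }
}
```

After the one-time `build` and `save`, finding a document is just `AccountIndex.load(indexFile).get(accountNumber)`.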
 
Bartender
Posts: 1210
If the identifier is always found at predictable coordinates on page 2, you can try extracting only a portion of the text from a small rectangular region around those coordinates.
I don't really know if it'll help with performance, but I guess you can try a quick prototype and measure timing.
Something like this:



The rectangle coordinates can be calculated by taking the page size (such as 8.5" x 11") and converting it into PDF units (points; by default 1" = 72 points).
The page size is usually available in the PDF itself: just open the file with any PDF viewer and check its properties. I don't remember right now whether 0,0 is lower left or upper left, but you can find that out experimentally.
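
To make the arithmetic concrete, here is a small sketch of the conversion. It assumes the PDF default of 72 units per inch, and an origin at the lower-left corner with y growing upward, which is the usual case for an unrotated page whose media box starts at 0,0. Some extraction APIs measure from the top instead, so check against your library. `PageRegion` and `fromTopLeftInches` are illustrative names, not part of iText or PDFBox.

```java
// Hypothetical helper: converts a box measured in inches from the top-left
// corner of the page into lower-left-origin coordinates in points.
class PageRegion {

    static final double POINTS_PER_INCH = 72.0;

    static double inchesToPoints(double inches) {
        return inches * POINTS_PER_INCH;
    }

    // Returns {llx, lly, urx, ury} in points.
    static double[] fromTopLeftInches(double pageHeightIn, double leftIn, double topIn,
                                      double widthIn, double heightIn) {
        double llx = inchesToPoints(leftIn);
        double ury = inchesToPoints(pageHeightIn - topIn); // flip y: top of the box
        double urx = llx + inchesToPoints(widthIn);
        double lly = ury - inchesToPoints(heightIn);       // bottom of the box
        return new double[] { llx, lly, urx, ury };
    }
}
```

For a Letter page (11" tall) and a box 1" from the left, 1" from the top, 3" wide and 0.5" tall, `PageRegion.fromTopLeftInches(11, 1, 1, 3, 0.5)` gives 72, 684, 288, 720. Those four numbers are what you would hand to the library's region filter.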

PDFBox also has a similar and somewhat simpler rectangular extraction API, and in fact I prefer PDFBox to iText for this kind of extraction. You can prototype that too and see if it helps.

As for general performance tips, try multithreading so that every CPU core is processing a PDF concurrently.
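
A rough sketch of the multithreading tip with a plain `ExecutorService`, one worker thread per core. `ParallelExtractor` and `extractAccountNumber` are hypothetical names; the extractor stands in for the per-file iText or PDFBox code, and this assumes each call opens its own reader so nothing is shared between threads. Measure before and after, since disk speed may turn out to be the real limit.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs the per-PDF extraction on a fixed thread pool.
class ParallelExtractor {

    static Map<Path, String> extractAll(List<Path> pdfs,
                                        Function<Path, String> extractAccountNumber)
            throws InterruptedException, ExecutionException {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per file; the pool keeps every core busy.
            Map<Path, Future<String>> pending = new LinkedHashMap<>();
            for (Path pdf : pdfs) {
                Callable<String> task = () -> extractAccountNumber.apply(pdf);
                pending.put(pdf, pool.submit(task));
            }
            // Collect results in the original file order.
            Map<Path, String> results = new LinkedHashMap<>();
            for (Map.Entry<Path, Future<String>> e : pending.entrySet()) {
                results.put(e.getKey(), e.getValue().get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

If one file fails, `get()` rethrows the failure as an `ExecutionException`; a real run over 5000 files would probably want to catch that per file and log it rather than abort the whole batch.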

 
subodh kureel
Greenhorn
Posts: 27
Thank you.

Karthik, yes, the identifier position is constant. Let me try PDFBox. If the performance is better, we can use it; otherwise indexing is the last option.

Once again, thank you both for the replies.
 