Reading contents of PDF in JAVA

 
Ranch Hand
Posts: 33
Hi all,
Can anyone suggest an API for reading the contents of a PDF file effectively?

Thanks in advance
Senthil Kumar.S
 
Rancher
Posts: 43081
That depends on what you mean by "effectively". The text that's part of a PDF can be extracted by libraries such as PDFBox, JPedal and PDFTextInputStream. You can find links to these on the FAQ page at http://faq.javaranch.com/java/AccessingFileFormats.

If you are talking about the layout information, then that's not possible.
 
Ranch Hand
Posts: 1970
The FAQ says "PDF is a hard-to-read format". In fact, it's not outrageously difficult to read. And it is a properly documented standard. The difficulty is that people often imagine that they will be able to "convert" a PDF file to some other format that has a different purpose.

A PDF file contains a document for display and/or printing. It contains instructions (very much like compiled PostScript) to draw lines, shade areas, write text etc, at various places on the page. It does not contain much, if any, metadata about the relationships and purposes of these lines, areas and text. In this respect, it is very different to things like HTML, Word documents, RTF etc.
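
To make that concrete, here is a rough sketch of what those drawing instructions look like and how little they tell you. It is not a real PDF parser: it assumes a tiny, uncompressed content-stream fragment where every piece of text is written as "x y Td (text) Tj" inside its own BT ... ET block with no other transformation applied. The class and method names (ContentStreamSketch, PlacedText, extract) are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy reader for a tiny, uncompressed content-stream fragment (illustration only). */
final class ContentStreamSketch {

    /** A string and the position it is drawn at; the stream records nothing more. */
    static final class PlacedText {
        final double x;
        final double y;
        final String text;

        PlacedText(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Matches "x y Td (text) Tj": move to a position, then show a literal string.
    // Assumption: no escaped parentheses, no hex strings, no compression.
    private static final Pattern MOVE_AND_SHOW = Pattern.compile(
            "(-?\\d+(?:\\.\\d+)?)\\s+(-?\\d+(?:\\.\\d+)?)\\s+Td\\s*\\(([^)]*)\\)\\s*Tj");

    static List<PlacedText> extract(String contentStream) {
        List<PlacedText> placed = new ArrayList<>();
        Matcher m = MOVE_AND_SHOW.matcher(contentStream);
        while (m.find()) {
            placed.add(new PlacedText(
                    Double.parseDouble(m.group(1)),
                    Double.parseDouble(m.group(2)),
                    m.group(3)));
        }
        return placed;
    }

    public static void main(String[] args) {
        // Two strings drawn on the same line; nothing says they are column headings.
        String stream = "BT /F1 12 Tf 72 720 Td (Name) Tj ET\n"
                      + "BT /F1 12 Tf 200 720 Td (Price) Tj ET\n";
        for (PlacedText t : extract(stream)) {
            System.out.println(t.x + " " + t.y + " " + t.text);
        }
    }
}
```

All you get back is strings with coordinates. Real files add compression, font encodings and nested transformations on top, which is why a library such as PDFBox is the sensible route for actual extraction.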

As an example, say you have a PDF file that, when displayed, shows a table of values. Nothing in the PDF file says it's a table. It's just a load of lines and text in various places. Therefore, it is nearly impossible for a general program to recognize that content as a table and convert it into, say, an OpenOffice document containing a table.
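
The best a program can do is guess. As a minimal sketch of such a guess, assume the text has already been extracted as fragments with x/y positions (the Fragment class, guessRows method and tolerance parameter below are hypothetical, not part of any library). Fragments whose vertical positions are within a tolerance of each other are treated as one "row":

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Heuristic row guessing over positioned text fragments (illustration only). */
final class RowGuessSketch {

    /** Hypothetical extracted fragment: a string and where it was drawn. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Groups fragments into rows by y position, then orders each row left to right. */
    static List<List<String>> guessRows(List<Fragment> fragments, double tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // PDF y coordinates grow upward, so the largest y is the top of the page.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y).reversed());

        List<List<Fragment>> rows = new ArrayList<>();
        double rowY = Double.NaN;
        for (Fragment f : sorted) {
            // Start a new row when this fragment is too far below the row's first fragment.
            if (rows.isEmpty() || Math.abs(f.y - rowY) > tolerance) {
                rows.add(new ArrayList<>());
                rowY = f.y;
            }
            rows.get(rows.size() - 1).add(f);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Fragment> fragments = Arrays.asList(
                new Fragment(200, 720, "Price"),
                new Fragment(72, 720, "Name"),
                new Fragment(200, 700.5, "9.99"),
                new Fragment(72, 700, "Widget"));
        System.out.println(guessRows(fragments, 2.0));
        // prints [[Name, Price], [Widget, 9.99]]
    }
}
```

This works on a tidy example and falls apart quickly on real documents: wrapped cells, merged cells, multi-column page layouts and ordinary paragraphs that happen to line up all defeat it. The tolerance value is a guess too.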

(To bartenders: I've written a number of similar replies recently. Any chance of improving the FAQ entry? I work for a company one of whose main businesses is PDF, so I'm sure I can get a nice concise entry for you.)
 
Ulf Dittmer
Rancher
Posts: 43081

I've written a number of similar replies recently. Any chance of improving the FAQ entry?


Absolutely. If you don't feel like messing with the wiki yourself, send me whatever you come up with, and I'll put it in there.
 