PDF parsing

 
Greenhorn
Posts: 26
Eclipse IDE Java
Hi,
I searched the forum and googled, but couldn't find an answer.
For my current project I need a library for PDF parsing. I need to extract text, images, bookmarks, annotations and security information. I tried PDFBox and iText, but both seem to have problems with custom font encodings: non-English characters come out corrupted.
There is no standard template or authoring tool for the PDFs my program receives, so this problem is quite important.
Can you recommend a library that would solve this?
 
Java Cowboy
Posts: 16084
88
Android Scala IntelliJ IDE Spring Java

Jina Lu wrote: I tried PDFBox and iText, but both seem to have problems with custom font encodings: non-English characters come out corrupted.


Are you sure that is because of bugs or lack of support in those libraries, or was it just a bug in how your program handled and displayed the data?
 
Jina Lu
Greenhorn
Posts: 26
Eclipse IDE Java
Jesper, it is surely related to encoding, but it might also be that I'm missing something in my code that would fix it. I tried different PDF files. If the encoding is Identity-H, ANSI, or anything else that is not custom, I get correct output.

PDFBOX:


iText:
 
Rancher
Posts: 43081
77
Offhand I would also assume that the problem with non-ASCII text is not intrinsic to the libraries you're using. You're not printing the characters to a console, terminal or flat file that only supports ASCII, right?

But more importantly, I don't think there's an easy solution, and certainly not an easy free solution, to the underlying problem. If the two libraries you mention can extract all you need, great, but if not then it gets a lot harder. PDF-Renderer can display PDFs, so obviously it knows a lot about PDF internals; maybe that could be a starting point.
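
To illustrate the point about ASCII-only output: here is a minimal sketch, using only the JDK, of what happens when text is encoded with a charset that cannot represent it. The class and method names are made up for the example, and the sample characters are arbitrary.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // Encode the text with the given charset, then decode the bytes with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Three non-ASCII letters, written as escapes so the source-file encoding does not matter.
        String text = "\u0105\u017E\u0161";

        // UTF-8 can represent every character, so the text survives unchanged.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true

        // US-ASCII cannot, so getBytes() substitutes '?' for each unmappable character.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII)); // ???
    }
}
```

If the output file or console uses a charset like this, the text will look corrupted even when the String itself is fine. Writing the file with an explicit encoding such as UTF-8 rules this out; commons-io also has overloads of writeStringToFile that take an encoding argument.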
 
Jina Lu
Greenhorn
Posts: 26
Eclipse IDE Java
Thanks for the reply, Ulf Dittmer. The parsing result is a String, which I write to a text file using commons-io: FileUtils.writeStringToFile(new File("resulr.txt"), text);
But I debugged it, and I already see corrupted characters in the String.
 
Ulf Dittmer
Rancher
Posts: 43081
77

Jina Lu wrote: But I debugged it, and I already see corrupted characters in the String.


How, exactly, did you do that?
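
One way to check this that does not depend on the font or encoding of the debugger, console or text file is to print the numeric code points of the String. The following is only a sketch: the class name is made up, and the sample String stands in for whatever the PDF library actually returned.

```java
class CodePointDump {
    // Render each code point of the text as U+XXXX, separated by spaces.
    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the text extracted from the PDF.
        String extracted = "\u0105\u017E";
        System.out.println(dump(extracted)); // U+0105 U+017E
    }
}
```

If the dump shows the expected code points, the String is fine and the corruption happens later, when it is displayed or written. If it shows U+FFFD, U+003F ('?') or unrelated characters where the non-English letters should be, the text was already wrong when it came out of the library.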
 
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu

Jina Lu wrote: Thanks for the reply, Ulf Dittmer. The parsing result is a String, which I write to a text file using commons-io: FileUtils.writeStringToFile(new File("resulr.txt"), text);
But I debugged it, and I already see corrupted characters in the String.


I think Ulf is right: the problem you're trying to solve, or more specifically, the level of detail you're trying to solve it at, may be an issue. PDF was a proprietary format until 2008, and although it has since been made a published standard under ISO-32000 (you can find a copy here), that doesn't mean that Adobe have made any effort to make it "user-friendly" (it's 756 pages, just in case you're interested).

One thing that seems clear though (if you look at sections 9.1 and 9.7 in the link I provided) is that processing, especially of the 'Tj' operator, is very different if the font is not one of the 14 "standard" fonts, so it's quite possible that the libraries you mention only provide limited text extraction capability; you'd have to read their manuals to find out. You can certainly store non-ASCII characters in a Java String, though.
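
To make the encoding issue concrete: as far as I understand the specification, the bytes shown by a text operator such as 'Tj' are character codes that only become text once they are interpreted through the font's encoding. The toy sketch below illustrates that idea with an invented one-byte code table; the table, the class name and the codes are all hypothetical, and this is not how PDFBox or iText are implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class CustomEncodingSketch {
    // Decode one-byte character codes through a code-to-Unicode table.
    // Codes missing from the table become U+FFFD (the replacement character).
    static String decode(byte[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            sb.append(toUnicode.getOrDefault(b & 0xFF, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical font-specific table: the codes 1..3 were assigned arbitrarily.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x01, "\u0105"); // a with ogonek
        toUnicode.put(0x02, "\u010D"); // c with caron
        toUnicode.put(0x03, "a");

        // Character codes as they might appear in a text-showing operation.
        byte[] shown = {0x01, 0x02, 0x03};

        // With the right table, the intended text is recovered.
        String mapped = decode(shown, toUnicode);
        System.out.println(mapped.equals("\u0105\u010Da")); // true

        // Without it, treating the codes as Latin-1 yields control characters instead.
        String guessed = new String(shown, StandardCharsets.ISO_8859_1);
        System.out.println(guessed.equals(mapped)); // false
    }
}
```

A file with a standard encoding works because the extractor already knows the table. With a custom encoding, the extractor depends on mapping information stored in the PDF itself, and if that is missing or incomplete the output will be garbage no matter how the String is handled afterwards.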

Another possibility might be to convert the document to something like Lucene or OpenOffice and then use that tool to extract the information you want. Both have extensive Java libraries for parsing their own doc formats and one would assume, since they're in the business of document processing, that they've made their conversion utility as comprehensive as they can. How far it will translate security information though, I have no idea - Adobe may still protect that sort of stuff.

Best of luck.

Winston
 