Jina Lu wrote: Thanks for reply, Ulf Dittmer, parsing result is String which I output to text file using commons-io: FileUtils.writeStringToFile(new File("resulr.txt"), text);
But I debugged and I see corrupted characters in String.
I think Ulf is right: the problem you're trying to solve, or more specifically the level of detail you're trying to solve it at, may be the issue. PDF was a proprietary format until 2008, and although it has since been published as a standard, ISO 32000 (you can find a copy here), that doesn't mean that Adobe have made any effort to make it "user-friendly" (it's 756 pages, just in case you're interested).
One thing that seems clear though (if you look at sections 9.1 and 9.7 of the spec) is that processing - especially of the 'Tj' operator - is very different if the font is not one of the 14 "standard" fonts, so it's quite possible that the libraries you mention only provide limited text extraction capability; you'd have to read their manuals to find out. You can certainly store non-ASCII characters in a Java String though, so the String type itself isn't what's corrupting your text.
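To rule out the String and the file write as the culprits, here is a minimal sketch using only the JDK (no PDF library or commons-io involved). The class name Utf8RoundTrip and the sample text are my own inventions for illustration; it simply writes a String containing non-ASCII characters to a temporary file as UTF-8 and reads it back.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of any PDF library: it only shows that
// a Java String survives a file round trip when the charset is explicit.
public class Utf8RoundTrip {

    // Writes the text to a temporary file as UTF-8, reads it back, returns the result.
    static String roundTrip(String text) throws IOException {
        Path tmp = Files.createTempFile("roundtrip", ".txt");
        try {
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep the source file itself pure ASCII.
        String text = "Gr\u00fc\u00dfe, \u65e5\u672c\u8a9e";
        System.out.println(text.equals(roundTrip(text))); // prints "true"
    }
}
```

If a round trip like this works on your machine, then the damage is happening earlier, during extraction from the PDF. As a side note, commons-io also has a writeStringToFile overload that takes a Charset as a third argument, which avoids depending on the platform default encoding when you write your result file.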
Another possibility might be to convert the document to something like Lucene or OpenOffice and then use that tool to extract the information you want. Both have extensive Java libraries for parsing their own doc formats and one would assume, since they're in the business of document processing, that they've made their conversion utilities as comprehensive as they can. How far they will translate security information though, I have no idea - Adobe may still protect that sort of stuff.
Best of luck.
Winston