I have been looking for a way to extract text from PDF documents using Java, and the best free solution I could find seems to be PDFBox. The tool seems pretty nice, but I am struggling to understand how it works beyond just using the included classes. It comes with a class called "PDFTextStripper" that does indeed pull the text out of a PDF, but unfortunately all the formatting is lost. My work requires the formatting to be retained when the file is read into Java, because decisions need to be made based on whether a piece of text was a title, a header, body text, etc.
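To make the requirement concrete, here is a rough sketch of the kind of decision I need to make once the formatting is available. It assumes the font size and weight of each run of text could somehow be obtained from the PDF, which is exactly the part I can't work out. `TextRun`, `Role`, `classify` and the size thresholds are all hypothetical names and values I made up for illustration; none of them come from PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

public class FormatClassifier {

    /** The categories the extracted text needs to be sorted into. */
    public enum Role { TITLE, HEADER, BODY }

    /** Hypothetical container: one run of text plus the formatting I would like to keep. */
    public static final class TextRun {
        final String text;
        final float fontSize;
        final boolean bold;

        public TextRun(String text, float fontSize, boolean bold) {
            this.text = text;
            this.fontSize = fontSize;
            this.bold = bold;
        }
    }

    /**
     * Classifies a run relative to the document's normal body font size.
     * The 1.5x threshold is an arbitrary assumption, not something taken from a real document.
     */
    public static Role classify(TextRun run, float bodySize) {
        if (run.fontSize >= bodySize * 1.5f) {
            return Role.TITLE;
        }
        if (run.fontSize > bodySize || run.bold) {
            return Role.HEADER;
        }
        return Role.BODY;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for what would be read from the PDF.
        List<TextRun> runs = new ArrayList<>();
        runs.add(new TextRun("Annual Report", 24f, true));
        runs.add(new TextRun("1. Introduction", 14f, true));
        runs.add(new TextRun("This year we...", 12f, false));

        for (TextRun run : runs) {
            System.out.println(classify(run, 12f) + ": " + run.text);
        }
    }
}
```

So the classification itself is the easy part. What I am missing is how to get the font size and style for each piece of text out of PDFBox in the first place.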
I did of course check the SourceForge forums for the PDFBox project, but they do not appear to have been enabled.
I would greatly appreciate it if someone who is familiar with the tool, or just a Java guru, could explain how I can extract the text and retain the formatting, as I can't get my head around it. Unfortunately it's not as simple as:
Thanks for any help, guys.