Extracting text with formatting using PDFBox

Hi guys,

I have been looking for a way to extract text from PDF documents using Java, and the best (free) solution I could find seems to be PDFBox. The tool does seem pretty nice, but I am struggling to understand how it really works beyond just using the included classes. It comes with a class called "PDFTextStripper" that does indeed pull the text out of a PDF, but unfortunately all the formatting is lost. For my work the formatting has to be retained as the file is read into Java, because decisions need to be made based on whether the text was a title, header, body text, etc.

I did of course check the SourceForge forums for the PDFBox project, but they do not appear to have been enabled.

I would greatly appreciate it if someone who is familiar with the tool, or just a Java guru, could explain how I can extract the text and retain the formatting, as I can't get my head around it. Unfortunately it's not as simple as:

Thanks for any help guys.
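
To make the title/header/body decision concrete, here is a minimal sketch of just the classification step, written in plain Java with no PDFBox dependency. Everything in it is an assumption for illustration: the `FormatClassifier` class, the `Run` holder, the `Role` enum, and the thresholds (1.5x the body size for a title, "bold" in the font name for a header) are hypothetical and not part of PDFBox. In PDFBox 2.x the font name and size for each piece of text can, as far as I know, be collected by subclassing `PDFTextStripper`, overriding `writeString(String, List<TextPosition>)`, and reading `getFont().getName()` and `getFontSizeInPt()` from each `TextPosition`; please check that against the version you are using.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class FormatClassifier {

    /** Hypothetical holder: one run of text plus the font data read from the PDF. */
    public static final class Run {
        public final String text;
        public final String fontName;
        public final float fontSize;

        public Run(String text, String fontName, float fontSize) {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
        }
    }

    public enum Role { TITLE, HEADER, BODY }

    /** Guess the body size: the font size that covers the most characters. */
    public static float bodySize(List<Run> runs) {
        Map<Float, Integer> charsPerSize = new HashMap<>();
        for (Run r : runs) {
            charsPerSize.merge(r.fontSize, r.text.length(), Integer::sum);
        }
        float best = 0f;
        int bestCount = -1;
        for (Map.Entry<Float, Integer> e : charsPerSize.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Assumed heuristic: much larger than body = title; larger or bold = header. */
    public static Role classify(Run run, float bodySize) {
        boolean bold = run.fontName.toLowerCase(Locale.ROOT).contains("bold");
        if (run.fontSize >= bodySize * 1.5f) {
            return Role.TITLE;
        }
        if (run.fontSize > bodySize || bold) {
            return Role.HEADER;
        }
        return Role.BODY;
    }

    public static void main(String[] args) {
        // Sample data standing in for what a PDFTextStripper subclass would collect.
        List<Run> runs = new ArrayList<>();
        runs.add(new Run("Annual Report", "Helvetica-Bold", 24f));
        runs.add(new Run("1. Overview", "Helvetica-Bold", 12f));
        runs.add(new Run("This year the project shipped two releases.", "Helvetica", 12f));

        float body = bodySize(runs);
        for (Run r : runs) {
            System.out.println(classify(r, body) + ": " + r.text);
        }
    }
}
```

The thresholds are deliberately crude; real documents will probably need tuning per layout, and font names do not always contain the word "bold" even when the glyphs are bold.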