
Extracting text with formatting using PDFBox

 
Mattie James
Greenhorn
Posts: 3
Hi guys,

I have been looking for a way to extract text from PDF documents using Java, and the best free solution I could find seems to be PDFBox. The tool seems pretty nice, but I am struggling to understand how it works beyond just using the included classes. It comes with a class called "PDFTextStripper" that does indeed extract text from a PDF, but unfortunately all the formatting is lost. My work requires the formatting to be retained when the file is read into Java, because decisions need to be made based on whether the text was a title, header, body text, etc.
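To make the goal concrete, here is a rough sketch of the kind of decision I would like to make once font details are available for each run of text. Everything in it is a placeholder for illustration: the `TextRun` class, the `classify` method, the size thresholds, and the "bold in the font name" check are my own assumptions, not anything PDFBox provides.

```java
import java.util.Locale;

// Hypothetical sketch: classify a run of text by its font details.
// None of these types come from PDFBox; they only illustrate the goal.
final class FormatClassifier {

    enum Role { TITLE, HEADER, BODY }

    /** Assumed container for one run of text plus the font it was drawn in. */
    static final class TextRun {
        final String text;
        final String fontName;
        final float fontSizePt;

        TextRun(String text, String fontName, float fontSizePt) {
            this.text = text;
            this.fontName = fontName;
            this.fontSizePt = fontSizePt;
        }
    }

    /**
     * Guess the role of a run relative to the document's body font size.
     * The 1.8x and 1.2x thresholds are arbitrary assumptions and would
     * need tuning against real documents.
     */
    static Role classify(TextRun run, float bodySizePt) {
        float ratio = run.fontSizePt / bodySizePt;
        boolean bold = run.fontName.toLowerCase(Locale.ROOT).contains("bold");

        if (ratio >= 1.8f) {
            return Role.TITLE;   // much larger than body text
        }
        if (ratio >= 1.2f || bold) {
            return Role.HEADER;  // somewhat larger, or bold at body size
        }
        return Role.BODY;
    }
}
```

So what I am missing is the first half: getting something like the font name and size for each piece of text out of the PDF in the first place.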

I did of course check the SourceForge forums for the PDFBox project, but they do not appear to have been enabled.

I would greatly appreciate it if someone who is familiar with the tool, or just a Java guru, could explain how I can extract the text and retain the formatting, as I can't get my head around it. Unfortunately it's not as simple as:



Thanks for any help, guys.
 