Extracting text with formatting using PDFBox

 
Greenhorn
Posts: 3
Hi guys,

I have been looking for a way to extract text from PDF documents using Java, and the best free solution I could find seems to be PDFBox. The tool looks good, but I am struggling to understand how to use it beyond the classes it ships with. It comes with a class called "PDFTextStripper" that does extract the text from a PDF, but all the formatting is lost. I need the formatting to be retained when the file is read into Java, because my code has to make decisions based on whether a piece of text was a title, a header, body text, etc.

I did of course check the SourceForge forums for the PDFBox project, but they do not appear to have been enabled.

I would greatly appreciate it if someone who is familiar with the tool, or just a Java guru, could explain how I can extract the text and retain the formatting, as I can't get my head around it. Unfortunately it's not as simple as:



Thanks for any help, guys.
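
To make the goal concrete, here is a rough sketch of the decision step described above: given runs of text together with their font size and weight, label each one as a title, header, or body text. It is plain Java and does not call PDFBox at all. The TextRun class, the Role enum, the 1.5x size threshold, and the "bold means header" rule are all made up for illustration. As far as I know, the font data could come from a subclass of PDFTextStripper that overrides writeString(String, List<TextPosition>) and reads values such as TextPosition.getFontSizeInPt(), but treat that as an assumption to check against the PDFBox version in use.

```java
import java.util.ArrayList;
import java.util.List;

public class FormatClassifier {

    /** Hypothetical holder for one run of text and the font data read alongside it. */
    public static final class TextRun {
        final String text;
        final float fontSizePt;
        final boolean bold;

        public TextRun(String text, float fontSizePt, boolean bold) {
            this.text = text;
            this.fontSizePt = fontSizePt;
            this.bold = bold;
        }
    }

    public enum Role { TITLE, HEADER, BODY }

    /**
     * Classifies a run relative to the document's body font size.
     * The 1.5x threshold and the "bold means header" rule are assumptions.
     */
    public static Role classify(TextRun run, float bodySizePt) {
        if (run.fontSizePt >= bodySizePt * 1.5f) {
            return Role.TITLE;
        }
        if (run.fontSizePt > bodySizePt || run.bold) {
            return Role.HEADER;
        }
        return Role.BODY;
    }

    /** Labels each run with its role, in input order. */
    public static List<String> label(List<TextRun> runs, float bodySizePt) {
        List<String> out = new ArrayList<>();
        for (TextRun run : runs) {
            out.add(classify(run, bodySizePt) + ": " + run.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample data only; real runs would be collected while parsing the PDF.
        List<TextRun> runs = new ArrayList<>();
        runs.add(new TextRun("Annual Report", 24f, false));
        runs.add(new TextRun("1. Overview", 14f, true));
        runs.add(new TextRun("Sales grew this year.", 12f, false));
        for (String line : label(runs, 12f)) {
            System.out.println(line);
        }
    }
}
```

The body font size is passed in rather than hard-coded because it differs between documents; one option would be to take the most common size seen across all runs.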
 