Generate text from PDF.

accnit Jai
Ranch Hand

Joined: Feb 15, 2011
Posts: 33
I need to convert a PDF document to HTML5. I do not want to use any available tool to achieve this; I want to write my own code. Being a Java developer, I started with iText, but I saw that iText just extracts the text from the PDF and does not keep the formatting and layout of the PDF.

Can someone please advise which API I should use to achieve this? Below are my high-level requirements.

1- Extract the text from the PDF without losing the formatting and layout.

2- Extract the images, if any.

3- Retain the formatting in the newly converted HTML5 page, the same as that of the PDF page.

Thanks in advance.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42944
I'm confused - you do not want to use any available tool (why? PDF is hugely complicated, do you really want to write all that code yourself?), but you considered using iText? There's a disconnect that you need to resolve for us before we can usefully recommend an approach.

AFAIK there is no free tool to convert PDF to anything that keeps the formatting. You can use the PDFRenderer project as a basis - it can display PDFs in Swing, so obviously it knows what to do with the formatting information.
accnit Jai
Ranch Hand

Joined: Feb 15, 2011
Posts: 33
Thanks Ulf, and sorry for the confusion. What I meant was that I do not want to use any paid software; I am looking for an open source Java API. I wrote a program using iText, but it just extracts the text from the PDF.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42944
As I said, I'm unaware of any free tool that extracts layout information from PDFs. If you are prepared to put a lot of work into it, you can go the route I suggested with the PDFRenderer source code.
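
To give an idea of the HTML5 side only, here is a minimal sketch, not a converter. It assumes you have already obtained each run of text together with its position and font size from whatever parser you build on. The `TextRun` class, the `toHtml` method and the sample coordinates are made up for illustration and are not part of iText or PDFRenderer. It uses only the standard library and places each run with absolute CSS positioning so the page geometry is kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PositionedHtmlSketch {

    /** Hypothetical holder for one run of text and where it sits on the page, in points. */
    static final class TextRun {
        final String text;
        final double x;        // distance from the left edge of the page
        final double yTop;     // distance from the TOP edge of the page
        final double fontSize;

        TextRun(String text, double x, double yTop, double fontSize) {
            this.text = text;
            this.x = x;
            this.yTop = yTop;
            this.fontSize = fontSize;
        }
    }

    /** Escapes the characters that would otherwise be read as HTML markup. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Writes one page as a relatively positioned div with one absolutely positioned span per run. */
    static String toHtml(List<TextRun> runs, double pageWidth, double pageHeight) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html>\n<html>\n<head><meta charset=\"utf-8\"></head>\n<body>\n");
        html.append(String.format(Locale.ROOT,
                "<div style=\"position:relative;width:%.1fpt;height:%.1fpt\">\n",
                pageWidth, pageHeight));
        for (TextRun r : runs) {
            html.append(String.format(Locale.ROOT,
                    "<span style=\"position:absolute;left:%.1fpt;top:%.1fpt;"
                            + "font-size:%.1fpt;white-space:pre\">%s</span>\n",
                    r.x, r.yTop, r.fontSize, escape(r.text)));
        }
        html.append("</div>\n</body>\n</html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        // Sample values only; a real page would supply these from the PDF parser.
        List<TextRun> runs = new ArrayList<>();
        runs.add(new TextRun("Chapter 1", 72, 72, 18));
        runs.add(new TextRun("Body text with <angle> brackets & ampersands.", 72, 108, 11));
        System.out.print(toHtml(runs, 595, 842));
    }
}
```

Note that PDF's default coordinate system has its origin at the bottom-left of the page, while CSS measures `top` from the top edge, so a y value taken from the PDF has to be flipped (page height minus y) before it goes into `yTop`. Fonts, images and rotated text are not handled here; getting those right is where most of the work lies.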
 