
Generate text from PDF.

 
accnit Jai
Ranch Hand
Posts: 33
I have a requirement to convert a PDF document to HTML5. I do not want to use any available tool to achieve this; I want to write my own code. Being a Java developer, I started with iText, but I saw that iText just extracts the text from the PDF and does not keep the PDF's formatting and layout.

Can someone please advise which API I should use to achieve this? Below are my high-level requirements.

1-Extract the text from the PDF without losing the formatting and layout.

2-Extract the images, if any.

3-Retain the same formatting in the newly converted HTML5 page as on the PDF page.

Thanks in advance.
 
Ulf Dittmer
Rancher
Posts: 42967
I'm confused - you do not want to use any available tool (why? PDF is hugely complicated, do you really want to write all that code yourself?), but you considered using iText? There's a disconnect that you need to resolve for us before we can usefully recommend an approach.

AFAIK there is no free tool to convert PDF to anything that keeps the formatting. You can use the PDFRenderer project as a basis - it can display PDFs in Swing, so obviously it knows what to do with the formatting information.
 
accnit Jai
Thanks Ulf, and sorry for the confusion. What I meant is that I do not want to use any paid software; I am looking for an open source Java API. I wrote a program using iText, but it just extracts the text from the PDF.
 
Ulf Dittmer
As I said, I'm unaware of any free tool that extracts layout information from PDFs. If you are prepared to put a lot of work into it, you can go the route I suggested with the PDFRenderer source code.
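
To give a rough idea of what the HTML5 side of such a converter could look like, here is a minimal sketch. It assumes that some PDF parser has already produced the text runs together with their page positions, which is the hard part and is not shown. TextRun, escapeHtml and toHtml are hypothetical names made up for this illustration; they are not part of iText or PDFRenderer. The sketch relies on two facts: default PDF user space is measured in points with the origin at the bottom-left of the page, while CSS positions elements from the top-left. Placing the top of each span at "baseline plus font size" is only an approximation, since real font ascent is ignored.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class PositionedTextToHtml {

    /** Hypothetical value type: one run of text as a PDF parser might report it. */
    static final class TextRun {
        final String text;
        final double x;        // left edge, in PDF points
        final double y;        // baseline, measured up from the bottom of the page
        final double fontSize; // in points

        TextRun(String text, double x, double y, double fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    /** Escapes the characters that would otherwise be read as markup. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits one absolutely positioned span per run inside a page-sized container. */
    static String toHtml(List<TextRun> runs, double pageWidth, double pageHeight) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\">")
            .append("<title>Page</title></head><body>\n");
        html.append(String.format(Locale.ROOT,
                "<div style=\"position:relative;width:%.1fpt;height:%.1fpt\">\n",
                pageWidth, pageHeight));
        for (TextRun run : runs) {
            // Flip the y axis: PDF counts up from the bottom, CSS counts down from the top.
            // Subtracting the font size approximates the top of the glyph box.
            double top = pageHeight - run.y - run.fontSize;
            html.append(String.format(Locale.ROOT,
                    "<span style=\"position:absolute;left:%.1fpt;top:%.1fpt;"
                    + "font-size:%.1fpt;white-space:pre\">%s</span>\n",
                    run.x, top, run.fontSize, escapeHtml(run.text)));
        }
        html.append("</div>\n</body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        // Sample data on a US Letter page (612 x 792 points); the values are made up.
        List<TextRun> runs = Arrays.asList(
                new TextRun("Generate text from PDF", 72, 720, 18),
                new TextRun("Body text with <angle> brackets", 72, 690, 12));
        System.out.print(toHtml(runs, 612, 792));
    }
}
```

Absolute positioning keeps the page looking like the PDF, but the resulting HTML does not reflow and is awkward to edit. Fonts, images, and grouping runs into paragraphs would all need additional work on top of this.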
 