Generate text from PDF

 
accnit Jai
Ranch Hand
Posts: 33
I have a requirement to convert a PDF document to HTML5. I do not want to use any available tool to achieve this; I want to write my own code. Being a Java developer, I started with iText, but I found that iText just extracts the text from the PDF and does not keep the formatting and layout of the PDF.

Can someone please advise which API I should use to achieve this? Below are my high-level requirements.

1- Extract the text from the PDF without losing the formatting and layout.

2- Extract the images, if any.

3- Retain the same formatting in the converted HTML5 page as in the PDF page.

Thanks in advance.
 
Ulf Dittmer
Rancher
Posts: 42974
I'm confused - you do not want to use any available tool (why? PDF is hugely complicated, do you really want to write all that code yourself?), but you considered using iText? There's a disconnect that you need to resolve for us before we can usefully recommend an approach.

AFAIK there is no free tool to convert PDF to anything that keeps the formatting. You can use the PDFRenderer project as a basis - it can display PDFs in Swing, so obviously it knows what to do with the formatting information.
 
accnit Jai
Ranch Hand
Posts: 33
Thanks Ulf, and sorry for the confusion. What I meant was that I do not want to use any paid software; I am looking for an open source Java API. I wrote a program using iText, but it just extracts the text from the PDF.
 
Ulf Dittmer
Rancher
Posts: 42974
As I said, I'm unaware of any free tool that extracts layout information from PDFs. If you are prepared to put a lot of work into it, you can go the route I suggested with the PDFRenderer source code.
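
To give a rough idea of the HTML5 side of that work: once you have text runs with their page coordinates (from PDFRenderer-derived code or anything else), one way to keep the layout is to emit absolutely positioned elements. The following is only a minimal sketch under stated assumptions. `TextChunk` and `toHtml` are hypothetical names, not part of iText or PDFRenderer, and the coordinates are assumed to be in PDF points with the origin at the bottom-left of the page. It ignores fonts, colors, rotation, and images.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of iText or PDFRenderer.
class PositionedHtmlSketch {

    // One run of text with its position on the page, in PDF points.
    // Assumption: origin at the bottom-left corner of the page.
    static final class TextChunk {
        final String text;
        final double x;        // left edge of the run
        final double y;        // baseline, measured from the bottom of the page
        final double fontSize;

        TextChunk(String text, double x, double y, double fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    // Escape the characters that are special in HTML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Render one page as a relatively positioned <div> containing
    // one absolutely positioned <span> per text run.
    static String toHtml(List<TextChunk> chunks, double pageWidth, double pageHeight) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.format(Locale.ROOT,
                "<div style=\"position:relative;width:%.1fpt;height:%.1fpt\">\n",
                pageWidth, pageHeight));
        for (TextChunk c : chunks) {
            // Flip the y axis: CSS 'top' is measured from the top of the page.
            // Subtracting the font size approximates the top of the text box
            // from the baseline; real font metrics would be more accurate.
            double top = pageHeight - c.y - c.fontSize;
            sb.append(String.format(Locale.ROOT,
                    "  <span style=\"position:absolute;left:%.1fpt;top:%.1fpt;font-size:%.1fpt\">%s</span>\n",
                    c.x, top, c.fontSize, escape(c.text)));
        }
        sb.append("</div>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        List<TextChunk> chunks = new ArrayList<>();
        chunks.add(new TextChunk("Hello & welcome", 72, 720, 12));
        chunks.add(new TextChunk("Second line", 72, 700, 12));
        // 612 x 792 points is US Letter.
        System.out.println(toHtml(chunks, 612, 792));
    }
}
```

The hard part remains getting reliable chunks out of the PDF in the first place; this only shows that the output step is comparatively simple once positions are known.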
 