Exporting text from PDF to Word while preserving style information

 
Greenhorn
I need to convert a PDF to Word while preserving the style. Can anyone help me or give me some advice?
 
Rancher
Welcome to the Ranch!
You can use Apache POI or PDFBox.
 
Saloon Keeper
The basic problem is that it is very hard to extract style information from a PDF. POI can create Word files, but has no PDF capabilities. PDFBox can extract text from a PDF, but has no easy API to extract style information.

The PDFRenderer project on GitHub can display PDFs, so obviously it knows how to extract styles. You could check what it does and try to do the same. Be prepared for much work.

The bottom line is that this will be a lot of work, and I predict that you will not get it to work. So let's take a step back and ask: why do you think you need to do it?
 
Daniel Demesmaecker
Rancher
You should be able to extract the styles with PDFBox. I have used it before to create a buffered image and show it in an image view; the same logic could be used to place the buffered image in a Word file.
I have used POI before to create Excel sheets that contained styling and images, although I believe I had to create the styles myself there; they weren't copied automatically.
 
Saloon Keeper
You have two different document formats here: Word is a word processor, PDF is text layout.

They may look a lot alike, but there's a very big difference.

In a PDF, the page layout is fixed (mostly - there's a format called "reflowable PDF"). The sizes and positions of everything are quite firmly nailed down onto each page of the document.

In a Word document, the page layout is more fluid, as anyone who's taken a Word document and moved it to a different computer - or even opened it with a different word-processor such as LibreOffice - can attest.  Paragraphs slide around from page to page. Fonts don't always match (this used to be a major problem before FreeType).

So at a minimum, expect to lose some things when converting.
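
To illustrate why something gets lost, here is a minimal sketch in plain Java (no PDF library involved). It assumes that an extractor hands you text fragments with page coordinates and nothing else, which is roughly what a fixed layout gives you. The Fragment class, the toLines method and the tolerance value are all hypothetical names chosen for this example; a real converter would need far more heuristics (columns, tables, rotated text, hyphenation).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: rebuild text lines from positioned fragments.
class LineGrouper {

    /** A piece of text nailed to a position on the page (assumed input shape). */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Groups fragments whose baselines lie within yTolerance into one line. */
    static List<String> toLines(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Assumes a bottom-left origin, so a larger y is higher on the page.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y).reversed());

        List<List<Fragment>> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - f.y) <= yTolerance) {
                last.add(f); // close enough to the current line's baseline
            } else {
                List<Fragment> line = new ArrayList<>();
                line.add(f);
                lines.add(line);
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            // Within a line, read left to right.
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment(100, 700.2, "world"));
        page.add(new Fragment(50, 700, "Hello"));
        page.add(new Fragment(50, 686, "Second"));
        for (String line : toLines(page, 2.0)) {
            System.out.println(line);
        }
    }
}
```

Note that even this toy version has to guess: the tolerance is arbitrary, and nothing in the input says where a paragraph starts or ends, which is exactly the information a word processor wants.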

A PDF is a set of metadata combined with a series of PostScript-like drawing commands. A Word document can be represented using Rich Text Format in a plain text file, or in traditional .doc format or in the XML-based .docx format. The command set is the same, regardless, so only the notation varies.
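
As a rough illustration of the Rich Text Format option, the sketch below writes styled runs as RTF using only the Java standard library. It assumes you have already recovered the text and its style flags somehow; the Run class and the toRtf and escape methods are hypothetical names made up for this example, and only a handful of RTF control words are used (\b, \i, \fs, \par).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: emit a tiny RTF document from styled text runs.
class RtfSketch {

    /** One run of text plus the style information recovered for it. */
    static final class Run {
        final String text;
        final boolean bold;
        final boolean italic;
        final int fontSizePt;

        Run(String text, boolean bold, boolean italic, int fontSizePt) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
            this.fontSizePt = fontSizePt;
        }
    }

    /** Escapes the three characters RTF treats as syntax (ASCII text assumed). */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}");
    }

    /** Builds a single-paragraph RTF document; the font name is an arbitrary choice. */
    static String toRtf(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\\rtf1\\ansi{\\fonttbl{\\f0 Helvetica;}}\\f0 ");
        for (Run r : runs) {
            sb.append('{'); // a group keeps the formatting local to this run
            if (r.bold) {
                sb.append("\\b");
            }
            if (r.italic) {
                sb.append("\\i");
            }
            sb.append("\\fs").append(r.fontSizePt * 2); // \fs counts half-points
            sb.append(' ').append(escape(r.text));
            sb.append('}');
        }
        sb.append("\\par}");
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = new ArrayList<>();
        runs.add(new Run("Plain ", false, false, 12));
        runs.add(new Run("bold", true, false, 12));
        System.out.println(toRtf(runs));
    }
}
```

Saving the printed string to a file with an .rtf extension should give something a word processor can open. Generating the Word side is the easy half; recovering the runs from the PDF is where the work is.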

There are some websites that claim to be able to convert PDFs or PostScript to Word format. Personally, I prefer something I can run in-house, for security reasons. But I haven't been able to actually find anything like that.

Linux has a very rich set of document-processing tools, so it's possible I could pipe a few of them together to do what you want, but no immediate solution comes to mind.
 
Shi Asong
Greenhorn
Thank you for your answers. I don't need to think about it for the time being, because the boss doesn't
 