Convert .DOC to .PDF

 
Steve Dyke
Ranch Hand
Posts: 1948
In a web application, I need to convert a .DOC file to a .PDF file.
I would also like to be able to insert one .PDF file into another.

Example: I have two PDF files. The first is a drawing file with multiple pages. The second is a single page file.
I need to insert the second .PDF just after page 1 of the first .PDF file.

I would like to do this with open-source code.

All the references I have found on the net seem to be outdated. My application already includes both PDFBox and iText.

Can someone guide me to the best and most correct solution?

Thanks for your consideration.
 
Tim Moores
Saloon Keeper
Posts: 5805
PDFBox would be the best tool for inserting and/or merging PDF pages, IMO. See https://pdfbox.apache.org/2.0/commandline.html to get started.

Converting is tough. I'm not aware of a general solution using free tools, but there is no shortage of online tools; maybe some of those have an API. A decent option might be to use LibreOffice's headless mode; see https://ask.libreoffice.org/en/question/2641/convert-to-command-line-parameter/ for more information on that.
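
A minimal sketch of the headless route from Java, assuming LibreOffice is installed on the server and its `soffice` launcher is on the PATH. The class and method names (`DocToPdf`, `buildCommand`, `convert`, `pdfName`) and the two-minute timeout are illustrative choices, not part of any library. The sketch also checks that the output file exists, since the exit code alone may not reveal a failed conversion.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that shells out to headless LibreOffice. */
class DocToPdf {

    /** Builds the command line; "soffice" is assumed to be on the PATH. */
    static List<String> buildCommand(Path doc, Path outDir) {
        return Arrays.asList(
            "soffice", "--headless",
            "--convert-to", "pdf",
            "--outdir", outDir.toString(),
            doc.toString());
    }

    /** Replaces the last extension with .pdf, e.g. report.doc becomes report.pdf. */
    static String pdfName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;
        return base + ".pdf";
    }

    /** Runs the conversion and returns the path of the PDF that was written. */
    static Path convert(Path doc, Path outDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(doc, outDir))
            .inheritIO()
            .start();
        // Arbitrary illustrative timeout so a hung conversion cannot block a request forever.
        if (!process.waitFor(2, TimeUnit.MINUTES)) {
            process.destroyForcibly();
            throw new IOException("LibreOffice timed out converting " + doc);
        }
        if (process.exitValue() != 0) {
            throw new IOException("LibreOffice exited with code " + process.exitValue());
        }
        Path pdf = outDir.resolve(pdfName(doc.getFileName().toString()));
        if (!Files.exists(pdf)) {
            throw new IOException("No PDF was produced for " + doc);
        }
        return pdf;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: DocToPdf <input.doc> <output-dir>");
            return;
        }
        System.out.println("Wrote " + convert(Paths.get(args[0]), Paths.get(args[1])));
    }
}
```

In a web application this would typically run on a worker thread or queue rather than inside the request itself, since starting LibreOffice can take several seconds.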
 
Steve Dyke
Ranch Hand
Posts: 1948

Tim Moores wrote:

Converting is tough. I'm not aware of a general solution using free tools, but there is no shortage of online tools; maybe some of those have an API. A decent option might be to use LibreOffice's headless mode; see https://ask.libreoffice.org/en/question/2641/convert-to-command-line-parameter/ for more information on that.



If we were to purchase a tool that could be integrated into our web application, which runs on an iSeries WebSphere application server, what would you recommend?
 
Steve Dyke
Ranch Hand
Posts: 1948

Tim Moores wrote:PDFBox would be the best tool for inserting and/or merging PDF pages, IMO. See https://pdfbox.apache.org/2.0/commandline.html to get started.



In this article I can see merging, but what I really need is to take a multipage PDF file (parent) and insert a PDF file (child) after page 1 of the parent file. These are PDF files containing text, graphics, and images.
 
Tim Moores
Saloon Keeper
Posts: 5805

Steve Dyke wrote:In this article I can see merging, but what I really need is to take a multipage PDF file (parent) and insert a PDF file (child) after page 1 of the parent file.


That's really the same from a PDF programming point of view: both are about taking PDF pages from one or more sources and adding them to another PDF.
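
To make that concrete, here is a sketch of just the page-ordering step, written against plain lists so it runs without any PDF library. `PageSplice` and `insertAfter` are hypothetical names, and the generic `T` stands in for whatever page object the library provides. With PDFBox 2.x that would presumably be `PDPage`, with the merged sequence then added page by page to a new `PDDocument`, but that wiring is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: computes the page order for inserting one document into another. */
class PageSplice {

    /**
     * Returns a new list holding the first afterPage parent pages,
     * then every child page, then the remaining parent pages.
     * afterPage is 1-based, so afterPage = 1 means "right after page 1";
     * afterPage = 0 puts the child in front of the parent.
     */
    static <T> List<T> insertAfter(List<T> parent, List<T> child, int afterPage) {
        if (afterPage < 0 || afterPage > parent.size()) {
            throw new IllegalArgumentException(
                "afterPage must be between 0 and " + parent.size());
        }
        List<T> merged = new ArrayList<>(parent.size() + child.size());
        merged.addAll(parent.subList(0, afterPage));
        merged.addAll(child);
        merged.addAll(parent.subList(afterPage, parent.size()));
        return merged;
    }

    public static void main(String[] args) {
        // Stand-ins for a three-page drawing file and a single-page file.
        List<String> drawing = Arrays.asList("D1", "D2", "D3");
        List<String> single = Arrays.asList("S1");
        System.out.println(insertAfter(drawing, single, 1)); // prints [D1, S1, D2, D3]
    }
}
```

Merging is the special case where `afterPage` equals the parent's page count, which is why the two operations look the same to a PDF library.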

Steve Dyke wrote:what would you recommend?


I haven't used any of the commercial tools, so I can't recommend one.
 
Sheriff
Posts: 24654
I just spent a minute looking at the PDFBox documentation -- I was curious to see if it was possible to extract a particular page from a PDDocument. Your use of the getNumberOfPages() method in your other parallel post suggested to me that it might be possible. Turns out it is possible.
 
Saloon Keeper
Posts: 21126
The Linux platform has a whole raft of PDF creation and manipulation tools. Working with text processors is one of my specialties so I use many of them on a fairly frequent basis. You can insert, delete and re-arrange pages, split and merge documents, add/remove document properties - all sorts of stuff.

Getting a DOC file into PDF is a bit more complex, and my usual go-to for that is headless LibreOffice. Do realize, however, that there is a fundamental difference between DOC and PDF. Programs like MS-Word and LibreOffice Writer are word processors. They allow creating and editing of formatted text. PDFs, however, are basically typeset documents.

What's the difference? Well, a typeset document - which originally was something you'd create with a page layout program such as Aldus PageMaker - has every element on the page nailed down to a very specific position (and in the case of text, a very specific font). Word processors are more free-form and they work with the constraints of the system that they are running on.

What this means is that a PDF will render exactly the same on any system anywhere. Word documents, however, will not. This is a common complaint by the ignorant about open-source word processors such as LibreOffice. Text imported from Word doesn't look the same, page breaks may move, etc. In actuality, this selfsame problem occurs when moving from one MS-Word system to another as well. The font metrics used for page layout are actually obtained from the currently selected printer driver, so the page layout will vary based on what the printer driver tells it.

It's not as noticeable these days, since most of us use soft fonts. Back in the previous millennium, however, soft fonts were mostly on Mac systems, and HP printers (as a popular example) came with a half-dozen hard fonts, which usually weren't even scalable. The office I worked in in the late '90s, in fact, had two HP LaserJets, each with its own unique set of fonts, and documents would constantly rearrange themselves as people passed them back and forth.

So when you send a Word document to a PDF converter, the only way to get accuracy relative to the original source is if the original user's computer has a "print-to-PDF" printer driver installed and they produce the PDF themselves. Otherwise, count on a certain amount of re-structuring.

You can, however, reduce the disappointments. First and foremost, ensure that your users understand that MS-Word is NOT a typewriter. Avoid hard spacing and using ENTER to produce blank lines. Use tabs and styles instead. Make sure that there's a good match between the document creator/editor machines and the PDF converter machine. Linux can handle TrueType™ fonts these days and there's a common MS-Windows font core package that can be installed.
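
One way to check the font match on the converter machine is sketched below. It compares a list of font families the documents are expected to use against the families Java can see on that machine. The names `FontCheck`, `missingFonts`, and `installedFonts` are made up for this example, the required list in `main` is only a sample, and the fonts Java reports are assumed (not guaranteed) to match what the converter itself will find.

```java
import java.awt.GraphicsEnvironment;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: reports which required font families are not installed. */
class FontCheck {

    /** Returns the required names that have no case-insensitive match among the installed names. */
    static List<String> missingFonts(Collection<String> required, Collection<String> installed) {
        Set<String> have = new HashSet<>();
        for (String name : installed) {
            have.add(name.toLowerCase(Locale.ROOT));
        }
        List<String> missing = new ArrayList<>();
        for (String name : required) {
            if (!have.contains(name.toLowerCase(Locale.ROOT))) {
                missing.add(name);
            }
        }
        return missing;
    }

    /** Font families visible to this JVM on the current machine. */
    static List<String> installedFonts() {
        return Arrays.asList(
            GraphicsEnvironment.getLocalGraphicsEnvironment().getAvailableFontFamilyNames());
    }

    public static void main(String[] args) {
        // Sample list only; substitute the fonts your document authors actually use.
        List<String> required = Arrays.asList("Arial", "Times New Roman", "Calibri");
        System.out.println("Missing on this machine: " + missingFonts(required, installedFonts()));
    }
}
```

Any family reported as missing will be substituted during conversion, which is where moved page breaks tend to come from.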

With a little care, you can do quite well.

Now ask me how to create ePub books
 