Can you remove an element from a PDF using PDFBox?

 
Hi!

I'm having an issue... It appears you cannot remove an element from a PDF with PDFBox, but I need to do that. Or something like that.

Not too long ago I found out about these really cool things called Optional Content Groups. That's basically the PDF specification's name for a 'layer' (PDFBox uses the same term), and I actually like that name better, because these layers don't work the way they do in Photoshop or whatever. Their primary purpose isn't stacking things for display; in display terms, the objects within one of these 'layers' can overlap each other, so they aren't really 'layers' at all, and they *contain* real 'layers'. What they really are is "optional content" groups that you can use to say "Hey, I don't want to display this. Turn this off."

When you do that, though, it doesn't actually remove anything from the document; it just adds a flag that says "Don't show this." So I have a situation where I need to selectively hide images in a PDF and save it so people can view it in-browser.

So here's the problem: it seems that all of the most commonly used browsers come with a default PDF viewer that has NO support for PDF layers... which is incredibly inconvenient, and kind of irritating. That seems like kind of an important part of the PDF specification; I don't know how a browser that 'views PDFs' can just ignore it. So they support "almost" all of the PDF specification. On a related note, there's "almost" enough oxygen dissolved in ocean water for me to breathe it.

So I have some options... In order of most to least terrible:

1) I could just add a note that tells people "Hey, this probably won't display right in your browser, download it and open it with Acrobat Reader. It's free." But I don't want to do that for many, many reasons.
2) I can *add* images with PDFBox; I could just add a little white square on top of the things I need to hide, but that is a kludge and I do not want to do that. Also, I'd have to know exactly what positions to add them at. Also, I think it would take a while; I'd have to have it add several images before saving every time someone asked us to generate a PDF.
3) I could put text fields with non-transparent backgrounds over the images I might need to hide, then populate that with a few spaces when I want to hide a given image. That is also a kludge but it isn't as bad of a kludge and it would be less resource-intensive.
4) I could try to find a way to remove the elements from the document entirely; not just make them "Non-optional", but kill them. I have actually found the object in the Document COSObject list, and I removed it, and I removed it from the OptionalContentGroups list; I've removed it from the COSObject Dictionary; but none of that has worked. Every time, it always stubbornly appears on the page still after it's saved. This is the solution I would prefer to use, but I can't figure out how to make it happen. Has anyone else run into this sort of thing before? I found a few records of people who wanted *all* images removed or extracted, but I don't want them all gone, and it isn't the image *resources* I care about, it's the PDF elements that use them...

On second thought though, since the images aren't really being referenced at their original locations prior to importation, it's possible that they're all there as separate image objects in the document resources... I'm going to try combing through that and seeing if I can eliminate it from there, then I'll head back here and report on whether that worked.
 
Alex Lieb
I found a way. Here's the resulting method:



The document resources object was a dead end; so was searching in the document resources object in the COSObjectDictionary, and the DocumentCatalog.

Eventually though, in this endless morass of nested hashmaps, non-native duplicates of native objects, COSBase, COSString, COSObject, and COSArrays that are only retrievable if you know the associated COSName key, I knew there had to be an object that I could kill that would result in the image no longer displaying.

It turns out the object you retrieve by calling PDDocument.getCatalog() is actually the same object you get when you call PDDocument.getDocument().getCatalog(), which I'm assuming is also equivalent to something like PDDocument.getDocument().getDictionaryObject(COSName.CATALOG).

Also, the object you retrieve by calling COSDocumentCatalog.getPages() is the same one you'd get by calling COSDocumentCatalog.getDictionaryObject(COSName.PAGES).
So basically the objects that have their own getter methods aren't special or whatever; they just happen to be hashmap entries that got dedicated methods so you can retrieve them more easily.

That would be really great if I didn't have to go another 6 layers into the Hashmap mess to get the object I actually need...

In order to figure out whether I'm actually looking at the entry I want to remove, I have to know what the element's "SUBJ" property is... which is different from its "SUBJECT" property.
It ended up being in DOCUMENT>CATALOG>PAGES>KIDS>[0]>ANNOTS>[?]>SUBJ
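
For anyone who wants to see the shape of that walk without loading a real PDF, here is a minimal sketch in plain Java. It is not the method from this post: ordinary Map and List objects stand in for PDFBox's COSDictionary and COSArray, and the class name, the helper names, and the sample "Subj" values ("logo", "stamp") are all made up for illustration. With PDFBox itself the same steps would presumably go through COSDictionary.getDictionaryObject(...) with keys such as COSName.PAGES, COSName.KIDS, and COSName.ANNOTS; that part is an assumption and is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in model: Map plays the role of a COSDictionary,
// List plays the role of a COSArray. No PDFBox classes are used.
class AnnotationRemovalSketch {

    @SuppressWarnings("unchecked")
    static Map<String, Object> dict(Object o) {
        return (Map<String, Object>) o;
    }

    @SuppressWarnings("unchecked")
    static List<Object> array(Object o) {
        return (List<Object>) o;
    }

    // Follows Catalog > Pages > Kids > [pageIndex] > Annots and returns the
    // annotation array, or null if the page has no "Annots" entry.
    static List<Object> annotsOf(Map<String, Object> catalog, int pageIndex) {
        Map<String, Object> pages = dict(catalog.get("Pages"));
        List<Object> kids = array(pages.get("Kids"));
        Map<String, Object> page = dict(kids.get(pageIndex));
        Object annots = page.get("Annots");
        return annots == null ? null : array(annots);
    }

    static int annotationCount(Map<String, Object> catalog, int pageIndex) {
        List<Object> annots = annotsOf(catalog, pageIndex);
        return annots == null ? 0 : annots.size();
    }

    // Removes every annotation whose "Subj" value equals the given subject
    // and returns how many entries were removed.
    static int removeBySubject(Map<String, Object> catalog, int pageIndex, String subject) {
        List<Object> annots = annotsOf(catalog, pageIndex);
        if (annots == null) {
            return 0;
        }
        int removed = 0;
        Iterator<Object> it = annots.iterator();
        while (it.hasNext()) {
            Map<String, Object> annot = dict(it.next());
            if (subject.equals(annot.get("Subj"))) {
                it.remove(); // safe removal while iterating
                removed++;
            }
        }
        return removed;
    }

    // Builds a tiny made-up document: one page with two annotations.
    static Map<String, Object> sampleCatalog() {
        Map<String, Object> logo = new LinkedHashMap<>();
        logo.put("Subj", "logo");
        Map<String, Object> stamp = new LinkedHashMap<>();
        stamp.put("Subj", "stamp");

        List<Object> annots = new ArrayList<>();
        annots.add(logo);
        annots.add(stamp);

        Map<String, Object> page = new LinkedHashMap<>();
        page.put("Annots", annots);

        List<Object> kids = new ArrayList<>();
        kids.add(page);

        Map<String, Object> pages = new LinkedHashMap<>();
        pages.put("Kids", kids);

        Map<String, Object> catalog = new LinkedHashMap<>();
        catalog.put("Pages", pages);
        return catalog;
    }

    public static void main(String[] args) {
        Map<String, Object> catalog = sampleCatalog();
        int removed = removeBySubject(catalog, 0, "logo");
        System.out.println("removed=" + removed
                + ", remaining=" + annotationCount(catalog, 0));
    }
}
```

One caveat on the [0] in the path: in larger PDFs the Kids array can hold intermediate page-tree nodes rather than pages, so index 0 is only "the first page" in simple documents like the one described here.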

So you CAN do it, but I had to go over each layer of this by hand to even figure out what I was looking at. I also went down the wrong path several times, and actually found objects which apparently were related to the one I wanted to kill, but removing them didn't actually do anything. So... If you need to remove an XObject from a PDF using PDFBox, this is apparently how you do it.
 