
Parsing PDF

 
Aleksey Matiychenko
Ranch Hand
Posts: 178
Does anyone know of any APIs that allow you to parse PDF files using Java?
 
H.-Gerd Rosarius
Greenhorn
Posts: 4
I know of a tool that could be suitable: "PJ" from Etymon.
Have a look at the following URL:
http://www.etymon.com/pj/index.html
 
Aleksey Matiychenko
Ranch Hand
Posts: 178
I found the tool but it has no documentation and I am having a hard time figuring out how to parse a document. Any ideas?
 
Val Dra
Ranch Hand
Posts: 439
Isn't there a documentation link on the left-hand side? I just browsed through it.
 
Aleksey Matiychenko
Ranch Hand
Posts: 178
Yeah, but it does not have much in it.
 
H.-Gerd Rosarius
Greenhorn
Posts: 4
Hey guys!
Well, you're right about the documentation. In general it is quite bad and there isn't much of it. Furthermore, "PJ" itself is not what I would call "comfortable", but it seems we don't have much choice. I couldn't find any real alternatives.
A friend told me that there is a project from the Apache group that could provide the classes I need, but I couldn't find anything. Maybe he fooled me? :)
If YOU find something, it would be kind of you to post it.
H.-Gerd
 
H.-Gerd Rosarius
Greenhorn
Posts: 4
Hey friends!
Here is some code I wrote. It does not work absolutely perfectly, but for my purposes it is OK. I was interested in the String objects within a PDF document.
Maybe you can use it for your purposes:
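//####################################################################
// Imports assumed from the PJ package layout (not shown in the original post):
// import java.io.*; import java.util.Vector;
// import com.etymon.pj.*; import com.etymon.pj.object.*;
// import com.etymon.pj.object.pagemark.*; import com.etymon.pj.exception.*;
public void analyze() {

    Pdf myPdfDoc = null;
    try {
        myPdfDoc = new Pdf( this.pathToPdf );
    }
    catch ( FileNotFoundException fnf ) {
        fnf.printStackTrace();
    }
    catch ( IOException ioe ) {
        ioe.printStackTrace();
    }
    catch ( PjException pje ) {
        pje.printStackTrace();
    }

    // Stop here if the document could not be opened; otherwise the
    // calls below would throw a NullPointerException.
    if ( myPdfDoc == null ) {
        return;
    }

    try {
        if ( myPdfDoc.getEncryptDictionary() != null ) {
            System.out.println( "File appears to be encrypted." );
        }
        else {
            // Walk every object in the document; object numbers start at 1.
            int objectNum = myPdfDoc.getMaxObjectNumber();
            for ( int i = 1; i <= objectNum; i++ ) {
                PjObject myPdfObject = myPdfDoc.getObject( i );
                if ( myPdfObject instanceof PjStream ) {
                    // Decompress the stream and parse it as page-marking operators.
                    StreamParser sp = new StreamParser();
                    PjStream myPjStream = ( ( PjStream ) myPdfObject ).flateDecompress();

                    Vector myvec = sp.parse( myPjStream );
                    for ( int j = 0; j < myvec.size(); j++ ) {
                        // XTj represents the Tj (show text) operator; print its string operand.
                        if ( myvec.get( j ) instanceof XTj ) {
                            PjString myPjString = ( ( XTj ) myvec.get( j ) ).getText();
                            System.out.println( myPjString.getString() );
                        }
                    }
                }
            }
        }
    }
    catch ( PdfFormatException pfe ) {
        pfe.printStackTrace();
    }
    catch ( PjException pje ) {
        pje.printStackTrace();
    }
}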
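
For anyone without PJ at hand, the idea behind the code above (decompress a stream, then collect the string operands of the Tj operator) can be sketched with the JDK alone. This is only a rough illustration, not how PJ works internally: the class name TjTextSketch and its methods are made up for this example, it assumes the stream uses Flate (zlib) compression, and the regular expression only handles simple literal strings, not hex strings, nested parentheses or TJ arrays.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, not part of PJ or any other library.
public class TjTextSketch {

    // A literal string operand followed by the Tj operator, e.g. "(Hello) Tj".
    // Backslash escapes are allowed inside the string; nested parentheses are not.
    private static final Pattern TJ =
            Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj");

    // Undo Flate (zlib) compression of a stream body.
    public static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && !inflater.finished()) {
                    throw new IllegalArgumentException("truncated or unsupported data");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
    }

    // Compress bytes the same way, only needed here to build sample input.
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Collect the operands of every "(...) Tj" in a decompressed content stream.
    public static List<String> extractTjStrings(String content) {
        List<String> result = new ArrayList<>();
        Matcher m = TJ.matcher(content);
        while (m.find()) {
            // Remove the backslash in front of escaped '(', ')' and '\'.
            result.add(m.group(1).replaceAll("\\\\([()\\\\])", "$1"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up content stream standing in for one read from a real PDF.
        String sample = "BT /F1 12 Tf 72 712 Td (Hello, PDF) Tj ET";
        byte[] compressed = deflate(sample.getBytes(StandardCharsets.ISO_8859_1));
        String content = new String(inflate(compressed), StandardCharsets.ISO_8859_1);
        System.out.println(extractTjStrings(content));
    }
}
```

A real document also needs the font encoding applied to those strings, which is one reason a dedicated library is usually the better choice.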
 
Aleksey Matiychenko
Ranch Hand
Posts: 178
Thank you.
This works like a charm
 
Balbhadra Singh
Greenhorn
Posts: 13
Hi Nassar,
I am using the pjx.jar API to read PDF documents. My aim is to get the "object structure" of a PDF document.
When I try to read PDF files, I get exceptions like the following.
While reading PDF file 1:
com.etymon.pj.exception.PdfFormatException: Token " " not recognized.
at com.etymon.pj.StreamParser.processToken(StreamParser.java:958)
at com.etymon.pj.StreamParser.parse(StreamParser.java:22)
at GetPDFInfo.main(GetPDFInfo.java:35)
-------------------------------------------------------------
While reading PDF file 2:
com.etymon.pj.exception.PdfFormatException: Token "%!PS-AdobeFont-1.1:" not recognized.
at com.etymon.pj.StreamParser.processToken(StreamParser.java:958)
at com.etymon.pj.StreamParser.parse(StreamParser.java:22)
at GetPDFInfo.main(GetPDFInfo.java:35)
-------------------------------------------------------------
While reading PDF file 3:
com.etymon.pj.exception.PdfFormatException: Token "??? ?Adobe d? ?? C " not recognized.
at com.etymon.pj.StreamParser.processToken(StreamParser.java:958)
at com.etymon.pj.StreamParser.parse(StreamParser.java:22)
at GetPDFInfo.main(GetPDFInfo.java:101)

Can you please tell me what the reason for these errors is?
The error is caused by the following line:
Vector myvec = sp.parse( myPjStream );
This line appears in this block of code:
StreamParser sp = new StreamParser();
PjStream myPjStream = ( ( PjStream ) obj ).flateDecompress();
Vector myvec = sp.parse( myPjStream );
for ( int j = 0; j < myvec.size(); j++ ) {
    if ( myvec.get( j ) instanceof XTj ) {
        PjString myPjString = ( ( XTj ) myvec.get( j ) ).getText();
        System.out.println( myPjString.getString() );
    }
}

Regards,
Balbhadra
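
One plausible reading of these traces, offered as an assumption rather than something confirmed in this thread: not every stream object in a PDF is a page content stream. A token such as "%!PS-AdobeFont-1.1:" looks like the start of an embedded Type 1 font program, and a binary token containing "Adobe" looks like image data, so a content-stream parser would be expected to reject both. A loop that tries every stream can catch the failure per stream and move on instead of aborting. The sketch below shows only that pattern; ContentParser, ScanResult and TolerantStreamScan are hypothetical names standing in for a real parser such as PJ's StreamParser.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these types belong to PJ.
public class TolerantStreamScan {

    // Stand-in for a content-stream parser that may reject non-content streams.
    public interface ContentParser {
        List<String> parse(byte[] stream) throws Exception;
    }

    // Text from the streams that parsed, plus the indices of those that did not.
    public static final class ScanResult {
        public final List<String> text = new ArrayList<>();
        public final List<Integer> skipped = new ArrayList<>();
    }

    // Try every stream; a failure on one stream is recorded, not propagated.
    public static ScanResult scan(List<byte[]> streams, ContentParser parser) {
        ScanResult result = new ScanResult();
        for (int i = 0; i < streams.size(); i++) {
            try {
                result.text.addAll(parser.parse(streams.get(i)));
            } catch (Exception e) {
                // Probably a font, image or other non-content stream.
                result.skipped.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy parser: rejects anything that looks like a Type 1 font program.
        ContentParser toy = stream -> {
            String s = new String(stream, StandardCharsets.ISO_8859_1);
            if (s.startsWith("%!PS")) {
                throw new IllegalStateException("Token not recognized");
            }
            return Collections.singletonList(s);
        };
        List<byte[]> streams = Arrays.asList(
                "page one".getBytes(StandardCharsets.ISO_8859_1),
                "%!PS-AdobeFont-1.1:".getBytes(StandardCharsets.ISO_8859_1),
                "page two".getBytes(StandardCharsets.ISO_8859_1));
        ScanResult r = scan(streams, toy);
        System.out.println("text=" + r.text + " skipped=" + r.skipped);
    }
}
```

With PJ the same effect would come from moving the try/catch around the parse call inside the loop. A more precise approach would be to parse only the streams referenced from each page's Contents entry.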
 
Sean Sullivan
Ranch Hand
Posts: 427
http://www.pdfbox.org/
 
Brian Pipa
Ranch Hand
Posts: 299
Take a look at iText: http://www.lowagie.com/iText/
It may do what you need, and it is well documented.
Brian
 