Reading a table in a pdf file ?

 
Ranch Hand
Posts: 66
Hello,

Is there a Java library that can help me read a table in a PDF file?
I tried the PDFBox library, but I guess it doesn't allow this.
I need to read a table in the PDF, grab the data in each cell, and then use that data.

Any help?

Thanks,
Hesham
 
Ranch Hand
Posts: 110
See this link: http://schmidt.devlib.org/java/libraries-pdf.html
 
Hesham Gneady
Ranch Hand
Posts: 66
Thanks for the help. I've checked most of those libraries.
Most of them can extract text from PDF files, but I don't see any that can read a table and extract the data from each cell.
 
Rancher
Posts: 43081
I don't think there's a Java library that can do this. Something like JPedal will give you all the text of the PDF, but not cell by cell.
[ November 17, 2008: Message edited by: Ulf Dittmer ]
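If a library only returns the page text as lines, one rough workaround is to split each line into cells yourself. The sketch below assumes the extracted text keeps columns apart by two or more whitespace characters, which many extractors do not guarantee. `ExtractedTextRows` and its methods are hypothetical names for illustration, not part of JPedal, PDFBox, or any other library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of any PDF library.
class ExtractedTextRows {

    // Splits one line of extracted text into cells.
    // Assumption: columns are separated by two or more whitespace characters,
    // so a single space inside a cell ("Blue pen") does not split it.
    static List<String> splitCells(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s{2,}"));
    }

    // Turns the whole extracted text into rows of cells, skipping blank lines.
    static List<List<String>> parse(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            List<String> cells = splitCells(line);
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample text standing in for what a text extractor might return.
        String text = "Name      Qty   Price\n"
                    + "Blue pen  10    1.50\n"
                    + "Notebook  3     4.25\n";
        for (List<String> row : parse(text)) {
            System.out.println(row);
        }
    }
}
```

This breaks down as soon as a cell is empty or the extractor collapses the spacing, so treat it as a starting point only.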
 
Author
Posts: 836
Does the PDF file format actually have a concept of tables? It's much like PostScript, so I'd imagine it only holds layout information for (a) text, (b) vector graphics (including the lines around table cells), and (c) bitmap graphics (such as inserted images). Most PDFs aren't directly editable either, for the very reason that they don't contain nearly as much information as the original DTP/word processor/spreadsheet document. PDFs are designed for displaying a document uniformly, not for non-human content analysis; at least, that's what I understand. So I think you'll struggle to extract anything other than text, lines/shapes and graphics.
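If all you can get is text plus its position on the page, a table has to be rebuilt from the coordinates. Here is a minimal sketch of that idea, assuming you already have each text fragment with an x and y position from some extraction step. The `Fragment` type, the `toRows` method, and the tolerance value are hypothetical, and y is assumed to increase down the page (native PDF coordinates run the other way, so you may need to flip it).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: rebuilds rows of cells from positioned text fragments.
class PositionedTextGrid {

    // A piece of text and the page coordinates where it starts.
    // Assumption: y grows downward, so smaller y means higher on the page.
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Groups fragments into rows: a fragment joins the current row when its y
    // is within yTolerance of the row's first fragment. Each row is then
    // ordered left to right by x.
    static List<List<String>> toRows(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        List<Fragment> current = null;
        double rowY = 0;
        for (Fragment f : sorted) {
            if (current == null || Math.abs(f.y - rowY) > yTolerance) {
                current = new ArrayList<>();
                rows.add(current);
                rowY = f.y;
            }
            current.add(f);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }
}
```

This ignores empty cells, cells that wrap onto several lines, and the ruling lines themselves, which is where most of the real difficulty lies.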
 
Hesham Gneady
Ranch Hand
Posts: 66
I guess you're right, Charles. I first thought a PDF file might be structured like an XML file (or something like that), so that I could detect tables, images, and so on.
But I guess I was mistaken.

This means what I want to do is impossible.

Thanks for the help.
 
Sheriff
Posts: 22783
Well, "impossible" is maybe a bit harsh, but it's definitely not easy.
 
Greenhorn
Posts: 4
Hi,

I am facing a similar issue. I used PDFBox and iText without much luck. Did you come across any solution for this?
 
lowercase baba
Posts: 13089
Sunil, you realize that this thread has not been touched in over two years, right?
 
sunil dm
Greenhorn
Posts: 4
Hi fred,

Yes, I see that, but I didn't find a more suitable post to ask in. So, coming to the point: has anything come along in these two years that makes this simpler?
 
Marshal
Posts: 28193
You can read the official PDF specification here (PDF). You'll see that it hardly uses the word "table" at all and certainly never in the context of a rectangular grid containing independent cells.

So. PDF doesn't have tables. So if you're trying to get data out of a "table" then you're going down the wrong track. You need to find out how the data is actually organized... but this was all covered in the posts from two years ago.
 