
PDF to XML Conversion using Apache Tika

 
sudheer yathagiri kumar
Ranch Hand
Posts: 35
Dear all,
I have to convert PDF files to XML using Apache Tika. Is this the right choice (PDFBox is embedded in it)?
Can you give sample source code and related links?
The actual requirement is that our PDFs contain tabular data, and I want to extract that data.

Thanks in advance.
 
Ulf Dittmer
Rancher
Posts: 42968
I'm not sure what Apache Tika would have to do with this. You can extract the text of a PDF using PDFBox, but it's generally very hard to get at the formatting information in PDFs, so you will likely not be able to distinguish easily which text is in tables in the PDF, and which text isn't.
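
To illustrate why this is fragile, here is a minimal sketch of the kind of heuristic you would end up writing. It assumes the text has already been extracted line by line (for example with PDFBox's PDFTextStripper) and that table columns happen to be separated by two or more whitespace characters. That assumption, the class name TableTextToXml, and the table/row/cell element names are all made up for illustration; real PDFs frequently break this rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns already-extracted text lines into a simple XML table.
final class TableTextToXml {

    // Splits a line into cells on runs of two or more whitespace characters.
    // This column separator is an assumption, not something PDFs guarantee.
    static List<String> splitCells(String line) {
        List<String> cells = new ArrayList<>();
        for (String part : line.trim().split("\\s{2,}")) {
            if (!part.isEmpty()) {
                cells.add(part);
            }
        }
        return cells;
    }

    // Escapes the characters that are not allowed in XML element text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Lines with fewer than two cells are treated as ordinary text and skipped.
    static String toXml(List<String> lines) {
        StringBuilder sb = new StringBuilder("<table>\n");
        for (String line : lines) {
            List<String> cells = splitCells(line);
            if (cells.size() < 2) {
                continue;
            }
            sb.append("  <row>");
            for (String cell : cells) {
                sb.append("<cell>").append(escape(cell)).append("</cell>");
            }
            sb.append("</row>\n");
        }
        return sb.append("</table>\n").toString();
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("Quarterly report");        // single spaces only: skipped
        lines.add("Item    Qty    Price");    // wide gaps: treated as a row
        lines.add("Widget    3    4.50");
        System.out.print(toXml(lines));
    }
}
```

Anything that does not follow the spacing rule (wrapped cells, empty cells, right-aligned numbers) ends up in the wrong column or is dropped, which is exactly the difficulty described above.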

If you have LOTS of time available, then my advice is the same as I gave here.

Otherwise, my advice is to give up on the idea.
 
sudheer yathagiri kumar
Ranch Hand
Posts: 35
Dear experts,

Actually, my requirement is to convert PDF table data to XML format using Apache Tika.
Can anyone help?
Is it possible to overwrite JARs in Java?
If yes, how can I call the static and private methods from my own Java class?

Thanks in advance.
 
Ulf Dittmer
Rancher
Posts: 42968
Yes, I think we understood that from your original question. But the question remains: why do you think Tika would be involved? Do you know what Tika is and does? Other than that, I stand by my previous post, and predict that you will end up not doing this due to its complexity.
 
sudheer yathagiri kumar
Ranch Hand
Posts: 35
Ulf Dittmer wrote: Yes, I think we understood that from your original question. But the question remains: why do you think Tika would be involved? Do you know what Tika is and does? Other than that, I stand by my previous post, and predict that you will end up not doing this due to its complexity.


I downloaded the PDF-Renderer project and ran the code. It shows a Swing UI that asks for a file name and then only displays the PDF file, nothing more.
My actual requirement has nothing to do with a Swing UI or styling; it is simply extraction of data.
Is there any data extraction in it?

I used this link: https://java.net/projects/pdf-renderer/downloads
 
Ulf Dittmer
Rancher
Posts: 42968
You misunderstood what I was suggesting. I'm aware that PDF-Renderer displays a PDF in a Swing GUI. What I meant was that, since PDF-Renderer can display PDFs that contain tables, its code obviously knows how to extract the information in those tables. So you could check what exactly that code does and adapt it to your purposes. This involves significant digging into that code, and will probably take a few days to accomplish. But it's the only way I can see to use free/open source code to accomplish your objective.
 