How can I convert a PDF file to an XML file?

 
Greenhorn
Posts: 8
I want to convert a PDF file into an XML file. The PDF may contain any kind of content, such as tables and text. Can anyone give me source code or any other information about this?
 
Bartender
Posts: 9626
PDF is, by design, not an easy format to manipulate. It is intended to be a finished product rather than an editable format (like RTF, DOC, HTML and so on). Our AccessingFileFormats FAQ lists the options available for working with it.
 
Ranch Hand
Posts: 1970
A PDF is a description of how to render a document on a page: instructions like "draw a vertical line here" or "write 'foo bar baz' here in Courier". It generally does not contain any information about the structure or organisation of what it is rendering, so you won't be able to tell that you're looking at a table, a list of bullet points, a paragraph, or anything like that.

The PDF format does contain information on a page-by-page basis. Therefore, page breaks are the one piece of format/organisation information that you can find.

If you want anything more than a raw stream of completely unformatted, unorganised text, one stream per page, you are out of luck. Recovering tables, lists or paragraphs reliably is virtually impossible.
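
For what it's worth, here is a minimal sketch of the one thing that is feasible: wrapping raw per-page text in XML. It assumes you have already extracted one string per page with some PDF library (Apache PDFBox's PDFTextStripper is one option; that step is not shown). The class name PageTextToXml and the element names "document" and "page" are made up for illustration. Only the standard javax.xml APIs are used.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: wraps already-extracted page text in a simple XML document.
final class PageTextToXml {

    // pageTexts holds one raw text string per PDF page, in page order
    // (assumed to come from a PDF text-extraction library).
    static String toXml(List<String> pageTexts) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element root = doc.createElement("document");
            doc.appendChild(root);

            int pageNumber = 1;
            for (String text : pageTexts) {
                // Page breaks are the only structure we can preserve.
                Element page = doc.createElement("page");
                page.setAttribute("number", Integer.toString(pageNumber++));
                // setTextContent stores plain text; the serializer escapes < and &.
                page.setTextContent(text);
                root.appendChild(page);
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not build XML", e);
        }
    }
}
```

Calling PageTextToXml.toXml(List.of("first page text", "second page text")) gives you one "page" element per input string. Anything richer than that (tables, lists, headings) would have to be guessed from the text yourself.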
 