Converting PDF to HTML or Parse PDF

 
Greenhorn
Posts: 4
As an intro, I am working on a project for a 2nd year data structures class, and we are not permitted to use any libraries other than the Java API.

For this part of my project, I am building a word-frequency tree of basically my school's whole domain, in order to create a search engine for it. I wrote a class that spiders through the HTML looking for hrefs, generates a list of all pages reachable from a seed page (the home page), and then builds a binary search tree of objects, each holding a word from a page and how often it appears. I have not had much trouble with this so far. However, I have run into an issue with pages that are in PDF format, for example http://pvcc.edu/docs/aac_services_resources.pdf. My HTML parser just returns raw bytes (I'm guessing) along with other gobbledygook.
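For readers following along, the word-frequency tree described above might look roughly like the sketch below. It uses only the standard Java API. The class and method names (`WordFrequencyTree`, `add`, `addAll`, `frequency`) are made up for illustration and are not the poster's actual code; the tokenising rule (split on anything that is not an ASCII letter or digit) is also an assumption.

```java
import java.util.Locale;

/** Illustrative word-frequency binary search tree; all names are hypothetical. */
class WordFrequencyTree {
    /** One node per distinct word, ordered by String.compareTo. */
    private static final class Node {
        final String word;
        int count = 1;
        Node left, right;
        Node(String word) { this.word = word; }
    }

    private Node root;

    /** Records one occurrence of the word, ignoring case. */
    void add(String rawWord) {
        String word = rawWord.toLowerCase(Locale.ROOT);
        if (root == null) { root = new Node(word); return; }
        Node current = root;
        while (true) {
            int cmp = word.compareTo(current.word);
            if (cmp == 0) { current.count++; return; }
            if (cmp < 0) {
                if (current.left == null) { current.left = new Node(word); return; }
                current = current.left;
            } else {
                if (current.right == null) { current.right = new Node(word); return; }
                current = current.right;
            }
        }
    }

    /** Splits text on non-alphanumeric characters (an assumed rule) and adds each token. */
    void addAll(String text) {
        for (String token : text.split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty()) add(token);
        }
    }

    /** Returns how often the word was seen, or 0 if it was never added. */
    int frequency(String rawWord) {
        String word = rawWord.toLowerCase(Locale.ROOT);
        Node current = root;
        while (current != null) {
            int cmp = word.compareTo(current.word);
            if (cmp == 0) return current.count;
            current = (cmp < 0) ? current.left : current.right;
        }
        return 0;
    }
}
```

Note that this plain tree is not self-balancing, so inserting words in sorted order would degrade it to a linked list.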

Is there a way I can convert the .pdf into some sort of parse-able format?

I would be willing to PM my parser/spider/data structure classes upon request. I feel uneasy posting them in plain sight without absolute need (at risk of unintentionally showing a classmate my final project's code).

Thanks


Sheriff
Posts: 28371
PDF already is a parsable format. (Parseable? Parsable? Anyway...) And there does exist Java code which parses PDF, just not code which is in the standard API. So you aren't allowed to use it.

The good news is that the spec for PDF is publicly available, and you should be able to track it down on Adobe's site somewhere. The bad news is that it is very large and complicated, and you probably don't have the (my rough guess) six months that it would take to implement even a useful subset of the spec. So probably your best bet is to write off the PDFs as unusable (okay, so "parsable" is the correct spelling) and stick to the HTML.
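Writing off the PDFs does not require parsing them. A PDF file normally begins with the ASCII header `%PDF-` followed by a version number, so a crawler can recognise one from the first few bytes it downloads and skip it. Here is a minimal sketch using only the standard API; the class name `PdfSniffer` and the method `looksLikePdf` are invented for this example, and it assumes the header sits at the very start of the data.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative helper; the class and method names are hypothetical. */
class PdfSniffer {
    // The header that opens a PDF file, e.g. "%PDF-1.4".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns true if the given bytes start with the PDF header.
     * Assumes 'head' holds the first bytes of the downloaded resource.
     */
    static boolean looksLikePdf(byte[] head) {
        if (head.length < PDF_MAGIC.length) return false;
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) return false;
        }
        return true;
    }
}
```

The spider could call `looksLikePdf` on the first bytes read from a connection and move on to the next URL when it returns true, rather than handing the data to the HTML parser.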
 
Bartender
Posts: 7645
PDF-Renderer is a Java library that can display PDFs (or at least some subset of them; I'm not sure about its scope). It's huge. So, Paul was spot on about the effort it would take to tackle this. Maybe you can negotiate something more manageable with your professor.
 
Brendan Rhoads
Greenhorn
Posts: 4
Thanks a lot for the responses. I'll briefly glance over the stuff you've linked, but I think I'll take your advice and just stick to the HTML. I suppose the best way to do that is simply to skip any URL whose address contains pdf.
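One caveat with a plain "contains pdf" test is that it would also skip ordinary pages such as a hypothetical `/pdf-help.html`. A slightly stricter check, sketched below with only the standard API, looks at whether the path part of the URL ends in `.pdf`. The class name `PdfLinkFilter` and the method `isPdfLink` are made up for illustration, and the choice to treat malformed URLs as "not a PDF" is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Illustrative link filter; the class and method names are hypothetical. */
class PdfLinkFilter {
    /**
     * Returns true if the URL's path ends with ".pdf" (case-insensitive).
     * Query strings and fragments are ignored because only the path is checked.
     * Malformed URLs are reported as non-PDF here (an assumption; the caller
     * may prefer to reject them outright).
     */
    static boolean isPdfLink(String url) {
        try {
            String path = new URI(url).getPath();
            return path != null && path.toLowerCase(Locale.ROOT).endsWith(".pdf");
        } catch (URISyntaxException e) {
            return false;
        }
    }
}
```

In the spider, a link for which `isPdfLink` returns true would simply not be added to the list of pages to visit. This check relies on the file extension only, so a PDF served from a URL without that extension would still get through.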
 