As an intro, I am working on a project for a 2nd year data structures class, and
we are not permitted to use any libraries other than the Java API.
For this part of my project, I am creating a word frequency tree of essentially my school's whole domain, in order to build a search engine for it. I wrote a class that spiders through HTML looking for hrefs and generates a list of all pages reachable from a seed page (the home page). It then builds a binary search tree of objects, each composed of a word from a page and how frequently it appears. I have not had much trouble with this so far.
However, I have run into an issue with pages that are in PDF format, for example http://pvcc.edu/docs/aac_services_resources.pdf. My HTML parser just returns raw bytes (I'm guessing) along with other gobbledygook.
Is there a way to convert the PDF into some sort of parseable format using only the Java API?
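One thing I can do with just the Java API is recognize these files: a PDF starts with the ASCII bytes %PDF-. Below is a minimal sketch of that check, so the spider could set such pages aside instead of feeding them to the HTML parser. The class and method names (PdfSniffer, looksLikePdf) are made up for this post, and it assumes the first few bytes of the download are already in a byte array. It only detects a PDF; it does not extract any text.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative helper (hypothetical name): detects a PDF by its leading bytes. */
class PdfSniffer {

    // PDF files begin with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes of a download start with "%PDF-". */
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```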
I would be willing to PM my parser/spider/data structure classes upon request. I feel uneasy posting them in plain sight without absolute need (at risk of unintentionally showing a classmate my final project's code).
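In the meantime, here is a simplified sketch of the kind of word frequency tree I described. It is not my actual project code. The class and method names (WordFrequencyTree, add, addText, frequency, words) are made up for this post, and the tokenizing rule (lowercase everything, split on anything that is not a letter or digit) is just an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified word-frequency binary search tree (illustrative only). */
class WordFrequencyTree {

    /** One node: a word plus how many times it has been seen. */
    private static final class Node {
        final String word;
        int count = 1;
        Node left;
        Node right;

        Node(String word) {
            this.word = word;
        }
    }

    private Node root;

    /** Records one occurrence of the word, inserting a node if it is new. */
    void add(String word) {
        String key = word.toLowerCase();
        if (root == null) {
            root = new Node(key);
            return;
        }
        Node current = root;
        while (true) {
            int cmp = key.compareTo(current.word);
            if (cmp == 0) {
                current.count++;
                return;
            }
            if (cmp < 0) {
                if (current.left == null) {
                    current.left = new Node(key);
                    return;
                }
                current = current.left;
            } else {
                if (current.right == null) {
                    current.right = new Node(key);
                    return;
                }
                current = current.right;
            }
        }
    }

    /** Splits text on runs of non-alphanumeric characters and records each word. */
    void addText(String text) {
        for (String token : text.split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty()) {
                add(token);
            }
        }
    }

    /** Returns how often the word was recorded, or 0 if it was never seen. */
    int frequency(String word) {
        String key = word.toLowerCase();
        Node current = root;
        while (current != null) {
            int cmp = key.compareTo(current.word);
            if (cmp == 0) {
                return current.count;
            }
            current = (cmp < 0) ? current.left : current.right;
        }
        return 0;
    }

    /** Returns all distinct words in alphabetical order (in-order traversal). */
    List<String> words() {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private void collect(Node node, List<String> out) {
        if (node == null) {
            return;
        }
        collect(node.left, out);
        out.add(node.word);
        collect(node.right, out);
    }
}
```

The idea is that the spider calls addText on the text of each page, and the search side calls frequency for a query word. This works for the HTML pages but not for the PDF, since its raw contents do not tokenize into real words.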
Thanks