
Brendan Rhoads

Greenhorn
since Dec 05, 2011
Cows and Likes
Cows: 0 received (0 in last 30 days), 0 given
Likes: 0 received (0 in last 30 days), 2 given (0 in last 30 days)

Recent posts by Brendan Rhoads

I guess the actual relevance algorithm is left somewhat open-ended. I intended a fairly basic implementation based on how often a word appears on the site.

I do like the percentage-based idea, but most of the time the actual percentage would be below 1%, and I wanted to use integers as keys (although multiplying by 100.0 could solve that).
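
A minimal sketch of that scaling idea, assuming raw word counts are on hand (the class and method names here are assumptions, not the project's actual code):

public final class RelevanceKey {
    // Scale factor chosen so sub-1% frequencies don't truncate to 0;
    // 10,000 maps a frequency of 0.37% to the key 37. The exact factor
    // is an assumption and just needs to be large enough.
    private static final int SCALE = 10_000;

    /** Returns a larger integer key for more frequent words. */
    public static int frequencyKey(int wordCount, int totalWords) {
        if (totalWords == 0) {
            return 0; // avoid division by zero on empty pages
        }
        return (int) Math.round((double) wordCount / totalWords * SCALE);
    }
}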
10 years ago
As an intro: I am working on a project for a second-year data structures class, and we are not permitted to use any libraries other than the Java API.

For my project-- this part of it, anyway-- I am creating a word frequency tree of, basically, my school's whole domain, in order to build a search engine for it. I created a class to spider through the HTML looking for hrefs and generate a list of all sites reachable from a seed site (the home page). For each site, I then build a binary search tree of objects pairing a word from the site with how frequently it appears. A separate class, URLContent, holds the URL string together with the word frequency tree that goes with it.

Anyway, we're required to use a min-heap of URLContent objects (generated after the search of a keyword or keywords) in order to return the most relevant sites. However, I cannot, for the life of me, think of a good solution for the URLContent key. Essentially, the more relevant the site is, the lower its key should be.

My brute-force idea is to bake a class-level integer variable into the URLContent class and then subtract how often each of the search words appears from some initial number (say 100). However, this does not lend itself well to caching (the next part of my project).

1st question: Can anyone think of a good reason to use MinHeapPriorityQueue over a MaxHeapPriorityQueue here?
2nd question: Any supplemental ideas with key generation?
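
On the heap question, a minimal sketch assuming a URLContent with an integer relevance score (these names are assumptions): java.util.PriorityQueue is a min-heap by default, and reversing the comparator turns it into a max-heap, which sidesteps the subtract-from-a-constant key trick entirely.

import java.util.Comparator;
import java.util.PriorityQueue;

class URLContent {
    private final String url;
    private final int relevance; // higher = more relevant

    URLContent(String url, int relevance) {
        this.url = url;
        this.relevance = relevance;
    }

    String getUrl() { return url; }
    int getRelevance() { return relevance; }
}

public class HeapDemo {
    public static void main(String[] args) {
        // PriorityQueue is a min-heap by default; reversing the comparator
        // makes poll() return the HIGHEST relevance first.
        PriorityQueue<URLContent> maxHeap = new PriorityQueue<>(
                Comparator.comparingInt(URLContent::getRelevance).reversed());

        maxHeap.add(new URLContent("http://pvcc.edu/a", 12));
        maxHeap.add(new URLContent("http://pvcc.edu/b", 40));

        System.out.println(maxHeap.poll().getUrl()); // prints http://pvcc.edu/b
    }
}

If the assignment requires a hand-rolled min-heap, storing negated relevance (or comparing in reverse inside the heap) gets the same ordering.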

Thanks!
10 years ago
Thanks a lot for the responses. I'll briefly glance over the material you've linked, but I think I'll take your advice and just stick to the HTML. I suppose the best way to do that is simply not to include a URL if its address contains "pdf".
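
A one-line sketch of that filter (isCrawlable is an assumed name); checking the suffix rather than any substring avoids skipping pages like /pdf-guides.html:

// Skip PDF links before spidering them; a suffix check is safer than
// a substring check, which would also skip pages like /pdf-guides.html.
static boolean isCrawlable(String url) {
    return !url.toLowerCase().endsWith(".pdf");
}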
10 years ago
As an intro: I am working on a project for a second-year data structures class, and we are not permitted to use any libraries other than the Java API.

For my project-- this part of it, anyway-- I am creating a word frequency tree of, basically, my school's whole domain, in order to build a search engine for it. I created a class to spider through the HTML looking for hrefs and generate a list of all sites reachable from a seed site (the home page). For each site, I then build a binary search tree of objects pairing a word from the site with how frequently it appears. I have not had much trouble with this so far. However, I have run into an issue with web pages that are in PDF format-- http://pvcc.edu/docs/aac_services_resources.pdf, for example. My HTML parser just returns byte codes (I'm guessing) along with other gobbledygook.

Is there a way I can convert the .pdf to some sort of parseable format?
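
Extracting text from a PDF with only the Java API is impractical, since the format is binary and usually compressed, so the usual approach is to detect PDFs up front and skip them. A minimal sketch using a HEAD request (looksLikePdf is an assumed name):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class PdfCheck {
    // Issues a HEAD request (headers only, no body) and checks the
    // server-reported content type before downloading anything.
    static boolean looksLikePdf(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("HEAD");
        String type = conn.getContentType();
        conn.disconnect();
        return type != null && type.startsWith("application/pdf");
    }
}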

I would be willing to PM my parser/spider/data structure classes upon request. I feel uneasy posting them in plain sight without absolute need (at risk of unintentionally showing a classmate my final project's code).

Thanks
10 years ago