MinHeap

 
Brendan Rhoads
Greenhorn
Posts: 4
As an intro, I am working on a project for a second-year data structures class, and we are not permitted to use any libraries other than the Java API.

For my project-- this part of it, anyway-- I am creating a word-frequency tree of, basically, my school's whole domain, in order to build a search engine for it. I wrote a class that spiders through the HTML looking for hrefs and generates a list of all sites reachable from a seed site (the home page), and then builds a binary search tree of objects, each composed of a word from the site and how frequently it appears. Then I have a separate class, URLContent, that contains the String of the URL and the word-frequency tree that goes with it.

Anyway, we're required to use a minheap of URLContent objects (generated after searching for one or more keywords) in order to return the most relevant sites. However, I cannot, for the life of me, think of a good solution for the URLContents' key. Essentially, the more relevant a site is, the lower its key should be.

My brute-force idea is to bake a class-level integer variable into the URLContent class-- and then subtract, from the initialized number (say 100), how often each of the search words appears. However, this does not lend itself well to caching (the next part of my project).
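
Something like this rough sketch is what I have in mind (the names and the baseline of 100 are placeholders, nothing settled):

public class URLContent implements Comparable<URLContent> {
    private static final int BASELINE = 100; // placeholder starting key

    private final String url;
    private int key = BASELINE; // lower key = more relevant

    public URLContent(String url) {
        this.url = url;
    }

    public String getUrl() {
        return url;
    }

    public int getKey() {
        return key;
    }

    // Called once per search word with that word's frequency on this page.
    public void applyTermFrequency(int frequency) {
        key -= frequency; // more hits push the key lower
    }

    @Override
    public int compareTo(URLContent other) {
        return Integer.compare(key, other.key); // natural order matches min-heap order
    }
}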

1st question: Can anyone think of a good reason to use a MinHeapPriorityQueue over a MaxHeapPriorityQueue here?
2nd question: Any supplemental ideas for key generation?

Thanks!



 
Stephan van Hulst
Saloon Keeper
Posts: 15510

Brendan Rhoads wrote: 1st question: Can anyone think of a good reason to use a MinHeapPriorityQueue over a MaxHeapPriorityQueue here?

Well, you presented it as a requirement. If that's what they want you to do, isn't it reason enough?
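
As an aside: the min/max distinction is really just the comparison order. A tiny illustration using java.util.PriorityQueue (which is part of the Java API, though your course may still want a hand-rolled heap):

import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class HeapDirectionDemo {
    public static void main(String[] args) {
        // PriorityQueue is a min-heap under natural ordering...
        PriorityQueue<Integer> minHeap = new PriorityQueue<>();
        // ...and becomes a max-heap just by reversing the comparator.
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Comparator.reverseOrder());

        for (int key : List.of(42, 7, 19)) {
            minHeap.add(key);
            maxHeap.add(key);
        }

        System.out.println(minHeap.poll()); // 7  -> "lower key = more relevant" convention
        System.out.println(maxHeap.poll()); // 42 -> "higher score = more relevant" convention
    }
}

Either heap works, as long as the key convention matches.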

Brendan Rhoads wrote: 2nd question: Any supplemental ideas for key generation?

How do you want to measure relevance? The number of words on a page that match the keyword? The percentage of words that match? Maybe the average difference between each word on the page and the keyword?

I think you first need to decide when a page is relevant.
 
Brendan Rhoads
Greenhorn
Posts: 4
I guess the actual relevance algorithm is sort of left open-ended. I intended a pretty basic implementation based on how often a word appears on the site.

I do like the percentage-based idea, but most of the time the actual percentage would be below 1%, and I wanted to use integers as keys (although multiplying by a large enough constant and rounding could solve that).
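
For example, a hypothetical helper along these lines (the scale factor of 10,000 is arbitrary):

public class RelevanceKeys {
    private static final int SCALE = 10_000; // 100.00% == 10,000 "basis points"

    // Hypothetical: scale the match fraction up to an integer, then invert it
    // so that a better match yields a LOWER key (min-heap order).
    public static int relevanceKey(int matchingWords, int totalWords) {
        int score = (int) Math.round((double) matchingWords / totalWords * SCALE);
        return SCALE - score;
    }
}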
 
Stephan van Hulst
Saloon Keeper
Posts: 15510
Why limit yourself to integers? Real numbers seem like an excellent key for ranking elements by relevance.

Personally I would go for the average Levenshtein distance between each word on the page and the keyword. But I guess that's mostly an implementation detail. You can play around to see what would give you the best results.
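
For reference, the textbook dynamic-programming version of that distance looks roughly like this (illustration only; averaging it over all words on a page is up to you):

public class Levenshtein {
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // i deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // j insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1,     // delete from a
                                 d[i][j - 1] + 1),    // insert into a
                        d[i - 1][j - 1] + cost);      // substitute
            }
        }
        return d[a.length()][b.length()];
    }
}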

Regardless, you can start out with a Catalog class to which you can add URLs; at that point it determines the count of each word on the page and adds the results to your list of URLContents. Instead of a list, you could consider a Map<URL, Map<String, Integer>>, or a Map<URL, WordTree> if you really must use your own binary search tree.

This Catalog class could then have a search(String keyword) method that returns a Heap<URL> prioritized according to the map structure inside the catalog; roughly like the sketch below.
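
Only a sketch, with java.util.PriorityQueue standing in for your own Heap<URL> and the word counting stubbed out:

import java.net.URL;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class Catalog {
    private final Map<URL, Map<String, Integer>> wordCounts = new HashMap<>();

    // In the real project this would fetch the page and count its words;
    // here the caller supplies the counts directly.
    public void add(URL url, Map<String, Integer> countsForPage) {
        wordCounts.put(url, countsForPage);
    }

    public PriorityQueue<URL> search(String keyword) {
        // Key convention from this thread: lower key = more relevant,
        // so the key is simply the negated occurrence count of the keyword.
        Comparator<URL> byKey = Comparator.comparingInt(
                (URL url) -> -wordCounts.get(url).getOrDefault(keyword, 0));
        PriorityQueue<URL> results = new PriorityQueue<>(byKey);
        results.addAll(wordCounts.keySet());
        return results; // poll() yields the most relevant URL first
    }
}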

I'm just spitballing here though.
 