"No crawler can guarantee to never throw an OOME."
Yeah, that's probably true; software will never be perfect, and that's OK for me.
But I'm using a database in the backend, so there should be a way to accomplish this goal. At the moment my crawler works the following way:
1) Create the database "crawler".
2) Check whether the given URL is valid.
3) Get the host of the URL and create a table (if it does not exist) named after the host, with the columns (busy, processed, downloaded, level, url).
4) Add the host to a static HashMap.
5) Download the URL, process it (extract all URLs from the HTML page), and for each extracted URL repeat steps 2, 3, 4 and 5 (see the sketch after this list).
6) Go through the HashMap and check whether any host still has unprocessed entries at some level (0...x); if so, take that URL and repeat steps 2, 3, 4 and 5.
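To make steps 2-4 a bit more concrete, here is a rough Java sketch of the per-URL logic. The class and method names (CrawlerSketch, handleUrl, etc.) and the exact JDBC/SQL details are only illustrative assumptions, not my actual code:

    import java.net.URI;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class CrawlerSketch {
        // Step 4: one entry per host seen so far (thread-safe stand-in for the static HashMap).
        private static final Set<String> knownHosts = ConcurrentHashMap.newKeySet();
        private final Connection db;

        public CrawlerSketch(Connection db) { this.db = db; }

        // Steps 2-4 for a single URL.
        void handleUrl(String rawUrl, int level) throws Exception {
            URI uri;
            try {
                uri = new URI(rawUrl);                  // step 2: validate the URL
            } catch (Exception e) {
                return;                                 // skip malformed links
            }
            String host = uri.getHost();
            if (host == null) return;

            if (knownHosts.add(host)) {                 // step 4: remember the host
                createHostTable(host);                  // step 3: one table per host
            }
            enqueue(host, rawUrl, level);               // stored as "not processed" yet
        }

        // Step 3: table layout (busy, processed, downloaded, level, url) as described above.
        private void createHostTable(String host) throws Exception {
            try (PreparedStatement st = db.prepareStatement(
                    "CREATE TABLE IF NOT EXISTS " + sanitize(host) +
                    " (busy BOOLEAN, processed BOOLEAN, downloaded BOOLEAN," +
                    "  level INT, url VARCHAR(2048))")) {
                st.execute();
            }
        }

        private void enqueue(String host, String url, int level) throws Exception {
            try (PreparedStatement st = db.prepareStatement(
                    "INSERT INTO " + sanitize(host) + " VALUES (FALSE, FALSE, FALSE, ?, ?)")) {
                st.setInt(1, level);
                st.setString(2, url);
                st.executeUpdate();
            }
        }

        // Host names contain characters that are not legal in table names.
        private static String sanitize(String host) {
            return host.replaceAll("[^A-Za-z0-9_]", "_");
        }
    }

The real crawler additionally flags rows as busy/processed/downloaded when a worker picks them up (step 5 and 6); that bookkeeping is omitted here.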
This is working pretty well; I currently have nearly 2 million links and I can crawl across Yahoo. But unfortunately there is probably a bug somewhere, because last night the process stopped at some point. When I attached the debugger to check where the threads were waiting, they started working again, and I have no idea what went wrong. That's the reason why I'm looking for a professional solution, if there is one.
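One way to see what the threads are actually blocked on the next time this happens, without attaching a debugger, is a small watchdog that logs a thread dump periodically. This is just a generic sketch using ThreadMXBean, not something that is part of my crawler yet:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class StallWatchdog {
        public static void start() {
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor(r -> {
                        Thread t = new Thread(r, "stall-watchdog");
                        t.setDaemon(true);             // never keep the JVM alive on its own
                        return t;
                    });
            ThreadMXBean bean = ManagementFactory.getThreadMXBean();

            scheduler.scheduleAtFixedRate(() -> {
                // Report deadlocks explicitly, if the JVM detects any.
                long[] deadlocked = bean.findDeadlockedThreads();
                if (deadlocked != null) {
                    System.err.println("Deadlocked threads: " + deadlocked.length);
                }
                // Dump the state and top stack frame of every live thread.
                for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
                    StackTraceElement[] stack = info.getStackTrace();
                    System.err.printf("%s [%s] %s%n",
                            info.getThreadName(),
                            info.getThreadState(),
                            stack.length > 0 ? stack[0] : "(no frames)");
                }
            }, 1, 5, TimeUnit.MINUTES);
        }
    }

Calling StallWatchdog.start() once at startup would print every thread's state every few minutes, which should at least show whether the workers are BLOCKED, WAITING on a lock, or stuck in network I/O when the crawl stalls.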