Ranch Hand
Posts: 688
I'm writing a web crawler that uses a thread pool and callbacks to speed up the process. Everything works fine, but a new requirement has come up: build a dynamic map of the links and pages as they are crawled. The output should look something like this:
ROOT (www.test.com)
|
|-> PAGE_1 (www.test.com/index.html)
| |-> PAGE_1_1 (www.test.com/help/help_1.html)
| |-> PAGE_1_2 (www.test.com/help/help_2.html)
|-> PAGE_2
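Each page would probably become a node that holds its URL plus the child links found on it. Something like this is roughly what I have in mind (PageNode and the method names are just placeholders, not code from my project):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class PageNode {
    final String url;                                              // page this node represents
    final List<PageNode> children = new CopyOnWriteArrayList<>();  // links parsed out of this page

    PageNode(String url) {
        this.url = url;
    }

    // called from a worker's callback whenever a link is parsed out of this page
    PageNode addChild(String childUrl) {
        PageNode child = new PageNode(childUrl);
        children.add(child);
        return child;
    }

    // dump the tree, using simple indentation rather than the exact |-> layout above
    void print(String indent) {
        System.out.println(indent + url);
        for (PageNode child : children) {
            child.print(indent + "    ");
        }
    }
}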
Right now my program downloads the first page, parses out the links, creates a new download task for each link and submits it to the thread pool, then continues parsing data from the current page while the other pages are being downloaded and parsed.
My question is: what approach can I use to build this map while the crawl is running, and how can I tell when the end is reached, i.e. when there are no pages left to process?
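The only idea I've come up with so far for detecting the end is to keep a counter of downloads that have been submitted but not yet finished, and treat the crawl as done when that counter drops back to zero. A rough sketch, reusing the PageNode above (downloadAndParseLinks is just a stand-in for my real download/parse code):

import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class Crawler {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);
    private final Set<String> seen = ConcurrentHashMap.newKeySet();  // don't crawl the same URL twice
    private final AtomicInteger pending = new AtomicInteger(0);      // submitted but not yet finished
    private final CountDownLatch done = new CountDownLatch(1);
    private volatile PageNode root;                                  // top of the link map

    void submit(final String url, final PageNode parent) {
        if (!seen.add(url)) {
            return;                              // already queued or crawled
        }
        pending.incrementAndGet();               // one more outstanding task
        pool.execute(() -> {
            try {
                PageNode node;
                if (parent == null) {
                    node = new PageNode(url);
                    root = node;
                } else {
                    node = parent.addChild(url);
                }
                for (String link : downloadAndParseLinks(url)) {
                    submit(link, node);          // each child is counted before this task finishes
                }
            } finally {
                if (pending.decrementAndGet() == 0) {
                    done.countDown();            // nothing outstanding anywhere -> the crawl is over
                }
            }
        });
    }

    void crawl(String rootUrl) throws InterruptedException {
        submit(rootUrl, null);
        done.await();                            // blocks until the last outstanding task finishes
        pool.shutdown();
        root.print("");                          // dump the finished map
    }

    // stand-in for my real download-and-parse step
    private List<String> downloadAndParseLinks(String url) {
        return Collections.emptyList();
    }
}

The part I'm relying on is that every new link increments the counter before its parent task decrements it, so the counter can only reach zero once the whole tree has been processed. Does that sound reasonable, or is there a more standard way of doing this?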
Any help is appreciated.
 