
Crawler getting slow when downloaded content or time increases

 
Mano Krrish
Greenhorn
Posts: 13
Hi, please help me.

I have been facing this problem for a long time. I am using a crawler, a standalone Java program, to crawl the web. Currently I am crawling nearly 40 websites with it. It works fine until it has crawled around 25 websites; after that it gradually slows down and finally takes a very long time to complete. I also checked the JVM memory: nearly 70 to 90 percent of it is always free.

Because of this problem, I cannot increase the number of websites to crawl.

Can anyone please give me a solution for this?

Thanks in advance.
 
Wouter Oet
Bartender
Posts: 2700
Hi Mano and welcome to the JavaRanch.

We have no idea how your program works, so we can't say anything useful. I recommend profiling your application to find where the problems are, then fixing those. Also, since your program is not memory bound, and I assume not I/O bound, I think you should look at multi-threading your application.
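
Before reaching for a full profiler, one low-tech starting point is to time each website's crawl and compare the numbers. The sketch below is only an illustration: `CrawlTimer`, `time`, `report`, and `pretendCrawl` are hypothetical names, and the sleep calls stand in for real crawling work that is not shown in this thread.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

public class CrawlTimer {

    // Thread-safe so that several crawler threads can record timings at once.
    private final Map<String, Long> elapsedMillis = new ConcurrentHashMap<>();

    /** Runs the job and records how long it took under the given label. */
    public void time(String label, Runnable job) {
        long start = System.nanoTime();
        try {
            job.run();
        } finally {
            long nanos = System.nanoTime() - start;
            elapsedMillis.put(label, TimeUnit.NANOSECONDS.toMillis(nanos));
        }
    }

    /** Returns a snapshot of the recorded timings, sorted by label. */
    public Map<String, Long> report() {
        return new TreeMap<>(elapsedMillis);
    }

    // Hypothetical placeholder for the real crawl of one website.
    private static void pretendCrawl(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        CrawlTimer timer = new CrawlTimer();
        timer.time("site-01", () -> pretendCrawl(20));
        timer.time("site-02", () -> pretendCrawl(40));
        for (Map.Entry<String, Long> e : timer.report().entrySet()) {
            System.out.println(e.getKey() + " took " + e.getValue() + " ms");
        }
    }
}
```

If the later websites take much longer than the earlier ones for a similar number of pages, the slowdown is cumulative (for example, a growing data structure or leaked connections) rather than caused by particular slow sites.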
 
Mano Krrish
Greenhorn
Posts: 13
Thanks for your response, Wouter Oet.
I am already using multi-threading in my application, though I am not sure about its efficiency. Here is a sample of the threading approach I am using. Could you or anyone else give me some guidance?



Creating the connection in between, and then...



"Outlet_ID" is the unique ID for each website.
"startUrl" is the home page of the website.



And then the removal of the completed thread and the start of the next thread are done in the process given below:




What I am doing in the above code is creating a thread object for every website and queuing them all in a linked list. I then start 15 threads from that list. When any thread finishes its work, I remove it from the queue and start the next one at the head of the queue.

Please let me know if any change to this threading would make the process more efficient.
 
Wouter Oet
Bartender
Posts: 2700
There is still no way of knowing where the problems are. You should profile your application. You could simplify your multithreading approach by using executors from the java.util.concurrent package.
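
As a rough illustration of the executor suggestion, a fixed thread pool can replace the hand-managed linked list of threads: the pool keeps at most 15 tasks running and picks up the next queued one as soon as a worker becomes free. This is only a sketch under assumptions. `CrawlerPool`, `SiteTask`, `crawlAll`, and the example.com URLs are hypothetical, and the task body is a placeholder, since the original crawl code is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CrawlerPool {

    /** Hypothetical stand-in for one website's crawl job (outlet ID plus start URL). */
    static final class SiteTask implements Callable<String> {
        final int outletId;
        final String startUrl;

        SiteTask(int outletId, String startUrl) {
            this.outletId = outletId;
            this.startUrl = startUrl;
        }

        @Override
        public String call() {
            // Real crawling would go here; this placeholder only echoes its input.
            return outletId + ":" + startUrl;
        }
    }

    /** Runs all tasks on at most maxThreads workers; results come back in submission order. */
    static List<String> crawlAll(List<SiteTask> tasks, int maxThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            // invokeAll blocks until every task has finished (or failed).
            List<Future<String>> futures = pool.invokeAll(tasks);
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // rethrows a task's exception, if any
            }
            return results;
        } finally {
            pool.shutdown(); // let the worker threads exit once the work is done
        }
    }

    public static void main(String[] args) throws Exception {
        List<SiteTask> tasks = new ArrayList<>();
        for (int i = 1; i <= 40; i++) {
            tasks.add(new SiteTask(i, "http://example.com/site" + i));
        }
        List<String> results = crawlAll(tasks, 15);
        System.out.println(results.size() + " sites crawled");
    }
}
```

With this shape there is no need to poll for finished threads or remove them from a list by hand; the executor's internal queue does that bookkeeping.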
 
Campbell Ritchie
Marshal
Posts: 52664
Sounds like something too difficult for "beginning"; not quite sure where to move it, but let's try the threads forum in the first instance.
 
Mano Krrish
Greenhorn
Posts: 13
I have now tried profiling my application. Profiling is very new to me. I used HPROF, and I think it can be used for optimization well beyond what I have understood so far. So I am posting the heap dump from the profiling run here. Could anyone help with this? Also, please suggest where optimization is required.



And a few important stack traces are:

 