
Invite your threading advice

 
Julia Reynolds
Ranch Hand
Posts: 123
I've written a Java web scraper. For each lookup key (currently 500-1000 at a time) it uses Apache Commons HttpClient to make a connection and retrieve the page. Then I parse the HTML for the desired data and pop it into a Hashtable, using the original key as the hash key.

After all data is fetched, I iterate through it and update the database with the fetched values.

Here's my question: Should my scraper utilize threads to increase performance? If so, what are the benefits/pitfalls of threading here?

Generally, what indicates that threading should be implemented, especially when writing a utility program such as this one?

Thanks for your time,

Julia
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
Parallel threads can speed up a process that spends time waiting for something outside the JVM to happen. Retrieving a web page is a good example. Your thread executes zero instructions while waiting for a response over the network, so another thread would have a good opportunity to run. At the opposite end, a process that is CPU bound, maybe doing some deep math, would not be a good candidate for threading on a single processor because the CPU has no idle time in which to run another thread.

So with that in mind, how many threads? Good question! You could try adding more and more until you saturate the CPU or the network, then back off a bit so you're not making the JVM work so hard on thread management that it can't do your real work. The whole 1,000 would probably be a Bad Thing.
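One way to try the "add more until it stops helping" approach is to time the same batch of work at several pool sizes. The sketch below is only an illustration: PoolSizeTrial and timeRun are made-up names, and Thread.sleep stands in for the network wait of a real page fetch, so the numbers say nothing about your actual scraper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PoolSizeTrial {
    /**
     * Runs taskCount simulated fetches on a fixed pool of poolSize threads
     * and returns the elapsed wall-clock time in milliseconds.
     */
    static long timeRun(int poolSize, int taskCount, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            // Assumption: sleeping stands in for waiting on an HTTP response.
            tasks.add(() -> { Thread.sleep(waitMillis); return null; });
        }
        long start = System.nanoTime();
        try {
            pool.invokeAll(tasks); // blocks until every task has completed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Calling timeRun(n, 40, 200) for n = 1, 2, 4, 8, 16 and printing the results should show the elapsed time dropping as threads are added. With real fetches the curve flattens once the network or the remote server becomes the bottleneck, and that is roughly where to stop.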

How do you control the number of threads and know when they're done? The number is easy. If you're on JDK 5, look at thread pooling with the Executor framework in java.util.concurrent (ExecutorService and the Executors factory class). In earlier JDKs get another thread pool, maybe from Apache Commons. Both are pretty easy to use. You can put your 1,000 requests into a queue as Commands and know you are done when the queue is empty. Hmm, not quite: you'd only know when all the commands have been picked up and started, not finished. Any other ideas on how to know when you're done? The number of values in the map equals the number of commands? That might hang forever if one command threw an exception and never put a result in the map.
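For the "how do I know when they're done" question, one possible answer on JDK 5 and later is to submit Callables and wait on the returned Futures. A minimal sketch, assuming the page fetch-and-parse step can be wrapped as a function from key to value; PooledScraper, scrapeAll, fetcher, and failed are hypothetical names, and the lambda syntax needs Java 8 or later (on JDK 5 you would write anonymous Callable classes instead).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class PooledScraper {
    /**
     * Runs fetcher once per key on a fixed-size pool and returns key -> value.
     * Keys whose task threw an exception are added to 'failed', so one bad
     * page cannot leave the caller waiting forever.
     */
    static Map<String, String> scrapeAll(List<String> keys,
                                         Function<String, String> fetcher,
                                         int poolSize,
                                         List<String> failed) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String key : keys) {
                tasks.add(() -> fetcher.apply(key));
            }
            // invokeAll returns only after every task has finished or failed;
            // the futures come back in the same order as the task list.
            List<Future<String>> futures = pool.invokeAll(tasks);
            Map<String, String> results = new HashMap<>();
            for (int i = 0; i < keys.size(); i++) {
                try {
                    results.put(keys.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    failed.add(keys.get(i)); // the task threw; record and move on
                }
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the results map is filled in only by the calling thread after the tasks complete, it does not need to be synchronized, and the database update can start as soon as scrapeAll returns.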
 
Free Bird (Kynard)
Greenhorn
Posts: 3
Hi Julia,
Why don't you think of doing something like this?
The moment you have parsed a page and found the required data, write it to a logical queue, which is a synchronized resource. Another thread can read the queue and update the database. I think you will achieve good performance with this kind of system.

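import java.util.Queue;

// Producer: reads a page, extracts the data, and puts it on the shared queue.
class PageReader
{
    private final Queue<String> queue; // element type is illustrative

    PageReader(Queue<String> objQue) { this.queue = objQue; }

    void readPage() {
        // read and parse whatever
    }

    void findData() {
        // find your stuff
    }

    void writeToQueue(String data)
    {
        // write your stuff to the queue
        synchronized (queue) { queue.add(data); }
    }
}

// Consumer: runs on its own thread, checks the queue, and updates the database.
class DBUpdate {
    private final Queue<String> queue;

    DBUpdate(Queue<String> objQue) { this.queue = objQue; }

    boolean checkForQUpdate()
    {
        // read the queue and find updates
        synchronized (queue) { return !queue.isEmpty(); }
    }

    void writeToDb()
    {
        if (checkForQUpdate())
        {
            // write your stuff to the db
        }
    }
}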

Have a thread service that coordinates the two activities; the synchronized resource is the Queue.
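On JDK 5 and later, java.util.concurrent.BlockingQueue already provides the synchronized queue, so the idea above can be sketched without hand-written locking. This is only an illustration under stated assumptions: QueuePipeline, run, and the DONE marker are made-up names, and an in-memory list stands in for the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueuePipeline {
    // Unique marker object telling the consumer that no more data is coming.
    // It is compared by identity, so a real value "DONE" is not mistaken for it.
    private static final String DONE = new String("DONE");

    /** Pushes each parsed value through a queue to a single writer thread. */
    static List<String> run(List<String> parsedValues) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> written = new ArrayList<>(); // stand-in for the database

        Thread dbWriter = new Thread(() -> {
            try {
                while (true) {
                    String item = queue.take(); // blocks until data is available
                    if (item == DONE) break;    // identity check on the marker
                    written.add(item);          // a real version would update the DB here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        dbWriter.start();

        try {
            for (String value : parsedValues) {
                queue.put(value);               // producer side: the page reader
            }
            queue.put(DONE);
            dbWriter.join();                    // wait until the writer has drained the queue
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return written;
    }
}
```

With several page-reader threads feeding the queue, the end marker should be sent only after all of them have finished, for example once their pool has been shut down and awaited.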
 
Julia Reynolds
Ranch Hand
Posts: 123
Thanks Stan and Mr. Bird for the good advice. I'll get to work on version two of my web scraper.

Julia
 