Dividing a large list of items between threads

 
Mike Curwen
Ranch Hand
Posts: 3695
IntelliJ IDE Java Ubuntu
I'm looking for advice on the best approach to a problem like the following. I have an object that is responsible for retrieving rows from a database and doing quite a bit of processing for each entry. It also runs further queries for each row, does a bit of templating and merging, and potentially generates an email. Right now, the object just takes the selection criteria in the form of a key: work on all the rows that have 'this' value in the status field.



What if my rowcount is in the thousands? Right now it's just hundreds, and a single thread takes an acceptable amount of time. Eventually, I'm going to click a button, and it will take all night. So I want to multithread this...

My question is: where do the threads go?

1)
Should I produce a 'Manager' object? Move the overall query from the existing class into this manager object, turn it into a SELECT COUNT(*) rather than SELECT *, divide the count by 'maxThreads', and then instantiate 'maxThreads' instances of my (now slightly modified) processing object. Each modified object would then work on its own subset of rows. The only change to the existing processor object (I think) is that the method signature would take two long ids rather than one, and the query would change from SELECT * WHERE status=id to SELECT * WHERE row_id BETWEEN first_id AND last_id. Otherwise, all the code remains the same.
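To make option 1 concrete, here's a minimal sketch of just the division step the 'Manager' would do after its SELECT COUNT(*). The class and method names are mine, not from the thread, and it only works if row ids are contiguous and gap-free:

```java
public class RangeSplitDemo {

    // Compute [first, last] id ranges (1-based, inclusive), one per worker.
    // Assumes row ids run 1..rowCount with no gaps.
    static long[][] split(long rowCount, int maxThreads) {
        long[][] ranges = new long[maxThreads][];
        long chunk = (rowCount + maxThreads - 1) / maxThreads;  // ceiling division
        for (int t = 0; t < maxThreads; t++) {
            long first = t * chunk + 1;
            long last = Math.min((t + 1) * chunk, rowCount);
            ranges[t] = new long[]{first, last};
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : split(1000, 3)) {
            System.out.println(r[0] + " - " + r[1]);  // 1 - 334, 335 - 668, 669 - 1000
        }
    }
}
```

Each worker would then run its SELECT ... WHERE row_id BETWEEN first AND last with its own pair.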

2)
Forget the 'Manager' and simply implement threads within the existing class?

Is there a best practice here? Is there a better idea?



hmm... I just thought better of this. While currently my rows are pretty much always in a contiguous block, this is not guaranteed. And the way MySQL does LIMIT is not safe either:
If you use LIMIT # with ORDER BY, MySQL will end the sorting as soon as it has found the first # lines instead of sorting the whole table.


So I think I need to retrieve all the PKs and stuff them into a List, then send sublists into the processor object. Yes?
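That sublist step might look like the following sketch (names are mine, for illustration); List.subList hands each worker a view of its slice without copying:

```java
import java.util.ArrayList;
import java.util.List;

public class SublistDemo {

    // Split a list of primary keys into (at most) 'parts' roughly equal
    // chunks, one per worker. In the real code, 'keys' would come from a
    // SELECT of the primary keys matching the status criterion.
    static <T> List<List<T>> partition(List<T> keys, int parts) {
        List<List<T>> chunks = new ArrayList<>();
        if (keys.isEmpty()) return chunks;
        int size = (keys.size() + parts - 1) / parts;  // ceiling division
        for (int i = 0; i < keys.size(); i += size) {
            chunks.add(keys.subList(i, Math.min(i + size, keys.size())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 1; i <= 10; i++) keys.add(i);
        System.out.println(partition(keys, 3));  // [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
    }
}
```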
[ December 31, 2003: Message edited by: Mike Curwen ]
 
David Weitzman
Ranch Hand
Posts: 1365
In an ideal world you would be able to isolate the blocking operations and just do those in a thread pool, but alas it sounds like your code will be doing a lot of networking and other blocking operations all the way through.
I vote #2.
When I'm converting single-threaded code to multi-threaded code, I like to start out as simple as possible and keep my code looking approximately the same.

I figure fewer changes decrease the chances of introducing bugs into already working code. No counting or clever allocating necessary. The database is probably also optimized for lots of accesses to the same region. But either way, it's the simplest thing that could possibly work and it's a good way to go about doing things if you're using Test Driven Development.
[ December 31, 2003: Message edited by: David Weitzman ]
 
Stan James
(instanceof Sidekick)
Posts: 8791
I like Dave's idea. To describe it in one sentence, I'd use workflow words and say: create a single queue of work that needs to be done, and any number of threads that pull the "next" piece of work from the queue. If you need to generalize the queue to hold different types of work, consider putting "command" objects in the queue instead of simple data structures. Sounds like fun!
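A small sketch of that single-queue scheme, using the java.util.concurrent classes (which postdate the original posts); each Runnable stands in for a command object, and the counter stands in for the real per-row work:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkQueueDemo {

    // Fill one shared queue with "command" objects, then let a fixed number
    // of worker threads pull from it until it runs dry. Returns how many
    // commands were executed.
    static int runAll(int jobs, int nThreads) throws InterruptedException {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < jobs; i++) {
            queue.add(done::incrementAndGet);   // stand-in for per-row work
        }
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Thread(() -> {
                Runnable job;
                while ((job = queue.poll()) != null) {  // null means queue is empty
                    job.run();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAll(100, 4));  // 100
    }
}
```

Because the queue is fully loaded before the workers start, a null from poll() reliably means "no work left", so the threads drain the queue and exit.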
 
Mike Curwen
Ranch Hand
Posts: 3695
IntelliJ IDE Java Ubuntu
Ok, so I've got my head around that first idea (don't split the tasks; let the threads 'naturally' split them).

I had thought to use Doug Lea's package, specifically PooledExecutor, to get this 'natural' splitting of the large task between a number of threads. But the problem is in the pool.execute(FOO).

All the PooledExecutor gives me is a bunch of 'dumb threads' that need to be told what to do (given a command object, FOO). Except my particular FOO object contains a number of heavyweight components that I'd want to re-use: things like the manager object, manipulator object, template merger, etc.

So what I really want is a 'smart' PooledExecutor or perhaps I should say PooledCommand ?

From Doug Lea's package, the FJTask and FJTaskRunner seem a closer fit. In particular, the concept of one task thread 'stealing' work from another thread's queue is great. But then doesn't this lead me back to packaging the large task into discrete chunks, with separate queues for separate threads... something I might try to avoid, if I could?

Am I stretching myself in the wrong directions?
[ January 05, 2004: Message edited by: Mike Curwen ]
 
David Weitzman
Ranch Hand
Posts: 1365
This pattern will be a little more complex to implement because, as you've noticed, you can't use local variables to hold objects for reuse in each thread.
You can, however, use ThreadLocal variables to accomplish the same thing. If you use setThreadFactory() to assign your PooledExecutor a custom ThreadFactory, you can wrap the passed Runnable inside another Runnable that builds the desired objects and stores them in ThreadLocals. You can also rely on the ThreadLocal initialValue().
However, this doesn't look like a textbook case where a thread pool is necessary. Work doesn't need to be queued as it's found; there's a constant stream of it. The only real advantage I see to a thread pool is that you don't have to worry about uncaught RuntimeExceptions here or there. Typically people use thread pools when they'd like to use an infinite number of threads but can't actually spare that many (i.e. thread per connection to a server). Here you're using a fixed number: the maximum the pool will support. There shouldn't be any need to set keep-alive time or a minimum pool size. Just launching a fixed number of threads would work equally well.
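For illustration, here's what per-thread reuse via ThreadLocal looks like. This uses the later withInitial() convenience method; code from the thread's era would override initialValue() on an anonymous ThreadLocal subclass instead. The component name is hypothetical:

```java
public class ThreadLocalDemo {

    // Stand-in for one of the heavyweight components to be reused
    // (template merger, manager object, etc.; the name is illustrative).
    static class TemplateMerger { }

    // Each thread gets its own instance, created lazily on first get()
    // and reused for every subsequent task that same thread runs.
    static final ThreadLocal<TemplateMerger> MERGER =
            ThreadLocal.withInitial(TemplateMerger::new);

    public static void main(String[] args) throws InterruptedException {
        TemplateMerger mine = MERGER.get();
        System.out.println(mine == MERGER.get());  // true: same thread, same instance

        Thread other = new Thread(
                () -> System.out.println(MERGER.get() == mine));  // false: new thread, new instance
        other.start();
        other.join();
    }
}
```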
[ January 05, 2004: Message edited by: David Weitzman ]
 
Stan James
(instanceof Sidekick)
Posts: 8791
Interesting. I was thinking of casting the control the other way: the runnable object would pull the "next" command from the queue itself, and the main program would just fill the queue and start the threads.

Does that make sense? This eliminates the thread pool idea, but does not support keeping idle threads around between batches of commands. When you empty the queue, the threads all go away.
 
Anton Golovin
Ranch Hand
Posts: 531

Mike Curwen wrote:I'm looking for advice on the best approach to a problem like the following. I have an object that is responsible for retrieving rows from a database and doing quite a bit of processing for each entry. [...] So I think I need to retrieve all the PK's and stuff them into a List. Then send sublists into the processor object. Yes?



Hi, Mike,

Don't reinvent the wheel. What you are trying to do is processing- and IO-intensive, so use an available enterprise-grade component such as Quartz to process your workload. You can cluster Quartz, but you will have to come up with a mechanism for partitioning your data yourself so that each Quartz job picks up a portion of your total workload.

I am not sure if that's the answer you're looking for, but I don't think you should mess with thread pools and the like when all you really want is dependable parallelism. No job is too small for Quartz.

With best regards,

Anton.
 
Paul Clapham
Marshal
Posts: 28193
Eclipse IDE Firefox Browser MySQL Database

Anton Golovin wrote:Hi, Mike,

Don't reinvent the wheel...



Actually Mike wasn't reinventing the wheel when he originally started this thread, nearly 13 years ago. Nowadays we would just set up an ExecutorService and be just about done, but in those pre-Java 5 days we didn't have that luxury. Hence the question, and the reference to Doug Lea, who did a lot of the design work underlying the java.util.concurrent classes.
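For the record, a minimal sketch of that modern ExecutorService approach; the counter is a stand-in for the real per-row work (queries, templating, email):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ExecutorDemo {

    // Submit one task per primary key to a fixed-size pool, then wait for
    // the pool to drain. Returns the number of rows processed.
    static int processAll(int rowCount, int poolSize) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int pk = 0; pk < rowCount; pk++) {
            pool.execute(processed::incrementAndGet);  // stand-in for per-row work
        }
        pool.shutdown();                        // accept no new tasks, finish the queued ones
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(processAll(1000, 4));  // 1000
    }
}
```

The pool plays the role of the old hand-rolled worker threads: a single internal work queue, a fixed number of threads pulling from it.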
 
Marshal
Posts: 79177
Would you use a Stream and its parallel() method nowadays?
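If the primary keys were fetched up front, that might look like the following sketch (the even-number filter is a toy stand-in for the real per-row processing):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelStreamDemo {

    // Process the keys in parallel and count how many "needed an email".
    static long processAll(List<Integer> keys) {
        return keys.parallelStream()
                   .filter(pk -> pk % 2 == 0)  // stand-in for per-row logic
                   .count();
    }

    public static void main(String[] args) {
        List<Integer> keys = IntStream.rangeClosed(1, 1000)
                                      .boxed()
                                      .collect(Collectors.toList());
        System.out.println(processAll(keys));  // 500
    }
}
```

One caveat: parallel streams run on the shared common ForkJoinPool, which is sized for CPU-bound work; per-row database queries and email sending are blocking IO, so a dedicated ExecutorService may still be the better fit for the original problem.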
 
Paul Clapham
Marshal
Posts: 28193
Eclipse IDE Firefox Browser MySQL Database
That's another "wheel" which was even more unknown 13 years ago.
 