Win a copy of Head First Agile this week in the Agile forum!
  • Post Reply Bookmark Topic Watch Topic
  • New Topic

How to handle Large Lists  RSS feed

 
Paul Jacob
Greenhorn
Posts: 15
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Hi,
I have an application where I get a very large list (approx range would be between 10,000 to 500, 000) of objects. I have to delete around 10-100 objects from this list around 10 times and also randomly access objects by index. So my basic question is should i decorate this list with any particular kind of list (I don't know the type of the list that is returned from the db since I am using spring jdbc) from java collections or apache collections?

help is much appreciated.
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
6
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
How do you locate items to be removed?

What is the relation between deleting objects and randomly accessing objects by index? Is there a sequence of events or are they mixed.

Bill
 
Paul Jacob
Greenhorn
Posts: 15
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Here is the sequence :
1) Retrieve large list from the db
2) create a new list which just has indexes of the above list (1...n)
3) shuffle this list
4) generate a random number between 1 and n and use this number to index into the shuffled list and get a value
5) this returned value is used to index into the large list ... an object is retrieved and and based on the value of this object, some similar objects are deleted from the large list
6) repeat from 2 to 5 (around 10 times).

As can be seen from the above, there are two lists a large object list from which items are deleted and an equally large list which just has integers but needs to be shuffled once.

The first list is obtained from our db layer where we use spring. The second list is ours..which we create and as of now is an arraylist.

Can you suggest any kind of optimization here?
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
6
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Perhaps I am missing something here, but why a second list, why not an array.

Since you know the size you will need you don't require the auto expansion of a list.

There is also no need that I can see for two randomizations - shuffle and random N.

based on the value of this object, some similar objects are deleted from the large list


How are these similar objects located?

Bill
 
Paul Jacob
Greenhorn
Posts: 15
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
This is kind of like a lottery application where the original large list is a list of lottery tickets and some random tickets need to be chosen from it.
So the idea of deletion is that when a lottery is chosen, I have to delete all tickets owned by the user whose ticket was chosen so that the next draw will not have his tickets. I am using the apache collection utils filter predicate for deletion.
I don't want to use an array for the shuffle because of the ready made shuffle method available in the collections class. I am ensuring that the list has the capacity while creating the list itself.
so my question remains....is arraylist a good enough list.. for the shuffle?

 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
6
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Near as I can tell ArrayList is the simplest / least overhead implementation of List that you can use.

If it is your program that is doing the "draw" - why cant it just ignore hits that correspond to an existing winner and do another "draw"?

The overhead of iterating through the list and removing items must be huge.

Bill
 
Tim Holloway
Bartender
Posts: 18720
72
Android Eclipse IDE Linux
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
William Brogden wrote:Near as I can tell ArrayList is the simplest / least overhead implementation of List that you can use.

If it is your program that is doing the "draw" - why cant it just ignore hits that correspond to an existing winner and do another "draw"?

The overhead of iterating through the list and removing items must be huge.

Bill


If the application needs to access list elements randomly, a Map is MUCH faster. If you need both fixed-sequence AND random access, I believe the TreeMap will provide that in a more efficient form.
 
Ireneusz Kordal
Ranch Hand
Posts: 423
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Morgan Stanley wrote: This is kind of like a lottery application where the original large list is a list of lottery tickets and some random tickets need to be chosen from it.
So the idea of deletion is that when a lottery is chosen, I have to delete all tickets owned by the user whose ticket was chosen so that the next draw will not have his tickets.

What is a final result of your algorithm ?
If - on the end - you need only n tickets chosen from 500.000, then it would be much faster to do the whole work on the database server using SQL/stored procedures,
and retrieve only this small final result.
Data transmission (through the network - I assume) is much more slower than copying it from the disk to the memory on the db server.
 
Paul Jacob
Greenhorn
Posts: 15
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Thanks to all for the replies.
The list has to be shuffled and then a random index selected to give a value. Think of this as a pack of cards which we randomly mix and then we pick out a card from it. I don't need the sorting capabilities of a treemap neither can i get the same level of randomness in a stored proc. So I guess the arraylist is the best option ( as william suggested).
 
Paul Clapham
Sheriff
Posts: 22531
43
Eclipse IDE Firefox Browser MySQL Database
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Morgan Stanley wrote:The list has to be shuffled and then a random index selected to give a value. Think of this as a pack of cards which we randomly mix and then we pick out a card from it.


But note that once you've shuffled the list, you have already applied enough randomization. Just taking the first element of the shuffled list (the top card from the shuffled pack) will give you a random element. Choosing some other element doesn't gain you anything.

Or to put it the other way around, if you wanted to choose a random element instead of the first element, then you wouldn't have to shuffle the list in the first place.
 
Pat Farrell
Rancher
Posts: 4686
7
Linux Mac OS X VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I would take a completely different approach, and I sure would not have the whole list in memory. That doesn't scale.

Rather I'd use the Google Collections list functions, specifically the Predicate and perhaps the Function.transform.

Write a Predicate that does the random selection in one pass.

The key to performance using Google Collections is to work on Iterables, not List or Set collections. That way, you don't have to have the whole list in memory at once. Plus, its a ton easier to split this into parallel tasks for parallel cores.
 
With a little knowledge, a cast iron skillet is non-stick and lasts a lifetime.
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!