This week's book giveaway is in the Other Languages forum.We're giving away four copies of Functional Reactive Programming and have Stephen Blackheath and Anthony Jones on-line!See this thread for details.
Win a copy of Functional Reactive Programming this week in the Other Languages forum!

Challenge: Optimum way to randomize 1,000,000 numbers

Thomas Paul
mister krabs
Ranch Hand
Posts: 13974
See: http://www.javaranch.com/cgi-bin/ubb/ultimatebb.cgi?ubb=get_topic&f=33&t=005061
Here's the challenge. We need an array of 1,000,000 entries that contains all the numbers from 1 to 1,000,000 in a random order. There will be no duplicates. We are looking for the optimum way to do this that will perform decently (less than 1 minute) on a standard home PC.
This will, as far as I know, serve no useful purpose, but it is driving me nuts because I am sure that there must be a simple, elegant soultion to this.
Winner will get a hearty congratulations!

Zakaria Haque
Ranch Hand
Posts: 60
public class MegRandom {
static int size = 1000000;
static int[] arr = new int[size];

public static void main(String[] args) {
long start = System.currentTimeMillis();
Random rand = new Random();
int counter = 1;
while(counter <= size) {
int index = rand.nextInt(size);
if(arr[index] == 0) {
arr[index] = counter++;
}
}
System.out.println("time took(second) "+ ((System.currentTimeMillis() - start)/1000));
}
}
in took 7 sec in my machine.

Thomas Paul
mister krabs
Ranch Hand
Posts: 13974
1,000,000 took 5 seconds with 16 million iterations of the while loop.
10,000,000 took 50 seconds with 161 million iterations.
It seems very arithmetic in its progression which means that it should be scalable.
For some reason I thought this kind of checking would be slow. But it seems that it is cheaper to waste CPU cycles on the useless iterations than it is to try to avoid the useless iterations!

Thomas Paul
mister krabs
Ranch Hand
Posts: 13974
So is this the limit? Is it possible to get 10 million down to 20 seconds?

David Weitzman
Ranch Hand
Posts: 1365
Thomas, can you provide an official test that verifies there are no duplicates to apply to all candidates? Heh heh. How long do you think It'll take to run it?

Ravi Veera
Ranch Hand
Posts: 127
So is this the limit? Is it possible to get 10 million down to 20 seconds?

2 options worth exploring.
1. Native compilation of the code using GCJ or any other such tool. I can't seem to download it from work. Will give it a try at home.
2. Don't know much about JNI. Is it possible to get/read back a C array into a Java array? A C function may simply be faster because it doesn't perform an Array bounds check each time the array is referenced.

Rob Ross
Bartender
Posts: 2205
How about 10 million in 4 seconds?

You can up the RANDOMIZE_FACTOR to make more than one randomization sweep. It will take longer, but be more random.
Also, I wanted to find out that magic xor swapping algorithm whereby you can swap two values without an intermediate variable and see how much faster that version is. But after investigating it, it seems not to be any faster. I suppose the compiler does a better job now of optimizing my loop than I can with any old assembler tricks
[ April 02, 2002: Message edited by: Rob Ross ]

Greg Brouelette
Ranch Hand
Posts: 144
Thomas, can you provide an official test that verifies there are no duplicates to apply to all candidates?

How about listing all the numbers 1 to 1,000,000 in a collection. Then randomly pick 2 entries and swap them. Do this 2 million times and it should pass a chi square test for randomness with absolute certanity that no 2 numbers are repeated.
Or, start with the same collection and loop from 1 to 1,000,000. Pick a random int between 1 and 1,000,000 that's not equal to the loop index and swap those. That would should pass a chi square test with only 1 million iterations. And once again you would know that no 2 numbers repeat.

Jim Yingst
Wanderer
Sheriff
Posts: 18671
The XOR trick is to replace

with

I don't imagine it will affect your perfomance substantially - it's mostly useful for avaiding the additional storage space of the temp varaible. If we were dealing with arrays of large objects passed by value, this might be significant - but I don't see it being relevant here. If anything, the XOR will be a bit slower, as it requires two more array accesses and three XOR computations.
I like the method, Rob - I'll need to look at it some more to convince myself whether a RANDOMIZE_FACTOR of 1 is sufficient to be fully random (i.e. a completely uniform distribution among all possible permutations) - I think it may be.

Greg Brouelette
Ranch Hand
Posts: 144
Using the 2nd approach from my last post the following code randomized 1000000 numbers in 343 milliseconds.

The output was

And here's the result with 10,000,000 numbers:
randomizing 10000000 numbers took : 3547 milliseconds
[ April 02, 2002: Message edited by: Greg Brouelette ]

Rob Ross
Bartender
Posts: 2205
Greg, your algorithm is exactly the same as mine, although you're peforming an extra check to skip swapping an element with itself.
I'd be curious to know how often it happens that you choose a random value that is the same as the index value. My guess is you have a one-in-initialCapacity chance of this occuring, or 1-in-a-million. Those aren't very good odds, do you think?
Yet you have an additional loop test being made for every iteration. It seems this is pretty wasteful, and it would be more efficient to allow a redundant swap to occur than running code that tries to prevent it.
Also, what is the "chi-square" test?
[ April 02, 2002: Message edited by: Rob Ross ]

Greg Brouelette
Ranch Hand
Posts: 144
I totally missed that your program was the same logic. The RANDOMIZATION_FACTOR is just the number of times you do the swap process right? Now I understand. I should have paid attention better.
Taking the j == x test out actually speeds the process up. I've got it down to 312 milliseconds for 1 million ints. Good catch!
Re: Chi-square, I'll search for a reference to this and post a URL when I find it. But as I remember it from my probabilities class (years and years ago) it's a way to take a sample of a series of numbers and determine their degree of "randomness". A good chi-square result let's you know how good your randomization method is.
In any case, when I saw that someone was running a program that took 20 seconds to do 1 million ints I just knew Java was faster than that. But 343 milliseconds surprised me! That's why I had it print out the first 100 entries. I had to convince myself that it really did what it was suppose to do.
I'll let you know when I find a good URL for that test.
[ April 02, 2002: Message edited by: Greg Brouelette ]

Greg Brouelette
Ranch Hand
Posts: 144
There's a pretty basic explaination of chi-square here http://www.mste.uiuc.edu/patel/chisquare/intro.html
(Geez, It even has cartoons :roll: )
But I don't know if I have the time or desire to write a program that does a chi-square test on our output code. It looks pretty darn random to me
Removing the if(j!=x) test and changing from creating a Date class to simply using System.currentTimeMillis(); (as you did) got the time for 1 million ints down to 297.
Switching to the XOR method brought it back up to 313. That seems odd to me because if it was assembler code an XOR is WAY faster than an assignment. So it must be something in the way that either the compiler or the JVM is handling it.
OK, less than 300 milliseconds. That's pretty darn fast.

Rob Ross
Bartender
Posts: 2205
I read this, but I don't think that chi-square is applicable to the question of "how random" is this distribution.
We know that there is only one unique value of every possible value in our domain, ie, 1-1 million. So there is a completely even distribution of these values.
The question is, how "random" is the position of these values after they have been shuffled?
I'm not a statistics or probability expert, heck, I'm not even a probability beginner, but it seems to me that the ordering of the integers will be as random as the Random number generator class is. So any testing or conclusions about the randomness of the ordering is really about the randomness of the RNG being used by Random, which of course is *really* only a pseudo-random generator.
But as I said earlier, I don't think that chi-square is a useful metric here. But that begs the question, what is? How do you test the ordering and determine how "random" it is? How do you define this anyway? I'm out of my league here, but I think if you could show that there was some trend, for example, that numbers didn't seem to migrate very far from their original location, then that would be a symptom of a non-true-random order.
But there must be some formal way of testing randomness. I am curious now what that might be.
[ April 02, 2002: Message edited by: Rob Ross ]

Jim Yingst
Wanderer
Sheriff
Posts: 18671
I've already addressed the XOR issue. If you were swapping three local variables it might be faster, but those array accesses are more expensive than the otheractivities, I think.
Regarding the distribution - actually I can prove that the distribution from this method cannot possibly be exactly uniform. Consider an array size of three. Using a RANDOMIZATION_FACTOR of 1 means that the swapping loop will make three passes, each time generating a random number from 0 to 2. That yields 27 equally probably outcomes. But there are exactly 6 different permuations of three numbers - 123, 132, 213, 231, 312, 321. Because 27 is not divisible by 6, these six permutations cannot be equally likely. A quick hand tally seems to indicate that 123, 312, and 321 each have a probability of 4/27, while 132, 213, and 231 each have probability 5/27. I could easily have made an error in my tally - but the fact remains that theres no way for all six permutations to be equally likely here. However it is reassuring that they do seem to come out as close as reasonably possible to uniform probability. I'd have had severe doubts if one had come out to 2/27, for example. My feeling at this point is that the overall distribution from this method is pretty good really, and probably gets better for larger array sizes. I just thought it worth noting that it couldn't possibly be exactly uniform.
Rob, you're right that the random number generator itself may be a limit to the randomness of the outcomes. The preceding discussion is addressing limitations of the shuffling algorithm, independent of the random number generator.
The chi-square test seems applicable to me. Imagine re-running the generator many times, and building a table of how many times each particular permutation shows up. (This is quite infeasible for permutations of millions of elements - but certainly possible for permutations of, say, six elements.) Chi-square basically looks at how often each of those permutations actually shows up, compared to the expected value of 1/720 of the time. Seems useful to me - I'd be interested in seeing how this algorithm fares for the larger arrays. But, I'm too lazy right now to do it myself, so I'll be content to kibbitz from the sidelines.

Rob Ross
Bartender
Posts: 2205
Interesting. I don't know enough about this subject to even know what I don't know, so all I can say is..."interesting."
If someone can condense the algorithm for writing a chi-square test against this array sequence, I'll code up some analysis tests.

Greg Brouelette
Ranch Hand
Posts: 144
I've already addressed the XOR issue. If you were swapping three local variables it might be faster, but those array accesses are more expensive than the otheractivities, I think.

You know what else it may be? a^=b; is both an XOR and an assignment. Whereas
int temp = a;
a = b;
b = temp;
is just 3 assignments. So now that I think of it 3 assignments would be faster than 3 XOR's and 3 assignments.
But for anyone who complains that Java is slow I'd like to show them this.

Thomas Paul
mister krabs
Ranch Hand
Posts: 13974
Rob's method handled 10,000,000 in 4.5 seconds. That is the limit of my PC as anything beyond that generates an OutOfMemory error.
A hearty congratulations to Rob for an impressive demonstration of the speed of Java in the hands of a clever programmer!

Rob Ross
Bartender
Posts: 2205
Well I wrote a multi-threaded version of this randomizer. I didn't expect it to be faster on my single CPU machine, and it's not - I get 10Million in about 4 seconds with the single-threaded version, and about 10 seconds with the 2-threaded version.
But, if anyone has access to a multi-cpu machine, I am curious to see which of these two methods is faster on that. If the OS and the JVM are well tuned, I would expect that both threads could each get a Cpu, and if that happened, my multi-threaded version should be faster than the single threaded version.
SO...does anyone have a 2 cpu machine they want to test this on??
Thanks.
[ April 04, 2002: Message edited by: Rob Ross ]

Jim Yingst
Wanderer
Sheriff
Posts: 18671
As a belated followup: here's the code from the Collections class:

It's easy to see that this is basically the same method as Rob used, except that the size of the region being randomly chosen from is shrinking. The API describes it thus:
This implementation traverses the list backwards, from the last element up to the second, repeatedly swapping a randomly selected element into the "current position". Elements are randomly selected from the portion of the list that runs from the first element to the current position, inclusive.

The API also claims that for a perfect random number source, all possible permutations are selected with uniform probability. This seems plausible - for the simple case of 3 items, a bit of hand-counting reveals that the six permutations 123, 132, 213, 231, 312, 321 all occur with equal probability. (Compare this to my last post, showing the results for Rob's algorithm.) For n items, we can see that the number of possible outcomes for choosing nextInt() n times is n!, which is the same as the number of permutations of n items. It's not blindingly obvious that each of these outcomes necessarily maps to a different permutation, but it certainly seems plausible. And Josh Bloch said so, so it must be true.
The original thread here did mention that Collections.shuffle() might be useful. Of course, using Objects its obviously not as fast as it could be for this problem - but it would be simple to adapt this to use ints instead. It should give the same speed Rob's method did (perhaps a tiny bit faster), with a more correct distribution.
[ April 12, 2002: Message edited by: Jim Yingst ]

Thomas Smets
Ranch Hand
Posts: 111
The question one may first ask is :
How many time you need to call for a number ?
Because all the methods are OK but they have some set up time !
While the following has much smaller setUp time but of course any new insert is taking longer & longer & longer & longer...
Apparently, this would go slightly faster for accessing +/- 1000 number (or less). For 10000 numbers the proposed algorithme is getting much much faster !
I recoded the algorithme to be embedded in a class & it runs slightly slower. 1000000 numbers in :

randomizing 1000000 numbers took : 547 milliseconds

Thomas,
[ April 15, 2002: Message edited by: Thomas SMETS ]

Jim Yingst
Wanderer
Sheriff
Posts: 18671
Thomas Smets - your algorithm allows duplicates. The idea is to put a group of numbers in a random order, with no duplicates.