Win a copy of Functional Reactive Programming this week in the Other Languages forum!

# Collections.sort(List, Comparator)??

eric sato
Ranch Hand
Posts: 37
hi all,
Is there any formula to calculate [the worse case] the number of times accessing the method c.compare when Collections.sort(List list, Comparator c) is used?
Best Regards,
sato

John Smith
Ranch Hand
Posts: 2937
Is there any formula to calculate [the worse case] the number of times accessing the method c.compare when Collections.sort(List list, Comparator c) is used?
Since you are using this particular method that takes a comparator as a parameter, then you could simply increment a counter in the compare() implementation of your comparator. That would be an experimental way of doing it. You can also do it theoretically: the javadocs tell you that the sorting algorithm is a modified mergesort which offers guaranteed nlog(n) performance. So, if you have 100 elements in your list, your worst case number of invocations of compare() would be 100 * log(100).

eric sato
Ranch Hand
Posts: 37
hi,
Yes i read that javadoc too. I tested on list with size 8 [all in strings] and do sorting according each of the string length.
The max counter hits is 17.
Thus the formula for n log(n), 8 log 8 = 7.2247 . it sounds like not really accurate.
Perhaps is this n log(n) is just a average running times.
Pls consult and thanks again.
Sato

Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
I think the "guarantee" is worst case, guarantees this or better. How would we test the worst case? Different sort algorithms have different weaknesses ... I recall that one (quicksort?) which is usually relatively quick bogs down horribly if the input is already sorted. Data makes so much difference that we had mainframe sort products that sampled the input to decide what algorithm to use.

Jim Yingst
Wanderer
Sheriff
Posts: 18671
Mmmm, I'd beware of interpreting performance guarantees this way. Often when they say something like "n * log(n) performance" they mean that performance will be proprortional to n * log(n) (or whatever other formula they provide). Note that they don't generally specify the proportionaility constant. And note that the formula they provide may well omit some terms which are deemed less significant. E.g. a more precise formula might be something like
A * (n + B) * (log(n) + C)
where A, B, C are constants. B and C will be insignificant if n is large enough, so they're often ignored. A is not so insignificant, but it's generaly ignored too, since we're usually more interested in the shape of the graph than in its actual values. (Is it linear? Quadratic? Exponential?)
Note also that log(n) generally means logarithm base 2 of n. Though this doesn't really matter much in most cases, because log base 2 or n = log(n) / log(2) (using whatever other type of log you want to use, base 10 or natural log) so log base 2 is proportional to log base 10, which means the proportionality constant could just be included in the factor A which we're generally ignoring anyway.
As Stan noted, we also need to pay attention to whether a formula is for worst-case behavior, or typical. In this case the API doesn't really say, though I believe it's both, for a mergesort.
Anyway, all this was just to say that if you want an exact formula for the number of comparisons, you can't necessarily trust the API guarantees, because the language they use is vague enough to hide numerous factors.
Now it turn out that in this case, the formula they provide it pretty accurate, and A = 1 (or pretty close) once you calculate log base 2 correctly. Here's a test program:

Note that if you comment out the shuffle(), you get much better performance. And if you use add(0, new Integer(i)) to create the list in reverse order initially, performance is still pretty good. These are two common cases which could have yeilded worst-case performance for some algorithms, but give good performance here.
So - the formula n * log(n) is in fact pretty accurate here. But beware of taking such formulas too literally in the future - there are often other issues being hidden.

Herb Schildt
Author
Ranch Hand
Posts: 253
6
I want to echo what Jim is saying about the meaning of n log(n) in the documentation describing Collections.sort(). In this context, n is the number of elements being sorted. As Jim states, it means that Collections.sort() runs proportional to n log(n). Thus, n log(n) describes in a general way how the runtime performance of a sort is affected when n increases. (Or, as Jim states, it describes the shape of the curve.) Of course, the main ingredients of sorting are comparisons and exchanges, so how often these occur dictiates the overall performance of the sort.
Different sorting algorithms have different performance curves. Some sorts, such as the Bubble sort, are very poor, with performance proportional to n-squared. This means that the Bubble sort's performance is proportional to the square of the number of elements being sorted. This makes the Bubble sort unacceptable for sorting anything but very small sets of data.
By contrast, the best general-purpose sorts run proportional to n log(n). n log(n) is a much smaller value than n-squared, especially for large n values. Thus, when the API documentation describes Collections.sort() as "guaranteed n log(n) performance", it means its runtime is proportional to n log(n). This puts it the "best-practices" category of generalized sorts.

John Smith
Ranch Hand
Posts: 2937
It's been a while since I took my "Algorithms" course, so I looked it up in my Algorithms in Java book. Among other things, it states these two:
• Mergesort sorts [...] N elements in time propotional to N log N
• Mergesort requires [...] N lg N comparisons to sort [...] N elements (note the difference between log and lg).

• Now, since the original poster question was about the number of times the compare() method is invoked in the worst case scenario, I would still maintain that the answer is NlgN (well, I did say that it was NlogN, but I accept Jim's comment about the base of the logarithm). I experimented a bit with Jim's code to prove it, but I had a hard time coming up with the worst case scenario: mergesort is a "divide and concur" approach, so sorting the list in the opposite order is not the worst case scenario.
Anyway, I added another loop to Jim's code to see how the number of compare() invocations scales with the size of the list, and got some very persistent results for the lists with the following structure:
10000, 1, 99999, 2, 99998, 3, ...
Here are the results:
Elements Actual Expected Ratio
-------- ------ -------- -----
100000 1233791 1660964 0.742816198760518
110000 1367803 1842185 0.7424891509035242
120000 1502639 2024720 0.7421462071380728
130000 1639211 2208459 0.7422417289476446
140000 1772751 2393309 0.7407111606788924
150000 1909659 2579190 0.7404102332803668
160000 2049471 2766033 0.7409420904730387
170000 2189003 2953779 0.7410853744876023
180000 2331231 3142374 0.7418691917289426
190000 2473755 3331771 0.7424743686198316
200000 2617583 3521928 0.7432244297661377
The "expected" column is size * Math.log(size) / Math.log(2) (i.e. the log is base 2, as Jim suggested). The ratio is still well below 1, which tells me that my worst case is not bad enough.
[ October 07, 2003: Message edited by: Eugene Kononov ]

Jim Yingst
Wanderer
Sheriff
Posts: 18671
Now, since the original poster question was about the number of times the compare() method is invoked in the worst case scenario, I would still maintain that the answer is NlgN
I think that's probably correct; I just objected to accepting the API's word for it, given that the API wording was inherently vague.
The ratio is still well below 1, which tells me that my worst case is not bad enough.
Particularly given that you can use Collections.shuffle() and easily get results around .95 for N > 100000. Your test cases are just too orderly, it seems.