I was fiddling around with streams and I thought I'd make a histogram out of a DoubleStream.
I programmed 3 different (but equivalent) ways to collect the data. I got some unexpected performance results and I'd like to ask your opinion.
The code is very simple: in the main method of the class, a list of numbers following a Gaussian distribution is created. From this list a stream is created, filtered and collected into a histogram, using 3 methods:
1. I pass supplier, accumulator and combiner
2. I create a Collector and pass it to collect()
3. The collector is created by a static method in the Histogram class (exactly as method 2)
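For reference, the three approaches can be sketched roughly as follows. This is a minimal illustration, not the code from this post: the `Histogram` class, its fixed range [-4, 4), the 8 bins and the names `way1`, `way2` and `way3` are all assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical fixed-range histogram (an assumption, not the poster's class).
class Histogram {
    private final double min, max;
    private final long[] bins;

    Histogram(double min, double max, int nBins) {
        this.min = min;
        this.max = max;
        this.bins = new long[nBins];
    }

    // Accumulator: count one value, ignoring anything outside [min, max).
    void add(double x) {
        if (x < min || x >= max) return;
        int i = (int) ((x - min) / (max - min) * bins.length);
        bins[Math.min(i, bins.length - 1)]++; // guard against rounding at the upper edge
    }

    // Combiner: add the counts of another histogram with the same binning.
    void merge(Histogram other) {
        for (int i = 0; i < bins.length; i++) bins[i] += other.bins[i];
    }

    long[] counts() { return bins.clone(); }

    // Static factory used by method 3.
    static Collector<Double, Histogram, Histogram> collector(double min, double max, int nBins) {
        return Collector.of(() -> new Histogram(min, max, nBins),
                Histogram::add,
                (a, b) -> { a.merge(b); return a; });
    }
}

class HistogramDemo {
    // Method 1: pass supplier, accumulator and combiner directly.
    static Histogram way1(List<Double> data) {
        return data.stream().filter(x -> x > -4 && x < 4)
                .collect(() -> new Histogram(-4, 4, 8), Histogram::add, Histogram::merge);
    }

    // Method 2: build a Collector locally and pass it to collect().
    static Histogram way2(List<Double> data) {
        Collector<Double, Histogram, Histogram> c = Collector.of(
                () -> new Histogram(-4, 4, 8),
                Histogram::add,
                (a, b) -> { a.merge(b); return a; });
        return data.stream().filter(x -> x > -4 && x < 4).collect(c);
    }

    // Method 3: same collector, obtained from the static factory.
    static Histogram way3(List<Double> data) {
        return data.stream().filter(x -> x > -4 && x < 4)
                .collect(Histogram.collector(-4, 4, 8));
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<Double> data = Stream.generate(rnd::nextGaussian)
                .limit(100_000).collect(Collectors.toList());
        System.out.println(Arrays.toString(way1(data).counts()));
        System.out.println(Arrays.toString(way2(data).counts()));
        System.out.println(Arrays.toString(way3(data).counts()));
    }
}
```

All three variants do the same work per element, so any large timing gap between them is more likely to come from how they are measured than from the collectors themselves.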
Now the problem is that it takes 1.665 s for method 1, 1.857 s for method 2 and 0.298 s for method 3. I also found that swapping methods 2 and 3 changes the execution times: method 2 drops to about 0.3 s and method 3 rises to about 1.8 s. So I must be missing something here.
Moreover, if I create a parallelStream from the list of doubles, the whole procedure takes about 15 s! Possibly the combine method is not very efficient, but still.
Here is the whole code, so that you can try it out yourselves. I tried to comment it as much as possible.
This is due to optimization performed by the JIT compiler at run time. Code gets optimized only after it has been run once or more, which is why swapping the order changes the result: the first call is the slow one. You should allow the compiler to "warm up", for example:
Here are the results I get:
You can see that there is still a small difference when calling histo3 before histo2. This shows that all warming up should be done before any test, such as in:
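A minimal sketch of that ordering, assuming two hypothetical stand-in workloads (`countA` and `countB`) in place of histo2 and histo3, and an arbitrary 50 warm-up rounds; every variant is warmed up before any timing starts:

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class WarmUpDemo {
    // Stand-in workloads (assumptions): two equivalent ways to count values in [-1, 1).
    static long countA(List<Double> data) {
        return data.stream().filter(x -> x >= -1 && x < 1).count();
    }

    static long countB(List<Double> data) {
        return data.stream().filter(x -> x >= -1 && x < 1).collect(Collectors.counting());
    }

    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static double timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1e6;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        List<Double> data = Stream.generate(rnd::nextGaussian)
                .limit(200_000).collect(Collectors.toList());

        // Warm up ALL variants first, so the shared stream machinery is
        // already compiled no matter which one is timed first.
        for (int i = 0; i < 50; i++) {
            countA(data);
            countB(data);
        }

        // Only now take the measurements.
        System.out.printf("countA: %.3f ms%n", timeMillis(() -> countA(data)));
        System.out.printf("countB: %.3f ms%n", timeMillis(() -> countB(data)));
    }
}
```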
With this configuration, the results are the same whether you call histo2 or histo3 first.
Also note that, in theory, this code should not warm up the compiler at all: it could detect that the calls to histo1, histo2 and histo3 have no effect besides returning their results, and that these results are not used, so the method calls inside the for loops could simply compile to nothing. That does not happen here for some reason, but it sometimes does, in which case you have to do something with the results of the warm-up calls.
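One way to "do something with the results" is to fold them into a value that is printed at the end, so the calls have an observable effect. The sketch below assumes a hypothetical helper `run` and a toy workload; it is not taken from the code in this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToLongFunction;

class ConsumedWarmUp {
    // Calls the task repeatedly and adds every result to a checksum.
    // Because the checksum is returned and later printed, the compiler
    // cannot treat the calls as dead code.
    static <T> long run(ToLongFunction<T> task, T input, int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            checksum += task.applyAsLong(input);
        }
        return checksum;
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(-1.5, 0.2, 0.7, 2.4);
        // Toy workload (assumption): count the positive values.
        long checksum = run(d -> d.stream().filter(x -> x > 0).count(), data, 1_000);
        System.out.println("warm-up checksum: " + checksum);
    }
}
```

Benchmark harnesses such as JMH take care of both the warm-up and the consumption of results, which is usually more reliable than hand-written timing loops.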
posted 3 years ago
Thanks to Pierre, I ended up with this version. I have written a Collector that extends DoubleSummaryStatistics, because before creating the histogram one needs to know the min and max of the distribution. The HistogramCollector stores the data in a LinkedList so that it can go through it again once the summary statistics are known.
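The idea can be sketched as below. Only the name HistogramCollector, the DoubleSummaryStatistics superclass and the LinkedList come from the description above; the method names `merge`, `histogram` and `collector` and the binning rule are assumptions, so treat this as an illustration rather than the actual class.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.LinkedList;
import java.util.List;
import java.util.stream.Collector;

class HistogramCollector extends DoubleSummaryStatistics {
    // Raw values are kept so they can be binned once min and max are known.
    private final List<Double> values = new LinkedList<>();

    @Override
    public void accept(double value) {
        super.accept(value);   // update count, min, max, sum
        values.add(value);     // remember the value for the binning pass
    }

    // Merges statistics and stored values (hypothetical helper, used as the combiner).
    void merge(HistogramCollector other) {
        super.combine(other);
        values.addAll(other.values);
    }

    // Second pass: bin the stored values into nBins equal-width bins over [min, max].
    long[] histogram(int nBins) {
        long[] bins = new long[nBins];
        double min = getMin();
        double width = (getMax() - min) / nBins;
        for (double v : values) {
            int i = width == 0 ? 0 : (int) ((v - min) / width);
            bins[Math.min(i, nBins - 1)]++; // the maximum falls into the last bin
        }
        return bins;
    }

    static Collector<Double, HistogramCollector, HistogramCollector> collector() {
        return Collector.of(HistogramCollector::new,
                HistogramCollector::accept,
                (a, b) -> { a.merge(b); return a; });
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(0.0, 1.0, 2.0, 3.0, 4.0);
        long[] bins = data.stream().collect(collector()).histogram(2);
        System.out.println(Arrays.toString(bins));
    }
}
```

Storing every value costs memory proportional to the input size; that is the price of not knowing the range up front.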