Page 375 under "Combining results with collect()" regarding collecting parallel streams with the three-argument collect method:
The book states: "you should use a concurrent collection to combine the results, ensuring that the results of concurrent threads do not cause a ConcurrentModificationException".
But the JavaDoc (http://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#collect-java.util.function.Supplier-java.util.function.BiConsumer-java.util.function.BiConsumer-) specifically states: "Like reduce(Object, BinaryOperator), collect operations can be parallelized without requiring additional synchronization."
It seems to me that what the book states contradicts the JavaDoc. As I read the JavaDoc, collecting with the three-argument collect method will not be concurrent at all if your supplier, accumulator and combiner adhere to the specified rules (though it can still be parallel; see below). Concurrency should only happen when collecting with the one-argument method and a Collector that supports it.
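A minimal sketch of that reading of the JavaDoc: a parallel stream is collected with the three-argument collect method into a plain ArrayList, which is not thread-safe, and the result is compared with a sequential collect. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ThreeArgCollectDemo {

    // Collects 0..n-1 from a parallel stream into a plain, non-thread-safe ArrayList.
    // No synchronization is added: per the JavaDoc, each partial result stays
    // confined to one thread at a time and the combiner merges them afterwards.
    static List<Integer> collectInParallel(int n) {
        return IntStream.range(0, n)
                .parallel()
                .boxed()
                .collect(ArrayList::new, ArrayList::add, ArrayList::addAll);
    }

    public static void main(String[] args) {
        List<Integer> expected = IntStream.range(0, 100_000)
                .boxed()
                .collect(Collectors.toList());
        List<Integer> actual = collectInParallel(100_000);
        // The source is ordered, so encounter order is preserved as well.
        System.out.println("Same elements in the same order: " + actual.equals(expected));
    }
}
```

If the three-argument method really accumulated into one shared list from several threads, an ArrayList would be expected to lose elements or throw here.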
Page 375 under "Using the One-Argument collect() Method":
The book states: "Requirements for Parallel Reduction with collect()", and then lists three rules that it claims must be obeyed for parallel reduction to occur.
The JavaDoc for Collector (http://docs.oracle.com/javase/8/docs/api/java/util/stream/Collector.html) states:
"- For non-concurrent collectors, any result returned from the result supplier, accumulator, or combiner functions must be serially thread-confined. This enables collection to occur in parallel without the Collector needing to implement any additional synchronization. The reduction implementation must manage that the input is properly partitioned, that partitions are processed in isolation, and combining happens only after accumulation is complete.
- For concurrent collectors, an implementation is free to (but not required to) implement reduction concurrently. A concurrent reduction is one where the accumulator function is called concurrently from multiple threads, using the same concurrently-modifiable result container, rather than keeping the result isolated during accumulation. A concurrent reduction should only be applied if the collector has the Collector.Characteristics.UNORDERED characteristics or if the originating data is unordered."
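Which of the two categories a given Collector falls into can be checked through its characteristics() set. The sketch below inspects a few built-in collectors; the class name and the isConcurrent helper are hypothetical, and the comments describe what the reference JDK implementation reports.

```java
import java.util.stream.Collector;
import java.util.stream.Collectors;

public class CollectorCharacteristicsDemo {

    // Hypothetical helper: true if the collector advertises the CONCURRENT characteristic.
    static boolean isConcurrent(Collector<?, ?, ?> collector) {
        return collector.characteristics().contains(Collector.Characteristics.CONCURRENT);
    }

    public static void main(String[] args) {
        // Non-concurrent collectors: partial results are kept isolated and merged by the combiner.
        System.out.println("toList: " + isConcurrent(Collectors.<String>toList()));
        System.out.println("toSet: " + isConcurrent(Collectors.<String>toSet()));

        // Concurrent collectors: documented as "concurrent and unordered" in the JavaDoc.
        System.out.println("toConcurrentMap: " + isConcurrent(
                Collectors.<String, String, Integer>toConcurrentMap(s -> s, String::length)));
        System.out.println("groupingByConcurrent: " + isConcurrent(
                Collectors.<String, String>groupingByConcurrent(s -> s)));
    }
}
```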
It seems to me that the book mixes up concurrent and parallel reduction. As I read the JavaDoc, a reduction can be parallel without being concurrent, whereas a concurrent reduction is always parallel.
Concurrent reduction means that several threads access the same collection container concurrently, which doesn't necessarily happen in a parallel reduction. The three rules the book lists for parallel reductions are actually rules for CONCURRENT reductions, whereas parallel reduction can happen any time, as long as the stream is parallel.
It seems to me that this is exactly one of the important things about streams: when implemented properly, they can be parallel without the developer having to worry about synchronization issues. So this is an important point that the book has got wrong. The paragraph is also fairly confusing to read, because what the book states doesn't really make sense.
Rune Nielsen wrote: It seems to me that the book mixes up concurrent and parallel reduction. As I read the JavaDoc, a reduction can be parallel without being concurrent, whereas a concurrent reduction is always parallel.
I think this sentence summarizes the discussion nicely. This seems to be a semantic argument to me. Generally speaking, concurrent and parallel are often used interchangeably. Yes, you can have a parallel stream with only one thread, but then it's not really behaving like a parallel stream; it's behaving like a serial stream. Likewise, you can have a stream that you declared as parallel and expect to be performed concurrently, but some stream operations can force the stream to be processed in a single-threaded manner.
As for the JavaDocs, I wouldn't rely solely on them. I read numerous Oracle-written discussions, including the JavaDocs, articles, and even the Oracle Tutorials, and to be completely honest, Oracle is a little vague on parallel reductions at times.
The key is that if you need a stream to be processed concurrently, you just have to follow the rules defined by Oracle, which are also listed in the book. Nothing guarantees concurrent or parallel processing, though. For example, the JVM may only allocate one thread for the entire stream. Some developers also use the fork/join framework rather than parallel streams, as it gives you finer-grained control over the thread pool and the number of threads.
Scott Selikoff wrote: I think this sentence summarizes the discussion nicely. This seems to be a semantic argument to me. Generally speaking, concurrent and parallel are often used interchangeably. Yes, you can have a parallel stream with only one thread, but then it's not really behaving like a parallel stream; it's behaving like a serial stream. Likewise, you can have a stream that you declared as parallel and expect to be performed concurrently, but some stream operations can force the stream to be processed in a single-threaded manner.
Thanks for the reply, Scott. I think you misunderstood my point, though. While I agree that a lot of developers use the words "parallel" and "concurrent" interchangeably, they do actually mean different things. And in the case of collecting Java streams, these words have a subtle but definitely distinct difference in meaning:
Parallel and NOT concurrent collecting: Is (or at least can be) processed by many threads in parallel, DOES preserve order, DOES NOT modify the collections used for collecting in several threads concurrently.
Parallel and concurrent collecting: Is (or at least can be) processed by many threads in parallel, DOES NOT preserve order, DOES modify the collections used for collecting in several threads concurrently.
Both use multiple threads! But while concurrent collecting is always parallel, parallel collecting is not always concurrent.
The JavaDocs are actually very precise about this - try and search http://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html and http://docs.oracle.com/javase/8/docs/api/java/util/stream/Collector.html for the word "concurrent" - this word is only used in very few places, and only in relation to concurrent Collectors. I doubt that is a coincidence.
For a practical example, try and run these classes:
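The listings themselves are not reproduced in this copy of the thread. What follows is only a sketch of what such a test might look like, folded into a single class for brevity. The class and method names are assumptions, as is the choice to add UNORDERED next to CONCURRENT so that the stream implementation is allowed to share one container.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collector;
import java.util.stream.IntStream;

public class ParallelVersusConcurrentCollect {

    // Parallel, non-concurrent: each worker fills its own HashSet,
    // and the partial sets are merged with addAll afterwards.
    static Set<Integer> threeArgumentCollect(int n) {
        return IntStream.range(0, n).parallel().boxed()
                .collect(HashSet::new, HashSet::add, HashSet::addAll);
    }

    // Parallel and concurrent: CONCURRENT + UNORDERED allows the framework to let all
    // workers add to ONE shared container. HashSet is not thread-safe, so this
    // collector is deliberately broken and may lose elements or misbehave otherwise.
    static Set<Integer> concurrentCollectorOnHashSet(int n) {
        Collector<Integer, Set<Integer>, Set<Integer>> unsafe = Collector.of(
                HashSet::new,
                Set::add,
                (left, right) -> { left.addAll(right); return left; },
                Collector.Characteristics.CONCURRENT,
                Collector.Characteristics.UNORDERED);
        return IntStream.range(0, n).parallel().boxed().collect(unsafe);
    }

    private static void report(String title, boolean concurrent) {
        System.out.println(title);
        System.out.println("Common ForkJoinPool size pre-collect: "
                + ForkJoinPool.commonPool().getPoolSize());
        Set<Integer> result = concurrent
                ? concurrentCollectorOnHashSet(100_000)
                : threeArgumentCollect(100_000);
        System.out.println("Common ForkJoinPool size post-collect: "
                + ForkJoinPool.commonPool().getPoolSize());
        System.out.println("Resulting set size: " + result.size());
    }

    public static void main(String[] args) {
        report("Parallel non-concurrent collecting test", false);
        report("Parallel and concurrent collecting test", true);
    }
}
```

Because both tests run in one JVM in this sketch, the pre-collect pool size of the second test will usually not be 0, and the pool sizes depend on the machine.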
These classes are identical except that one uses the three-argument collect method and the other uses the one-argument method with a Collector that has the CONCURRENT characteristic. My results from running these are:
Parallel non-concurrent collecting test
Common ForkJoinPool size pre-collect: 0
Common ForkJoinPool size post-collect: 3
Resulting set size: 100000
Parallel and concurrent collecting test
Common ForkJoinPool size pre-collect: 0
Common ForkJoinPool size post-collect: 3
Resulting set size: 94157
As the size of the common ForkJoinPool (which parallel streams use) clearly shows, several threads are used in both tests. But the size of the resulting set also shows that the concurrency issues caused by using non-concurrent HashSets only arise when using a Collector with the CONCURRENT characteristic. That is because the three-parameter collect method does not modify the HashSets used for collecting concurrently, even though the collecting is processed by several threads in parallel.
As stated in my original post, your book fails to explain this subtle difference between the concepts, which is a shame (but not critical, I suppose, since the OCP exam probably doesn't go into this level of detail). Also, the statement "You should use a concurrent collection to combine the results, ensuring that the results of concurrent threads do not cause a ConcurrentModificationException" in the paragraph "Combining results with collect()", regarding collecting parallel streams with the three-argument collect method, is flat-out incorrect, because the three-argument collect method is NOT concurrent, even when the stream is parallel and processed by multiple threads.
The JavaDoc also clearly states this under the three argument collect method: "Like reduce(Object, BinaryOperator), collect operations can be parallelized without requiring additional synchronization."
Parallel but not concurrent collecting
If I (i.e. Java) for instance have two threads available and a stream of 100 elements that I want to collect, but my collection container type doesn't support concurrency, or it is important to maintain the order of the stream, I can still quite easily collect the stream in parallel using both threads, because the accumulator and combiner are associative, compatible, free of side effects, etc. It can for instance be done this way:
1. Split the stream in two streams of 50 elements each.
2. Let each thread create a collection container instance using the supplier.
3. Let each thread collect its own half of the stream into its own container instance.
4. When both threads are done, let one of them use the combiner to combine the two collection container instances and create the end result.
This would be equivalent to the code using the three argument collect method in my post above (of course Java uses a more sophisticated algorithm in reality).
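The four steps above can be sketched by hand with two worker threads. This is only an illustration of the idea under simplified assumptions (a List as the source, exactly two halves); the class and method names are hypothetical, and the real stream implementation works differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiConsumer;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PartitionedCollectSketch {

    static <T, R> R collectInTwoHalves(List<T> source, Supplier<R> supplier,
                                       BiConsumer<R, T> accumulator, BiConsumer<R, R> combiner) {
        // Step 1: split the input into two halves.
        int mid = source.size() / 2;
        List<T> firstHalf = source.subList(0, mid);
        List<T> secondHalf = source.subList(mid, source.size());

        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Steps 2 and 3: each task creates its own container and fills it in isolation.
            Future<R> first = pool.submit(() -> accumulate(firstHalf, supplier, accumulator));
            Future<R> second = pool.submit(() -> accumulate(secondHalf, supplier, accumulator));
            // Step 4: only after both tasks are done, merge the second container into the first.
            R result = first.get();
            combiner.accept(result, second.get());
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    private static <T, R> R accumulate(List<T> part, Supplier<R> supplier,
                                       BiConsumer<R, T> accumulator) {
        R container = supplier.get();
        for (T element : part) {
            accumulator.accept(container, element);
        }
        return container;
    }

    public static void main(String[] args) {
        List<Integer> source = IntStream.range(0, 100).boxed().collect(Collectors.toList());
        List<Integer> result = PartitionedCollectSketch.<Integer, List<Integer>>collectInTwoHalves(
                source, ArrayList::new, List::add, List::addAll);
        // No container is ever touched by two threads at once, and order is preserved.
        System.out.println(result.equals(source)); // prints "true"
    }
}
```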
Parallel and concurrent collecting
If I have the same threads and stream available, but order doesn't matter and my collection container type supports concurrency, I could still choose to collect in parallel as described above, but if I think it would be more optimal, I could for instance also choose to do the collecting this way instead:
1. Create one collection container instance using the supplier.
2. Let both threads collect elements from the same stream at the same time, each by using the accumulator to accumulate elements to the same collection container instance.
3. When both threads are done, the collection container instance will contain the end result, but I will obviously not be able to say anything about the order in which elements were accumulated into it.
This would be equivalent to the code using the one argument collect method with the Collector with the CONCURRENT characteristic in my post above (of course Java uses a more sophisticated algorithm in reality).
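The three steps above can be sketched by hand in the same style. As an assumption for illustration, the "same stream" that both threads draw from is modelled as a shared ConcurrentLinkedQueue, and the container is a concurrent set from ConcurrentHashMap.newKeySet(); the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.BiConsumer;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SharedContainerCollectSketch {

    static <T, R> R collectIntoSharedContainer(List<T> source, Supplier<R> supplier,
                                               BiConsumer<R, T> accumulator) {
        // Step 1: create ONE container; it must tolerate concurrent modification.
        R shared = supplier.get();
        Queue<T> pending = new ConcurrentLinkedQueue<>(source);

        // Step 2: both threads take elements from the same source and
        // accumulate them into the same container at the same time.
        Runnable worker = () -> {
            T element;
            while ((element = pending.poll()) != null) {
                accumulator.accept(shared, element);
            }
        };
        Thread first = new Thread(worker);
        Thread second = new Thread(worker);
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // Step 3: the shared container is the result; no combiner is needed,
        // and the order in which elements were added is unknown.
        return shared;
    }

    public static void main(String[] args) {
        List<Integer> source = IntStream.range(0, 100).boxed().collect(Collectors.toList());
        Set<Integer> result = SharedContainerCollectSketch.<Integer, Set<Integer>>collectIntoSharedContainer(
                source, ConcurrentHashMap::newKeySet, Set::add);
        System.out.println(result.size()); // prints "100"
    }
}
```

Passing HashSet::new as the supplier here would reproduce the kind of lost updates seen in the test results above.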
Both of the above ways of collecting are parallel, but only the second one updates the collection container instance(s) in a concurrent manner.
The CONCURRENT characteristic is an optimization flag that Java can (but doesn't have to) use, and it only comes into play when you use or create a Collector (since you can't specify the characteristic otherwise).
Java can collect streams in parallel no matter whether the CONCURRENT characteristic is specified, but if (and only if) it is specified, and you want Java to optimize using it, the collection container type must support concurrency and the stream or collector must also be unordered. If the collection container type doesn't support concurrency, the results of collecting will not be deterministic (as can be seen by running the code in the post above), and if the stream or collector isn't unordered, Java will not use the CONCURRENT characteristic for optimization at all (since order can't be guaranteed when doing this type of optimization).
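For comparison, here is a sketch of a hand-written Collector that meets those conditions: it declares CONCURRENT and UNORDERED and uses a container that actually supports concurrent modification. The class name and the toConcurrentSet factory are hypothetical; ConcurrentHashMap.newKeySet() is the standard Java 8 way to get a concurrent Set.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collector;
import java.util.stream.IntStream;

public class ConcurrentSetCollectorSketch {

    // Hypothetical factory for a collector that is safe to use concurrently.
    static <T> Collector<T, Set<T>, Set<T>> toConcurrentSet() {
        return Collector.of(
                ConcurrentHashMap::newKeySet,   // container supports concurrent adds
                Set::add,
                (left, right) -> { left.addAll(right); return left; },
                Collector.Characteristics.CONCURRENT,
                Collector.Characteristics.UNORDERED);
    }

    static Set<Integer> collect(int n) {
        return IntStream.range(0, n).parallel().boxed()
                .collect(ConcurrentSetCollectorSketch.<Integer>toConcurrentSet());
    }

    public static void main(String[] args) {
        // Whether or not the implementation chooses the concurrent strategy,
        // no elements are lost because the container is thread-safe.
        System.out.println("Resulting set size: " + collect(100_000)); 
    }
}
```

The combiner is still supplied because the implementation remains free to fall back to the partition-and-merge strategy, for example on a sequential stream.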