Win a copy of Cross-Platform Desktop Applications: Using Node, Electron, and NW.js this week in the JavaScript forum!
  • Post Reply Bookmark Topic Watch Topic
  • New Topic

how to group by size/token in java 8  RSS feed

 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 37180
515
Eclipse IDE Java VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I'm working on a task at work that I said I was going to do in Java 8. (I'm trying to show good use of APIs). For the most part, it's going well. Then I got to one piece of the code. The only way I can think of to do it with streams is via a custom collector which doesn't seem easier than the pre Java 8 way.

This section needs to split up a list of values so that each group is less than a certain size (10 in this example) *and* it can only split the groups on commas.

So the output of this example is:
123, 456, 789
abc, def




The above code works. But it's not using Streams. And it is ugly. Is there a better way? (I do call this method from a stream workflow)

Originally, I was closer. I was trying to do something like this:



The regular expression isn't what I used at work. (That's still on my work machine and I didn't bring it home. I couldn't get it working anyway though so moot point.) If the regular expression approach worked, it would be using streams though.>
 
Scott Selikoff
author
Bartender
Posts: 4087
21
Eclipse IDE Flex Google Web Toolkit
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I don't know about replacing with streams (haven't tried yet...) but you can certainly shorten your loop by a couple of lines:



You could also have a check inside the loop to skip current if current.length() is 0. Not sure how you want to handle strings like "123, , , abc"
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 37180
515
Eclipse IDE Java VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Scott,
I want to see if I can get rid of the loop That isn't my exact code anyway. I re-wrote it at home to ask this because I didn't think about asking for a better way while I was still at work.
 
Campbell Ritchie
Sheriff
Posts: 55284
156
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
You can turn a T[] into a Stream<T> with the Arrays#stream() method. Since you have a String[], you will get a Stream<String>.
The collect method does a mutable reduction but it requires a Collector to do it with. There are all sorts of Collector; you can get them to summate, average, create Lists, Maps, etc etc.
The Collectors class is a utility class which has methods returning a ready‑made Collector or a Collector made from a λ. So I found a simple λ to map a String to its length; that returns an int.
The groupingBy Collector collects the contents of the Stream into a Map, in this case a Map<int Integer, List<String>>.
java SizeGrouper 123 1234 234 2345 23456 9876543 345 3 33 345678 123456789 0 1234567890 12 098765432
{1=[3, 0], 2=[33, 12], 3=[123, 234, 345], 4=[1234, 2345], 5=[23456], 6=[345678], 7=[9876543], 9=[123456789, 098765432], 10=[1234567890]}
Alternative approach: use the partitioningBy collector which gives you a Map<Boolean, List<String>>.
java SizeGrouper 123 1234 234 2345 23456 9876543 345 3 33 345678 123456789 0 1234567890 12 098765432
{false=[123, 1234, 234, 2345, 23456, 345, 3, 33, 0, 12], true=[9876543, 345678, 123456789, 1234567890, 098765432]}
I hope that actually answers the question you were asking, Jeanne You can get the Strings sorted with a .sorted() call before .collect.
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 37180
515
Eclipse IDE Java VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Campbell,
No, that isn't the problem I'm trying to solve. I need the size of the combined string to be less than X characters (but as large as possible). If I wanted each element separately, I could have just used a split. I also need to preserve order.

A custom Collector could solve this. I don't think I'm better off with that though. It certainly isn't a one line lambda because I need to remember the # characters seen so far.
 
Stephan van Hulst
Saloon Keeper
Posts: 7687
139
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
This is a reduction, and the intermediate accumulator (a list of strings) is mutable, so you need a collector. None of the standard collectors will do what you need, so you need a custom collector. If that's not what you want, then you need to stick with procedural code.
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 37180
515
Eclipse IDE Java VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Stephan van Hulst wrote:This is a reduction, and the intermediate accumulator (a list of strings) is mutable, so you need a collector. None of the standard collectors will do what you need, so you need a custom collector. If that's not what you want, then you need to stick with procedural code.

Thanks. That's definitive. It also explains why I couldn't find an alternate approach
 
Stephan van Hulst
Saloon Keeper
Posts: 7687
139
  • Likes 4
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
This is what I would write looking just at your problem description.

[edit]

Thanks for the cow Jeanne!
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 37180
515
Eclipse IDE Java VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
That's perfect!. I could do that with Pattern's stream API too. I like that there isn't an old style loop.
 
Pierre-Yves Saumont
Author
Ranch Hand
Posts: 103
17
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Using the values "123, 456, 78901, abc, def", your original example prints:



This does not meet the requirements (first group has 11 characters)

Using "123, 456, 789, abc, def" with Stephan's version prints:



which does not meet the requirements neither.

Beside this, here is what I would do:



It is a bit verbose due to the limitations of Java. Methods add and addToLast are necessary to use functional style (with Stream.reduce). And the defensive copy in each of these method is necessary to (try without to see the consequence).

Note that I have not implemented the part converting the strings to lists, but this is trivial an uninteresting.

Also note that in the following line:



The third parameter of the reduce method ((a,b) -> a) is incorrect. Replace it with the correct implementation if you want to use parallelization.

This program prints:

[123, 456]
[78901, abc]
[def]

[123, 456, 789]
[abc, def]

I guess this is the intended result.

This program would be much simpler using a functional library as the one I describe in my book "Functional Programming in Java" (Manning)
 
Stephan van Hulst
Saloon Keeper
Posts: 7687
139
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Ah, I assumed that spaces and commas were included in the count. If not, I would just replace String in my example with a custom type that keeps track of Strings joined so far, and the sum of their lengths.
 
Pierre-Yves Saumont
Author
Ranch Hand
Posts: 103
17
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
No doubt your solution is very concise and efficient. Probably one of the best solution for this specific problem.

However, looking at this problem with a functional mind shows it has a main drawback: it is not as generic as it could be. It takes a string as its input and produces a list of list of strings. It does this well, but it can only do this although most of the work could be fully generic, and thus abstracted for later reuse.

Looking at it with a more generic point of view, here is what this program does: It takes a list of objects and returns a list of list of objects, grouping these objects according to a criteria represented by a condition that should be respected. The condition is that a reduced value of each list, obtained with a reduction function and an identity object, should satisfy a given predicate

Here, the list of objects is a list of words. (Although represented by a comma separated string, it is actually a list.) The return value is a list of lines (although represented by a list of comma separated strings). The reduction function is the sum of the length of the strings in a list. The identity is an empty list of list of strings. The predicate takes the reduced value, add the length of the next word, and returns true if it not more that a given value.

In practice, this could be a way to typeset words in lines of a given maximum length. (In reality, we should take in account the number of words in each line, so the real problem might be something different, but this is another story.)

A functional programmer, looking at all these abstractions, would probably implement them in a generic form. That way, if she was to solve this problem again, she could just apply the given function. Such a functional implementation might seem harder to write and/or to understand, although only for imperative programmers. But it is easier to test, much much safer to maintain and evolve, and reusable.

First, lets start with the types. For input, we can use a List<T>. If we want to start from a comma delimited string, we will just have to convert it to a list of String, which is both trivial and problem specific. We however need a type for output. Defining our own type will allow us to write functional methods, for example for adding elements, making using this type in functions much easier. The standard add method on List is not functional because it does not return the modified list. Incidentally, it should not even returned the modified list, but a new list with the added element.

Here is the Output class:



As you can see, it contains a List<List<T>, which will be our result. But it also contains a function from Output<T, U> to U. This is the function that is used to reduce an element of type List<T> to a single value of type U. Nevertheless, this function takes an Output<T, U> as its argument (instead of a List<T>) and applies the reduction to the last list in the List<List<T>> member. The methods addToNew and addToLast perform the addition of an element while taking care of defensive copy and returning the result. This would not be necessary if we where using functional immutable data structures.

To simplify use, I added the reduceNoParallel helper method taking care of stream reduction since parallelization is not wanted (nor possible!). The “combiner” third argument, as Java 8 calls it, is a BinaryOperator used to aggregate the results returned by parallel threads when automatic parallelization is used. We can't use parallelization in this case, but Stream unfortunately does not include a true reduce method without this BinaryOperator. Of course, this would not work with parallelization, but it is not a problem since it is encapsulated in our reduceNoParallel method. This method is static and could be put anywhere, but since it is generic, I put it in the Output class.

To use this class, we must first define the reduce function that we will be used to instantiate an Output: I called this function lengthSum:



As the identity, we will use:



Now, we can define our main function (the core of our program!):



With these two elements, we can solve our business problem:




This prints:



Not only is this implementation fully functional, using only pure functions and never sharing mutable state, but it is also fully generic and may be used as is to solve many other similar problems based on different types.




 
Stephan van Hulst
Saloon Keeper
Posts: 7687
139
  • Likes 2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
There's always a trade-off between being generic and being useful. Your solution could be more generic if it took a source file and a data set, and returned the result of feeding the data set to the interpreted program. Would such a function be useful or performant? Unlikely.

I don't want to sound facetious, but this is a real problem. Java simply isn't built to perform such generic functions with great efficiency. It can't make assumptions about referential transparency, it can't do lazy evaluation without much overhead, and the syntax also isn't made for large pieces of functional code.

It took me a while to figure out exactly what your Output class was doing. It looks to me that it's mostly a wrapper around Lists to make them appear immutable, and the onus of writing most of the grouping is still on the client of the class. I didn't benchmark the application, but I think for a real-life dataset it could be prohibitively expensive, because on average half of the entire input set is copied on each step of the reduction. Note that a wrongly written reduction function could still break the output if it performed operations on the results of last() and values(), which are not defensively copied or wrapped in an immutable collection.

If I were to make a reusable generic algorithm out of this, I would write a collector that looks like this:

For Jeanne's problem, I would call it like this:

[edit]

Oh and please have a cow for getting me revved up again on functional programming in Java. Admittedly, it doesn't take a lot, but I do always enjoy the discussions. You too Jeanne!
 
Winston Gutkowski
Bartender
Posts: 10573
65
Eclipse IDE Hibernate Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Stephan van Hulst wrote:There's always a trade-off between being generic and being useful. Your solution could be more generic if it took a source file and a data set, and returned the result of feeding the data set to the interpreted program. Would such a function be useful or performant?

It strikes me also that your Deque-based solution (which is close to - but better than - what I was thinking of when I read Jeanne's post) is eminently primed for "generification". All you need to do is change String to <T> and provide it with a Stream of 'em ... no?

Admittedly, it doesn't take a lot, but I do always enjoy the discussions. You too Jeanne!

And good for you mate. As you know, I'm not completely on board; but version 8 needs both disciples like you and sceptics like me.

Winston
 
Pierre-Yves Saumont
Author
Ranch Hand
Posts: 103
17
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I am not sure about the trade-off between being generic and being useful. Being more generic is always more useful. There might be a trade off between being generic and being productive. Being generic means potentially solving x times more problems with y times more work. So it is a question of comparing x and y.

Regarding performance, I never run benchmarks. As long as the program is fast enough, I don't care about the performance. If it is not fast enough, it is simple to rewrite some parts in a more efficient way (in terms of speed), keeping the same structure.

But saying that most of the grouping work stays on the client is absolutely not true. This is only because I did none give a fully finished example. Here is what it could look like. First, The Output class (renamed Grouper):



Then the more specific StringGrouper:



The generic Grouper class may of course be reused to define other specific groupers.

And now, the client:




As you see, it couldn't be simpler. If it is not efficient enough, one can just rewrite the necessary methods in another style. But doing so right now would be premature optimization.

Eventually, there is one last huge advantage: it's fun! Many thanks to you and Jeanne for this really interesting challenge!
 
Stephan van Hulst
Saloon Keeper
Posts: 7687
139
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Winston Gutkowski wrote:It strikes me also that your Deque-based solution (which is close to - but better than - what I was thinking of when I read Jeanne's post) is eminently primed for "generification". All you need to do is change String to <T> and provide it with a Stream of 'em ... no?

Almost. You also need a way to tell if a T belongs to the current group or needs to start a new one. So you need a BiPredicate<? super Group, ? super T>. The problem with that is that you're tightly coupling the client code with a specific notion of a Group. If we allow the client to build a summary of a group, the code becomes even more generic. That's why my collector takes a BiPredicate<? super S, ? super T>, where S is the type of a summary of a group. I just finished implementing it, for those interested:

 
Stephan van Hulst
Saloon Keeper
Posts: 7687
139
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Pierre-Yves Saumont wrote:I am not sure about the trade-off between being generic and being useful. Being more generic is always more useful. There might be a trade off between being generic and being productive. Being generic means potentially solving x times more problems with y times more work. So it is a question of comparing x and y.

You are correct. I knew useful was the wrong word as I was typing, but I couldn't think of anything better. Productive is what I was looking for.

Regarding performance, I never run benchmarks. As long as the program is fast enough, I don't care about the performance. If it is not fast enough, it is simple to rewrite some parts in a more efficient way (in terms of speed), keeping the same structure.

Normally I would agree, but performing defensive copies in a reduction operation is something you should avoid from the start. It's like writing the following piece of code to iterate over a list:

No Java professional would take you seriously if you wrote that, even if it's fast enough for the time being. Likewise, reducing with a mutable type should be done with collectors, never with defensive copies. I'll see if I can benchmark your code later on.

But saying that most of the grouping work stays on the client is absolutely not true. This is only because I did none give a fully finished example. Here is what it could look like. First, The Output class (renamed Grouper):

Yes, I see that your code is actually quite similar to my example, except that it's a reduction operation rather than a collection operation. Please have another cow for your efforts

As you see, it couldn't be simpler. If it is not efficient enough, one can just rewrite the necessary methods in another style. But doing so right now would be premature optimization.

I think people are going overboard with the "premature optimization" thing. Even if something is efficient enough now, doesn't necessarily mean that a more efficient method is premature optimization. Collectors were specifically designed for this problem, and looking at them as a first solution should be almost as natural as looking at iterators first to iterate over a list, even if using an indexer is fast enough. I can tell that you're highly fluent in functional languages, but the fact of the matter is that Java isn't purely functional and doesn't have a notion of referential transparency, which makes the use cases for pure reductions rare.

Eventually, there is one last huge advantage: it's fun! Many thanks to you and Jeanne for this really interesting challenge!

This is definitely true. Because of the activity in the Java 8 forum, I picked up Haskell again and I've been enjoying it a lot the last weeks.
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 37180
515
Eclipse IDE Java VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I'm glad this turned into an interesting discussion. Because I turned out not to need to limit the size of the cell/string after all!
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!