
Collating data using Streams

 
Tim Cooke
Sheriff
Posts: 3293
153
Clojure IntelliJ IDE Java
Howdy Ranchers! I have a data processing puzzle I need to solve.

I have a record of shape Collection<Map<String, String>> that represents a bunch of records where each record has the fields "Name" and "Likes". It might look like this:

[{"Name":"Tim","Likes":"CodeRanch"},{"Name":"Tim","Likes":"Twitter"}]

and the collation part would be that it should transform into

[{"Name":"Tim","Likes":["CodeRanch","Twitter"]}]

where the Likes are grouped by the Name.

I have some example code that I know is wrong but serves as a starting point. As expected, the JUnit assertions fail:

java.lang.AssertionError:
Expected: is <[PersonalLikes{name='Tim', likes=[CodeRanch, Twitter]}, PersonalLikes{name='Cathal', likes=[StackOverflow]}]>
    but: was <[PersonalLikes{name='Tim', likes=[CodeRanch]}, PersonalLikes{name='Cathal', likes=[StackOverflow]}, PersonalLikes{name='Tim', likes=[Twitter]}]>


The whole test code is posted below (you'll need JUnit and Google Guava to run it). Can anyone suggest a nice clean way to achieve what I'm after?

 
Stephan van Hulst
Bartender
Posts: 6583
84
How about this?
You need to add these methods to your builder:
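A minimal sketch of how a builder-based collector for this could fit together. The method names withProperties(), combine() and collateLikes() are the ones mentioned in the replies; everything else (the fields, build(), collector(), and the PersonalLikes class itself) is a guess at the shape of the original test code, not a copy of it.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collector;
import java.util.stream.Collectors;

// Hypothetical stand-in for the PersonalLikes class from the test code.
final class PersonalLikes {
    final String name;
    final List<String> likes;

    PersonalLikes(String name, List<String> likes) {
        this.name = name;
        this.likes = new ArrayList<>(likes);
    }

    @Override
    public String toString() {
        return "PersonalLikes{name='" + name + "', likes=" + likes + "}";
    }

    // Mutable accumulator that the collector fills in.
    static final class Builder {
        private String name;
        private final List<String> likes = new ArrayList<>();

        // Accumulator step: fold one input record into this builder.
        Builder withProperties(Map<String, String> properties) {
            name = properties.get("Name");
            likes.add(properties.get("Likes"));
            return this;
        }

        // Combiner step: merge a partial result, refusing to mix different names.
        Builder combine(Builder other) {
            if (name != null && other.name != null && !name.equals(other.name)) {
                throw new IllegalArgumentException("Different names: " + name + ", " + other.name);
            }
            if (name == null) {
                name = other.name;
            }
            likes.addAll(other.likes);
            return this;
        }

        PersonalLikes build() {
            return new PersonalLikes(name, likes);
        }
    }

    // Custom collector: supplier, accumulator, combiner, finisher.
    static Collector<Map<String, String>, Builder, PersonalLikes> collector() {
        return Collector.of(Builder::new, Builder::withProperties, Builder::combine, Builder::build);
    }

    // Group the records by name, reducing each group with the collector above.
    static List<PersonalLikes> collateLikes(Collection<Map<String, String>> records) {
        Map<String, PersonalLikes> byName = records.stream()
                .collect(Collectors.groupingBy(r -> r.get("Name"), LinkedHashMap::new, collector()));
        return new ArrayList<>(byName.values());
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = List.of(
                Map.of("Name", "Tim", "Likes", "CodeRanch"),
                Map.of("Name", "Cathal", "Likes", "StackOverflow"),
                Map.of("Name", "Tim", "Likes", "Twitter"));
        System.out.println(collateLikes(records));
    }
}
```

The LinkedHashMap supplier is only there to keep the names in first-seen order; without it the grouping still works, but the order of the result is unspecified.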
 
Stephan van Hulst
Bartender
Posts: 6583
84
You might want to use Hamcrest's containsInAnyOrder() matcher. is() will not work on collections in general, because lists and sets are not mutually comparable.
 
Rob Spoor
Sheriff
Posts: 20819
68
Chrome Eclipse IDE Java Windows
Stephan is right. Your own attempt simply transforms each Map into a PersonalLikes object with one like. There's no code that merges objects like these. Stephan's code uses a Collector that does just that. However, I would have added a check in withProperties that would also check for the name; either the own name should be null, or the new name should be equal to it. The reasoning is that you don't know if one Builder is created for each Map, or if one Builder is used with more than one Map. The following are both valid flows:

1)
* Create new Builder 1, add properties for the first Tim Map.
* Create new Builder 2, add properties for the Cathal Map.
* Create new Builder 3, add properties for the second Tim Map.
* Combine Builders 1 and 3.

2)
* Create new Builder 1, add properties for the first Tim Map.
* Create new Builder 2, add properties for the Cathal Map.
* Add properties for the second Tim Map to Builder 1.

Stephan now assumes that flow 1 is always used, but that's not necessarily the case.
 
Tim Cooke
Sheriff
Posts: 3293
153
Clojure IntelliJ IDE Java
Great information guys. Stephan, your solution works splendidly with my simplified example. I'm now about to apply it to the actual production code I'm working on, which is conceptually the same but with more 'stuff' in the incoming data structure.

I think I get what you're saying there Rob. I'll see if I can write a test to confirm or debunk the theory.

Cows all round
 
Stephan van Hulst
Bartender
Posts: 6583
84
Thanks Tim :)

Rob, flow 2 shouldn't cause any problems, should it? If you add two maps to the builder, it just takes the name of the latter. And in the case of collateLikes(), the maps are guaranteed to have equal names anyway, so flow 2 would overwrite the builder's name with the same name.

I added the check in the combine() method because combining two builders with different names just doesn't make a lot of sense to me.

This was a cool question, I don't often need to write a custom collector :)
 
Rob Spoor
Sheriff
Posts: 20819
68
Chrome Eclipse IDE Java Windows
Stephan van Hulst wrote:Rob, flow 2 shouldn't cause any problems, should it? If you add two maps to the builder, it just takes the name of the latter. And in the case of collateLikes(), the maps are guaranteed to have equal names anyway, so flow 2 would overwrite the builder's name with the same name.

I added the check in the combine() method because combining two builders with different names just doesn't make a lot of sense to me.

I know that the overwrite would use the same name, but I'd simply either have the same check for both methods, or for neither. Right now it's possible (not when used in a Collector, but when used directly) to merge two sets of unrelated properties.
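A rough sketch of what "the same check for both methods" could look like. The class name CheckedLikesBuilder and the helper adoptName() are invented for illustration, and treating a null name as "not set yet" is an assumption rather than something taken from the code in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical builder that applies one name check in both entry points.
final class CheckedLikesBuilder {
    private String name; // null until the first named record arrives
    private final List<String> likes = new ArrayList<>();

    // Shared guard: accept a name only if none is set yet or it matches.
    private void adoptName(String candidate) {
        if (candidate == null) {
            return; // assumption: a missing name never overrides an existing one
        }
        if (name != null && !name.equals(candidate)) {
            throw new IllegalArgumentException("Expected name " + name + " but got " + candidate);
        }
        name = candidate;
    }

    // Flow 2 from above: the same builder may receive several maps.
    CheckedLikesBuilder withProperties(Map<String, String> properties) {
        adoptName(properties.get("Name"));
        likes.add(properties.get("Likes"));
        return this;
    }

    // Flow 1 from above: two builders are merged afterwards.
    CheckedLikesBuilder combine(CheckedLikesBuilder other) {
        adoptName(other.name);
        likes.addAll(other.likes);
        return this;
    }

    String name() {
        return name;
    }

    List<String> likes() {
        return new ArrayList<>(likes);
    }
}
```

With this shape, adding a Cathal map to a builder that already holds Tim's likes fails in the same way as combining a Tim builder with a Cathal builder.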
 
Tim Cooke
Sheriff
Posts: 3293
153
Clojure IntelliJ IDE Java
The functionality I've got here is as I desired, but interestingly not in the exact manner I expected.

It appears that the combining effect of Likes is achieved through a call in the withProperties() method rather than in the combine() method.

It also appears that the combine() method is never called.
 
Stephan van Hulst
Bartender
Posts: 6583
84
Likely because you're not using a parallel stream.

A parallel stream divides up the work, and lets a new accumulator reduce each part, after which the accumulators get combined. If work isn't done in parallel, there's only going to be one accumulator.
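A small experiment that shows this, as a sketch: the collector below just gathers strings into a list and counts how often its combiner runs. CombinerDemo and combinerCalls() are made-up names. For a parallel stream the count depends on how the work happens to be split, so no particular number should be expected there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collector;
import java.util.stream.IntStream;
import java.util.stream.Stream;

final class CombinerDemo {
    // Collects the stream into a list and reports how often the combiner was invoked.
    static int combinerCalls(Stream<String> stream) {
        AtomicInteger calls = new AtomicInteger();
        Collector<String, List<String>, List<String>> counting = Collector.of(
                ArrayList::new,          // supplier: one container per chunk of work
                List::add,               // accumulator: add one element to a container
                (left, right) -> {       // combiner: merge two containers
                    calls.incrementAndGet();
                    left.addAll(right);
                    return left;
                });
        stream.collect(counting);
        return calls.get();
    }

    public static void main(String[] args) {
        // Sequential: a single container, so the combiner has nothing to merge.
        System.out.println("sequential: " + combinerCalls(Stream.of("a", "b", "c")));

        // Parallel: the number of merges depends on how the source is split.
        Stream<String> parallel = IntStream.range(0, 10_000).mapToObj(Integer::toString).parallel();
        System.out.println("parallel:   " + combinerCalls(parallel));
    }
}
```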
 
Winston Gutkowski
Bartender
Posts: 10571
64
Eclipse IDE Hibernate Ubuntu
Tim Cooke wrote:Howdy Ranchers! I have a data processing puzzle I need to solve.

I'm still struggling with Streams myself, but it seems to me you've made life difficult for yourself in a few ways:

1. Your input format - if that is indeed what it looks like - is very "bloated". In fact, it looks precisely like the output of a Collection<Map<String, String>>.toString() call.
Presuming each of your records is a line, I suspect you could do a bit of pre-processing (with Streams if you want) to pare it down to lines of:
Tim, Twitter
Tim, CodeRanch
...

etc, to ensure that your Consumer only has the relevant information to worry about.

2. Your PersonalLikes class looks unnecessary to me. If I was doing this, I'd make PersonalLikes a multimap that contains the result of the collation, and simply add a public put() method that takes a single Name,Like mapping.

Then it should be a simple matter to write a Consumer function that takes an input record and ploughs it into your PersonalLikes object.
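A minimal sketch of that multimap idea, using only JDK collections rather than a third-party multimap. The class name LikesByName and its asMap() accessor are invented for illustration; only put() comes from the description above.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical multimap-style holder: one entry per name, many likes per entry.
final class LikesByName {
    private final Map<String, Set<String>> likes = new LinkedHashMap<>();

    // Adds a single Name -> Like mapping, creating the entry on first use.
    public void put(String name, String like) {
        likes.computeIfAbsent(name, k -> new LinkedHashSet<>()).add(like);
    }

    public Map<String, Set<String>> asMap() {
        return likes;
    }

    // The "Consumer" part: plough every input record into the holder.
    static LikesByName collate(Collection<Map<String, String>> records) {
        LikesByName result = new LikesByName();
        records.forEach(record -> result.put(record.get("Name"), record.get("Likes")));
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = List.of(
                Map.of("Name", "Tim", "Likes", "CodeRanch"),
                Map.of("Name", "Cathal", "Likes", "StackOverflow"),
                Map.of("Name", "Tim", "Likes", "Twitter"));
        System.out.println(collate(records).asMap());
    }
}
```

Using a Set per name drops duplicate likes; a List would keep them, which may or may not be what the requirements call for.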

HIH, and Stephan, if I'm wrong, do tell.

Winston
 
Stephan van Hulst
Bartender
Posts: 6583
84
Well, I disagree with making PersonalLikes a multi-map. It's not about what is easiest, it's what things are supposed to model. PersonalLikes is a type that encapsulates the likes of one person. I think that is a good abstraction.

I imagine the input format is so clunky because it's a direct unmarshalling from something like JSON:
Some deserializers will directly translate [] to a Collection and {} to a Map.
 
Winston Gutkowski
Bartender
Posts: 10571
64
Eclipse IDE Hibernate Ubuntu
Stephan van Hulst wrote:Well, I disagree with making PersonalLikes a multi-map. It's not about what is easiest, it's what things are supposed to model. PersonalLikes is a type that encapsulates the likes of one person. I think that is a good abstraction.

You may do, but where did Tim say that he needed it?
Tim specifically wrote: and the collation part would be that it should transform into
[{"Name":"Tim","Likes":["CodeRanch","Twitter"]}]
where the Likes are grouped by the Name.

and if you remove the redundant "Name" and "Likes", since they are only used to separate the intended key from value components, that looks exactly like a multimap mapping to me.

Furthermore, the way PersonalLikes' equals() and hashCode() methods are implemented, there is no way to ensure that a "name" is re-used without writing logic to do it. To me, it therefore fails the test of a "collator" - if that's its intended task - even if it models the contents of one of its elements.

Winston
 
Stephan van Hulst
Bartender
Posts: 6583
84
I understand now. We're arguing about what the result of the collation should be: a single PersonalLikes object, or a collection of PersonalLikes objects.
In the first case, PersonalLikes is a multi-map that contains the likes of all persons. In the second case, PersonalLikes is just an object that encapsulates the likes of one person. Which is appropriate is probably up to the requirements, but if it's the former, then I would skip the creation of a custom type, and just use a third-party library to return something like a MultiMap<Person, Like>.
 
Winston Gutkowski
Bartender
Posts: 10571
64
Eclipse IDE Hibernate Ubuntu
Stephan van Hulst wrote:In the first case, PersonalLikes is a multi-map that contains the likes of all persons. In the second case, PersonalLikes is just an object that encapsulates the likes of one person. Which is appropriate is probably up to the requirements, but if it's the former, then I would skip the creation of a custom type, and just use a third-party library to return something like a MultiMap<Person, Like>.

But it seems to me that this was a question about using Streams to perform a task, and that PersonalLikes was simply an interim step on the "road to collation". So my approach (and tell me if you think I'm wrong) was:
1. Get your "records" of Name→Like mappings (Files.lines()?).
2. Strip all redundant information.
3. (If necessary; in this case probably not) sort them.
4. Plough them into a structure that emulates a "collator".
5. Output the resulting mappings in whatever format you like.

Step 4 seems to me to be a "terminator" function, so I doubt it can be done in one pipeline; but I can't really see how it could be otherwise since, in order to know "what a Person likes", you need to process all the input.

Winston
 
Winston Gutkowski
Bartender
Posts: 10571
64
Eclipse IDE Hibernate Ubuntu
Winston Gutkowski wrote:Step 4 seems to me to be a "terminator" function, so I doubt it can be done in one pipeline...

Oooh, wait a minute (sorry if I'm being thick here). If we DO sort the input, then we can output a PersonalLikes object each time a name changes...Hmmm.

Now that makes sense to me as a single-pipeline solution. I'm just not sure how I'd do it.

Winston
 
Tim Cooke
Sheriff
Posts: 3293
153
Clojure IntelliJ IDE Java
The input data structure looks like a Collection<Map<String, String>>.toString() because that's exactly what it is. I copy and pasted the console output from my Unit Test to show the data content, but the data really is a Collection<Map<String, String>>. Unfortunately that data is constructed elsewhere and arrives in my application as it is, so the clunkiness is inflicted upon us. In fact, it's the clunky nature of it that's causing me to need this collating nonsense at all.

In reality the data structure has more fields in it. I'm reluctant to post the exact data, it being proprietary and stuff, but it's essentially the example I gave with a bunch more fields. A bit like this with a few more fields:

The processing is to marshal that data into a custom type. The 'one Like per record' is a nuisance, so I need to collate them into a single record where all other data items are equal.

I'm even less informed on Streams than you guys so I'm unsure if any of this makes a difference to the discussion. I really appreciate the help, I'm learning a lot here.
 
Winston Gutkowski
Bartender
Posts: 10571
64
Eclipse IDE Hibernate Ubuntu
Tim Cooke wrote:In reality the data structure has more fields in it. I'm reluctant to post the exact data, it being proprietary and stuff, but it's essentially the example I gave with a bunch more fields. A bit like this with a few more fields:

OK. Finally I can put my "data admin" hat on - where I'm a bit more comfortable.

First: Is there anything in those bracketed groups that can be used as a unique key (eg, FirstName + LastName)?

Second: Is this a recognised format, like Ajax? And if so, could you simply read it into an un-normalized list of Person (or InputPerson) objects?

If so, you could probably use Streams to aggregate them into a Set of actual people, but personally, I think I'd just do it the old-fashioned way: with a for-each loop.

If not, I suspect you'll need to convert your input to some "streamable" format (although Stephan may prove me wrong), and if you're going to do that, why not just convert it directly to a Set (or Map) of Person objects with all their likes combined, and forget about Streams?

It's the "hierarchical" nature of this data that bothers me ([ {..., ...}, {...}, {...}, ... ]) and the amount of redundancy in it. It can probably be parsed fairly easily, but I'm not sure if Streams are the best way to do it - although I'm sure Stephan will tell us if I'm wrong.

Winston
 
Winston Gutkowski
Bartender
Posts: 10571
64
Eclipse IDE Hibernate Ubuntu
Winston Gutkowski wrote:Second: Is this a recognised format, like Ajax? And if so, could you simply read it into an un-normalized list of Person (or InputPerson) objects?

Sorry. You said it was the output of Collection.toString() - although it seems to have been a bit "prettified" (grouped and indented) as well.

Winston
 
Tim Cooke
Sheriff
Posts: 3293
153
Clojure IntelliJ IDE Java
Winston Gutkowski wrote:... although it seems to have been a bit "prettified" (grouped and indented) as well.

Yes, that was my elaboration for readability but I'm just confusing matters instead.

The integration point with this other system is a programmatic one so the data is received as a Collection<Map<String, String>>. Actually in reality, to make it much more icky "flexible", the incoming data is a Collection<Map<String, Object>> but we know for sure in our application that the Map value is always a String.
 
Winston Gutkowski
Bartender
Posts: 10571
64
Eclipse IDE Hibernate Ubuntu
Tim Cooke wrote:
Winston Gutkowski wrote:... although it seems to have been a bit "prettified" (grouped and indented) as well.

Yes, that was my elaboration for readability but I'm just confusing matters instead.
The integration point with this other system is a programmatic one so the data is received as a Collection<Map<String, String>>. Actually in reality, to make it much more icky "flexible", the incoming data is a Collection<Map<String, Object>> but we know for sure in our application that the Map value is always a String.

Oh, OK. Well that actually makes things simpler doesn't it, because you have a POJO.

I think my old-fashioned, non-functional solution, assuming "FirstName" + "LastName" is a unique key, would be something like this:
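A sketch of that kind of loop, under the stated assumption that "FirstName" + "LastName" is a unique key. The Person class, the "Likes" field name and the LoopCollation wrapper are placeholders, since the real record layout is proprietary and not shown here.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LoopCollation {
    // Hypothetical target type: one person with all of their likes.
    static final class Person {
        final String firstName;
        final String lastName;
        final List<String> likes = new ArrayList<>();

        Person(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    static Collection<Person> collate(Collection<Map<String, String>> records) {
        // Keyed by "FirstName LastName"; LinkedHashMap keeps first-seen order.
        Map<String, Person> people = new LinkedHashMap<>();
        for (Map<String, String> record : records) {
            String key = record.get("FirstName") + " " + record.get("LastName");
            Person person = people.get(key);
            if (person == null) {
                // First record for this key: create the person.
                person = new Person(record.get("FirstName"), record.get("LastName"));
                people.put(key, person);
            }
            person.likes.add(record.get("Likes"));
        }
        return people.values();
    }

    public static void main(String[] args) {
        // The surnames are invented sample data.
        List<Map<String, String>> records = List.of(
                Map.of("FirstName", "Tim", "LastName", "Smith", "Likes", "CodeRanch"),
                Map.of("FirstName", "Cathal", "LastName", "Jones", "Likes", "StackOverflow"),
                Map.of("FirstName", "Tim", "LastName", "Smith", "Likes", "Twitter"));
        for (Person p : collate(records)) {
            System.out.println(p.firstName + " " + p.lastName + " likes " + p.likes);
        }
    }
}
```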
You could possibly make it more "functional", but I don't really see what it buys you, since you already have a Java Collection. Paralleling perhaps? I'm sure Stephan will tell us if he's around.

Winston
 
Tim Cooke
Sheriff
Posts: 3293
153
Clojure IntelliJ IDE Java
It may well be my desire to stick with the familiar, but I can get behind your approach Winston. I'll experiment with both for comparison and get my team involved for a code review. I'll report back with which approach I go for, but that might not be until next week now as I've moved on to the next pressing task.
 
Winston Gutkowski
Bartender
Posts: 10571
64
Eclipse IDE Hibernate Ubuntu
Tim Cooke wrote:I'll experiment with both for comparison and get my team involved for a code review. I'll report back with which approach I go for, but that might not be until next week now as I've moved on to the next pressing task.

No probs, Good luck with whatever you decide.

Winston
 
Stephan van Hulst
Bartender
Posts: 6583
84
Winston Gutkowski wrote:HIH, and Stephan, if I'm wrong, do tell.

although I'm sure Stephan will tell us if I'm wrong.

I'm sure Stephan will tell us if he's around.


Am I that argumentative?

Well, since you asked for it, there is no reason to avoid using a Stream when "you already have a Java Collection". The nice thing about functional code is that it's more declarative, and if you're familiar with it, can be more clear than procedural code. My usual flow is to start writing a functional solution, and if the code becomes too arcane, I switch to a procedural approach.

What you achieved in that code is a grouping of Persons by name, with each group reduced to one Person. That's *exactly* a grouping operation with a downstream reduction:
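A sketch of what such a pipeline can look like. It is an assumption-laden illustration, not the exact code from this post: the Person class, its merge() method, the "FirstName"/"LastName"/"Likes" field names and the GroupingCollation wrapper are all placeholders. Person::new and LinkedHashMap::new are used because they come up later in the discussion.

```java
import static java.util.stream.Collectors.collectingAndThen;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.reducing;

import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class GroupingCollation {
    // Hypothetical immutable Person; merge() returns a new Person with both sets of likes.
    static final class Person {
        final String name;
        final List<String> likes;

        // Builds a single-like Person from one input record.
        Person(Map<String, String> record) {
            this(record.get("FirstName") + " " + record.get("LastName"),
                    List.of(record.get("Likes")));
        }

        private Person(String name, List<String> likes) {
            this.name = name;
            this.likes = likes;
        }

        Person merge(Person other) {
            List<String> merged = new ArrayList<>(likes);
            merged.addAll(other.likes);
            return new Person(name, merged);
        }
    }

    static Collection<Person> collate(Collection<Map<String, String>> records) {
        return records.stream()
                .map(Person::new)
                // Group by name, then reduce each group to one merged Person.
                .collect(groupingBy(p -> p.name, LinkedHashMap::new,
                        // A group is never empty, so the Optional always holds a value.
                        collectingAndThen(reducing(Person::merge), Optional::get)))
                .values();
    }

    public static void main(String[] args) {
        // The surnames are invented sample data.
        List<Map<String, String>> records = List.of(
                Map.of("FirstName", "Tim", "LastName", "Smith", "Likes", "CodeRanch"),
                Map.of("FirstName", "Cathal", "LastName", "Jones", "Likes", "StackOverflow"),
                Map.of("FirstName", "Tim", "LastName", "Smith", "Likes", "Twitter"));
        collate(records).forEach(p -> System.out.println(p.name + " likes " + p.likes));
    }
}
```

Collectors.toMap() with a merge function would be a shorter way to get the same result; groupingBy() with a downstream reduction is shown here because it mirrors the wording above.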

I told the program what to do, not how to do it. The dirty details of instantiating objects, initializing variables, and possible concurrent execution are all hidden. To me, that's the advantage of streams.
 
Winston Gutkowski
Bartender
Posts: 10571
64
Eclipse IDE Hibernate Ubuntu
Stephan van Hulst wrote:Am I that argumentative?

Not at all. You just obviously feel strongly about functions. Anyway, I enjoy a good argument sometimes.

I told the program what to do, not how to do it.

Hmmm. Maybe, but it looks quite tortuous to me. My code has a for-each loop, a couple of method calls and one if statement; and anyone with a bit of Java knowledge is going to understand precisely what it does, because it's as plain as a pikestaff.

BTW, I assume that 'groupingBy()' is one of the 3 overloaded methods in Collectors. Do you not have to supply the class name?

The dirty details of instantiating objects, initializing variables, and possible concurrent execution are all hidden. To me, that's the advantage of streams.

Now that last one (concurrency) IS a major advantage - even I can see that.

I also get that a functional solution may be more scalable, but lord it's not pretty. Reminds me of a particularly arcane regular expression with nested groups and zero-length look-forwards.

It also seems that you still have to supply the method names for the other stuff.

However, cheers for that. Seems I'll have to bone up a bit more on "grouping". Have a cow.

Winston
 
Stephan van Hulst
Bartender
Posts: 6583
84
Winston Gutkowski wrote:Hmmm. Maybe, but it looks quite tortuous to me. My code has a for-each loop, a couple of method calls and one if statement; and anyone with a bit of Java knowledge is going to understand precisely what it does, because it's as plain as a pikestaff.

Yes, I do admit that Java's syntax doesn't lend itself very well to functional code. In a more functional language it could look like this instead:
The difference is mostly caused because functional languages use juxtaposition as function application (f x instead of f(x)), and Java doesn't support currying.

BTW, I assume that 'groupingBy()' is one of the 3 overloaded methods in Collectors. Do you not have to supply the class name?

Yes, but I tend to do a static star import on Collectors, Comparator, JUnit's Assert and Hamcrest's Matchers because it leads to much more fluent code.

It also seems that you still have to supply the method names for the other stuff.

You mean Person::new and LinkedHashMap::new? The latter is only necessary if you explicitly want to specify the type of map used. The ::new part sadly is necessary because Java doesn't understand implicit type conversions. You should really see it as if you're providing a type, and not a function.
 
Stephan van Hulst
Bartender
Posts: 6583
84
And thanks for the cow!
 
Dewey Allen
Greenhorn
Posts: 3
IBM DB2 Java
Apologies for posting 2-3 weeks after the discussions on this thread ended. I am studying for the OCP exam, trying to better understand streams, and found this thread to be helpful. Thank you.

For what it's worth, I combined the code / code snippets that had been posted into a working example with a few edits / additions. Requires Google Guava in the classpath (guava-19.0.jar).

 