• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Jeanne Boyarsky
  • Ron McLeod
  • Paul Clapham
  • Junilu Lacar
Sheriffs:
  • Rob Spoor
  • Liutauras Vilda
  • Tim Cooke
Saloon Keepers:
  • Tim Moores
  • Piet Souris
  • Tim Holloway
  • Jj Roberts
  • Stephan van Hulst
Bartenders:
  • Himai Minh
  • Carey Brown
  • Frits Walraven

Should Streams be Cloneable?

 
Master Rancher
Posts: 3948
51
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Winston Gutkowski wrote:

Mike Simmons wrote:It's entirely possible that what is really needed here is a list (or Set) of all the longest names.


Which isn't a particularly easy task either. Adding is dirt simple, but removal isn't - particularly if you want it to run faster than O(n) - and even harder if you want the structure as a whole to act like a List.


I'm not sure what the problem is here. We don't need to remove at all, do we? But if we did, seems like it would be O(1) for each element to be removed. If you want elements in insertion order, that can be achieved easily too with the following identity:

Anyway, since the stateless reduce() ended up less efficient than I'd hoped, I'd probably just take one of the Collector implementations already shown and change the new ArrayList() to the identity just shown. Seems like it's pretty much a solved problem at this point. Or have I misunderstood what you're saying?
 
Saloon Keeper
Posts: 13090
281
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Aye, using a Collector, in the worst case (the very last String is the single longest) you need 2n-1 additions and deletions, which is O(n).
 
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Stephan van Hulst wrote:// This purports to be an intermediary operation, but actually closes the input stream before we call a terminal operation.
Stream<String> longest = getLongest(strings);

The solution to this problem is to have getLongest() return a new Stream that does 'late binding'. I'm afraid the code is not pretty:


OK, so are you saying that the whole idea of our Collector above is a non-starter, because I simply don't follow.
If Rob's (and your) Collector can use:
to produce a List, then surely you can simply add stream() to it to continue the "pipeline" (which I realise isn't a single one any more), viz:and if you can do that, why can't you just wrap the whole thing in an appropriately documented function (which I realise probably isn't a "function" in the general sense)?

Or indeed, to save time and effort, go back to my original idea, slightly modified:or does that Stream get closed too? I certainly didn't ask for it to be.

It seems to me that if I can't easily output a Stream from a method, we have to either:
(a) Create and manipulate and terminate/collect a Stream in a single pipeline - which suggests that every time I want a "longest" filter, I have to copy and paste the code.
(b) Jump through a pile of hoops to do something "functionally" that (IMO) ought to be dead simple.

Or is a "longest" filter (or something like it) simply not something that crops up very often in functional programming?
I do remember, back in the early '80's, being surprised that something as simple as getting the "top" (or first) n rows of an SQL query was incredibly involved (it's been "fixed" since then...sort of), and it sounds like this may be a similar issue.

Winston
 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Mike Simmons wrote:I'm not sure what the problem is here. We don't need to remove at all, do we? But if we did, seems like it would be O(1) for each element to be removed. If you want elements in insertion order, that can be achieved easily too with the following identity:


No it can't, because the requirement - OK, my interpretation of it - was to return the longest Strings in a List; not the longest distinct Strings.

To me, Piet's HashMap<Integer, List<String>> is pretty much what I see as a generalised structure for maintaining Strings by length, but its remove() function is still going to be O(n)-ish, and it also can't preserve Strings in insertion order.

Seems like it's pretty much a solved problem at this point. Or have I misunderstood what you're saying?


I was probably going off at a tangent - a common habit of mine . I was just pondering on Piet's Map, and wondering whether it could be modified to also maintain insertion order. It's probably not wildly important; but it just struck me as an interesting problem.

Winston
 
Stephan van Hulst
Saloon Keeper
Posts: 13090
281
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Winston Gutkowski wrote:OK, so are you saying that the whole idea of our Collector above is a non-starter, because I simply don't follow.


No, it works. Your solution will also work, but it will violate the principle of least astonishment. When you 'continue a pipeline', you expect nothing to get evaluated yet. Evaluation should happen upon a terminal operation.

Your method looks like an intermediary stream operation, but it actually eagerly evaluates the stream.

to produce a List, then surely you can simply add stream() to it to continue the "pipeline" (which I realise isn't a single one any more), viz:
...
and if you can do that, why can't you just wrap the whole thing in an appropriately documented function (which I realise probably isn't a "function" in the general sense)?


Then why return a Stream at all? Streams are meant to facilitate functional programming.

Or indeed, to save time and effort, go back to my original idea, slightly modified:
...
or does that Stream get closed too? I certainly didn't ask for it to be.


That method is fine.

It seems to me that if I can't easily output a Stream from a method, we have to either:
(a) Create and manipulate and terminate/collect a Stream in a single pipeline - which suggests that every time I want a "longest" filter, I have to copy and paste the code.
(b) Jump through a pile of hoops to do something "functionally" that (IMO) ought to be dead simple.


It's easy to create a new stream operation (in the form of a method that accepts a Stream and returns a Stream) if it consists of only non-terminal operations.

If you have to terminate the stream anywhere in order to continue processing, then it no longer makes sense to return a Stream, because you've already loaded all data in memory. To take Paul's idea of comparing Stream to InputStream: A method that accepts a Stream, terminates it, and returns a new Stream is a bit like reading data from an InputStream into a Collection, processing it, storing it again in a different file, and then returning an InputStream of the new file. Why not just return the processed collection?

Or is a "longest" filter (or something like it) simply not something that crops up very often in functional programming?


In many functional languages, lazy evaluation is built-in. There's no difference between a list and a stream. I can't really explain it in Java terms, I can only recommend trying such a language.

As for getting the longest of a bunch of strings, I don't think that's such a common operation anywhere.

I do remember, back in the early '80's, being surprised that something as simple as getting the "top" (or first) n rows of an SQL query was incredibly involved (it's been "fixed" since then...sort of), and it sounds like this may be a similar issue.


I think there's a difference though. In the end, Java is an imperative language, and Stream is just a tool to get some pieces of functional programming done. Java simply isn't meant to do everything in one big functional pipeline. It's important to realize what the limitations of streams are, and where it makes sense to use them.
 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Stephan van Hulst wrote:No, it works. Your solution will also work, but it will violate the principle of least astonishment. When you 'continue a pipeline', you expect nothing to get evaluated yet. Evaluation should happen upon a terminal operation.

But it will, just on the new Stream that I've created...but I take your point.

Then why return a Stream at all? Streams are meant to facilitate functional programming.

Yes, but within the context of a language that is already Object-Oriented surely. I may have very good reasons for wrapping my operation in an object that have more to do with information (and specifically implementation) hiding than "functionality".

That method is fine.

Oh good. Phew.

It's easy to create a new stream operation (in the form of a method that accepts a Stream and returns a Stream) if it consists of only non-terminal operations.

OK. I guess I just have to live with that.

If you have to terminate the stream anywhere in order to continue processing, then it no longer makes sense to return a Stream, because you've already loaded all data in memory.

Really? That doesn't make sense. If I use a BufferedReader instead of a Stream to process, say, 10 billion lines from a log file, that object only needs enough memory to store the longest line in the file (although for practical purposes it obviously uses a bit more). Why should a Stream be any different? I'm evaluating it, not reading it into memory.

In many functional languages, lazy evaluation is built-in. There's no difference between a list and a stream. I can't really explain it in Java terms, I can only recommend trying such a language.

Yeah, it definitely looks like I'm going to have to read more.

As for getting the longest of a bunch of strings, I don't think that's such a common operation anywhere.

Perhaps not, not but many operations, like "top 10 lists" and sorting are essentially "two-pass" (or in the latter case, possibly multi-pass) operations.
Now clearly, Java have saved us the bother of writing a late-binding sort, because of the obvious difficulties, and many others are simply a "sort + some function". But this particular case isn't. It's simply a filter that relies on a predicate that can't be determined without first evaluating the entire Stream.
We certainly could do it with a "sort in descending order of length" + "filter while strings are the same length", but that's introducing a complex function to a problem that is essentially linear.

In "objective" terms I see it like this: You have something called a Stream that has to be produced by something, be it a function, collection, a file channel, or anything else that can produce a stream of data objects. Personally, I'd call that a Streamable and require it to implement a stream() method; and it also seems reasonable to me that a Stream created by a Streamable could then contain a reference to its source, which would allow de-facto cloning.
I do take your point about something like the Fibonacci function you wrote, but you could equally well imagine a Streamable Fibonacci function that simply spews out values from Fₘ to Fₙ (or even an open-ended Stream) which is a style I remember from Smalltalk.

I think there's a difference though. In the end, Java is an imperative language, and Stream is just a tool to get some pieces of functional programming done. Java simply isn't meant to do everything in one big functional pipeline. It's important to realize what the limitations of streams are, and where it makes sense to use them.


Very true. It's also very new, and I suspect we may see some new stuff (like the v9 method Mike alluded to) to plug gaps in the current API.

I've also got to get clear in my mind the exact differences between Streams, Suppliers and Spliterators, because right now they're "doin' my head in".

As always, thanks for your sage advice and infinite patience. Have a cow.

Winston
 
Stephan van Hulst
Saloon Keeper
Posts: 13090
281
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Winston Gutkowski wrote:If I use a BufferedReader instead of a Stream to process, say, 10 billion lines from a log file, that object only needs enough memory to store the longest line in the file (although for practical purposes it obviously uses a bit more). Why should a Stream be any different? I'm evaluating it, not reading it into memory.


Using BufferedReader to process each line in a huge text file is like performing a forEach() on a Stream, which is not functional in the first place; forEach() necessarily has side-effects, otherwise it would be useless. We're talking about performing a reduction on a stream, the result of which will be in memory. If the the only other thing you're going to do with that result is stream() it, you might as well just return the reduced value and leave it to the client to decide if they want to continue with stream operations.

[...] and it also seems reasonable to me that a Stream created by a Streamable could then contain a reference to its source, which would allow de-facto cloning.


I'm still not quite sure what this actually means. Maybe you can write a pseudo-code example of something you wanted to do with such a cloning operation?

It's also very new, and I suspect we may see some new stuff (like the v9 method Mike alluded to) to plug gaps in the current API.


Yeah, I actually haven't really looked at what they'll add to Java 9, but I'm anxious for the next release.

I've also got to get clear in my mind the exact differences between Streams, Suppliers and Spliterators, because right now they're "doin' my head in". :wink:


  • Stream is a pipeline of operations bound to a data source. It's a bit like an unevaluated expression.
  • Spliterator is the thing that actually performs the evaluation when a terminal operation is performed.
  • Supplier is just a functional interface that represents a nullary function.

  • As always, thanks for your sage advice and infinite patience. Have a cow.


    You're very welcome, and thank you :)
     
    Winston Gutkowski
    Bartender
    Posts: 10780
    71
    Hibernate Eclipse IDE Ubuntu
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator

    Stephan van Hulst wrote:Using BufferedReader to process each line in a huge text file is like performing a forEach() on a Stream


    I beg to differ. A BufferedReader is simply a channel to an input source (usually a File, but it doesn't have to be) which buffers (or queues) that input for physical I/O efficiency, and also allows me to process the data in larger units than a byte or int. How I use it though is entirely up to me and, for all I know, it's also lazily-instantiated (although I suspect not entirely). The fact that it's usually used to do the equivalent of a "for-each line" has nothing to do with what it is.

    Resetting or re-positioning a BufferedReader might be a problem though - I do see that - so we have the same problem of "two-pass" operations on a BR, as we do with Streams.

    forEach() necessarily has side-effects, otherwise it would be useless. We're talking about performing a reduction on a stream, the result of which will be in memory.


    Right. The result, not the Stream itself. And since we're performing a reduction, I would assume that that would be smaller (and possibly significantly smaller) than it.

    If the the only other thing you're going to do with that result is stream() it, you might as well just return the reduced value and leave it to the client to decide if they want to continue with stream operations.


    But that's not sufficient. I have a function called "get longest Strings" that, given a Stream<String>, does indeed return a reduced set of them. Unfortunately - and what seems to be our sticking point - is that it also has to process the entire Stream in order to work out what set to return. In an OO world, I might be receiving that Stream from a system (or program) that I have no control over - it's simply (as far as I'm concerned) a Stream.

    [...] and it also seems reasonable to me that a Stream created by a Streamable could then contain a reference to its source, which would allow de-facto cloning.


    I'm still not quite sure what this actually means. Maybe you can write a pseudo-code example of something you wanted to do with such a cloning operation?



    OK, let's assume that java.util.Collection implements my Streamable interface, viz:
    Now assume that we have a static nested class of Stream:and now let's assume that a Stream contains internal information about its "current position" as an element index. Then clone() becomes something like:
    The latter method is optional, but none of this seems outlandish to me - although I suspect you'll tell me that it violates all sorts of "functional" principles. But NOW I can write my longest() method as I originally wanted to, viz:
    Hope I've explained myself. Obviously clone() would need to be documented with a caveat about cloning "compromised" Streams - ie, a Stream that might not have a 1:1 relationship with its source in terms of elements - but I suspect that could be sorted with a canClone() method.

    Winston
     
    Mike Simmons
    Master Rancher
    Posts: 3948
    51
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    For that, I believe you just need to write the method to take a Supplier<Stream> as input:


    Prints "three" as expected. Of course, it may be inefficient to create and traverse the stream twice. But you can certainly do it.
     
    Mike Simmons
    Master Rancher
    Posts: 3948
    51
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    Or, to potentially simplify the client call, we can separate the source of the data from the function that knows how to make it into a stream:

    Prints
     
    Saloon Keeper
    Posts: 4560
    182
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    Great! And having come this far, it should now be a doddle to write the 'longestButOne' method.
     
    Mike Simmons
    Master Rancher
    Posts: 3948
    51
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    If there's some motivating reason, sure. I'm sure you can modify the requirements to push it more towards something where the Map<Integer,List<String>> solution makes sense, if you like. But why bother?
     
    Winston Gutkowski
    Bartender
    Posts: 10780
    71
    Hibernate Eclipse IDE Ubuntu
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator

    Mike Simmons wrote:For that, I believe you just need to write the method to take a Supplier<Stream> as input:


    Aha! Now that looks more like it. What a great idea! Have a cow.

    Prints "three" as expected. Of course, it may be inefficient to create and traverse the stream twice. But you can certainly do it.


    Well I assume that one of the reasons Streams are lazily instantiated is to save time, and in this case, unless you transform it into a Map like Piet's, or a List like the one we created in our Collector, you have no choice about traversing twice, whether the solution is functional or procedural - so the time it takes to create one is irrelevant.

    I think that's about as close as we're going to get...unless anyone has any better ideas.

    Winston
     
    Winston Gutkowski
    Bartender
    Posts: 10780
    71
    Hibernate Eclipse IDE Ubuntu
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator

    Mike Simmons wrote:Or, to potentially simplify the client call, we can separate the source of the data from the function that knows how to make it into a stream:


    Surely, all we need is a wrapper, viz:and then we can set up any variant we like:and call your longest() method exactly as before.

    And I'd care to bet there's an even more "functional" way to do it too (which I suspect is what you were showing me in your last post; I have to admit, I didn't quite follow it all).

    But now I feel like we're cooking with gas. Thanks a lot!

    Winston
     
    Winston Gutkowski
    Bartender
    Posts: 10780
    71
    Hibernate Eclipse IDE Ubuntu
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator

    Winston Gutkowski wrote:and then we can set up any variant we like
    ...
    and call your longest() method exactly as before.


    Ah. I see a potential flaw in my example above, but I think it's just a matter of documentation. A BufferedReader may not be able to create two Streams, or it might create an empty one the second time lines() is called; as opposed to a File or Path (or Collection), which should be able to create as many as we want.

    @Mike: is that what you were referring to when you talked about "separating the source from the function"?

    Winston
     
    Mike Simmons
    Master Rancher
    Posts: 3948
    51
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    Yes, that's the sort of thing I was talking about. Whether we create wrapper classes or static utility functions, we can make variants customized for particular sources, so that most clients will not need to think about the syntax of how to write a function to create a stream.

    BufferedReader is an unfortunate example, because there's no way to "reset" the stream. (Yes, there is a reset() method, but it depends on mark() and is limited by the buffer size; not good enough for our purposes I think.) If the source is a file, then it's easy enough to get a stream of lines using Files.lines(), for example. But for an arbitrary InputStream or Reader, we don't have that. How to recreate all the data that was read from a port previously? If that's really necessary, you could create custom FilterInputStream and FilterReader implementations that intercept everything read (when it's first read) and copy it to a file, while also passing the data on to the caller. Subsequently you can create a new Stream by reading from that file.

    Winston Gutkowski wrote:Well I assume that one of the reasons Streams are lazily instantiated is to save time, and in this case, unless you transform it into a Map like Piet's, or a List like the one we created in our Collector, you have no choice about traversing twice, whether the solution is functional or procedural - so the time it takes to create one is irrelevant.


    It's not just the time to create it, but to read all the data from it a second time. (Or, write and read, if using the idea in my last paragraph.) If that involves repeated file access, you may be doubling the time for the algorithm, or more. If there's no choice, fine, there's no choice - but I would argue that there is a benefit to writing your code in a way that only one pass through the data is necessary. If possible. Becoming conversant in Collector and reduce() techniques helps that considerably.

    Glad you liked the Supplier<Stream<T>> idea though.
     
    Mike Simmons
    Master Rancher
    Posts: 3948
    51
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    To clarify one point: you don't need a separate wrapper class at all; you can also use static utility methods like these:

    or just

    Checked exceptions are a PITA here, but I'm long past any sense of guilt at getting rid of them this way.
     
    Mike Simmons
    Master Rancher
    Posts: 3948
    51
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator

    Mike Simmons wrote:How to recreate all the data that was read from a port previously? If that's really necessary, you could create custom FilterInputStream and FilterReader implementations that intercept everything read (when it's first read) and copy it to a file, while also passing the data on to the caller. Subsequently you can create a new Stream by reading from that file.


    On further thought, it's more complicated than this. If we create multiple "identical" streams, we can't assume that they will be traversed at the same rate, or consumed at all. And we can't assume that the "first" stream will always be the first one that consumes a given portion of data. So rather than let the first stream use a FilterReader and copy its data for consumption by later streams, we need to assume that any of the streams could be become first. Maybe we create a private "leader" stream operating in a separate thread (or set of threads) to consume and copy as much of the source as is available, and then all the stream used by clients will follow this, consuming from files whatever has been written to file, and blocking on the leader once they catch up with the leader.

    Mmmm, it's getting complicated, aside from potentially large disk usage. Which goes back to why several of us were so resistant to the idea of cloning an arbitrary stream in the first place. However, if there is an easy way to "clone" a given stream (e.g. from an array, Collection, file, or generator function) then Supplier<Stream<T>> gives an easy way to connect that cloning method to the "longest" functionality.
     
    Winston Gutkowski
    Bartender
    Posts: 10780
    71
    Hibernate Eclipse IDE Ubuntu
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator

    Mike Simmons wrote:Mmmm, it's getting complicated, aside from potentially large disk usage. Which goes back to why several of us were so resistant to the idea of cloning an arbitrary stream in the first place.


    The Java docs actually call this - or at least what I originally envisaged - "forking" (nudge-nudge, wink-wink) and specifically state that you can't do it, which is fair enough.

    However, if there is an easy way to "clone" a given stream (e.g. from an array, Collection, file, or generator function) then Supplier<Stream<T>> gives an easy way to connect that cloning method to the "longest" functionality.


    Yeah. And it's all I was really looking for, TBH: A way of separating a Stream source from functions that might use it.

    And the reason for it - quite apart from any "cloning" requirement - is that it reduces coupling. Rather than a program that I may have no control over having to hand me a Collection, it can hand me a Source<Whatever>, and later on, if need be, change the implementation of that Source without having to worry about anything downstream (no pun intended).

    It also occurs to me that a 'Source' doesn't even need to be "cloneable". You simply specify, at construction time, whether get() can be called on it more than once.

    BTW, I think I finally worked out the business of your static method, because I was trying to work out how I'd code a factory method for Source.
    My class, as it stands right now, defines a supplier type, which I don't really want clients to see, so I need a mechanism for hiding it.
    Doing it the old-fashioned way, I'd probably do something like:and now our declarations become:Like I say, I'm sure there's a more "functional" way of doing it, but I'm still getting my head around it all.

    But thanks muchly for all your help.

    Winston
     
    "To do good, you actually have to do something." -- Yvon Chouinard
    Thread Boost feature
    https://coderanch.com/t/674455/Thread-Boost-feature
    reply
      Bookmark Topic Watch Topic
    • New Topic