• Post Reply Bookmark Topic Watch Topic
  • New Topic

Java 8 - Filter one list based on another  RSS feed

 
Mike London
Ranch Hand
Posts: 1505
11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Say you have a book, text, or whatever that you want to do frequency analysis on, but FIRST, you want to filter out words not to consider.

Below, I have three possible approaches I've been working on, but none correctly filters the "allWordsList" to avoid the "avoidWordList" words.

The output should avoid words "one" and "two", but when running this test code, I either get the entire "allWordsList" or nothing.

Thanks in advance for suggestions.

- mike


 
 
Mike London
Ranch Hand
Posts: 1505
11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Got one of the ones above to work, finally.

Update on word list definitions:

  List<String> avoidWordList = Arrays.asList("one", "two");
  List<String> allWordsList = Arrays.asList("There", "were", "two", "people", "who", "always", "said", "one", "thing", "over", "and", "over");

.
.
.
// show all words except those in the avoidWordsList

        allWordsList.stream()
                .filter(word -> !avoidWordList
                .stream().collect(Collectors.toSet())
                .contains(word)).collect(Collectors.toList())
                .forEach(System.out::println);

--

All the permutations of putting these streaming pieces together can be very confusing and time consuming.

-- mike
 
Stephan van Hulst
Saloon Keeper
Posts: 7991
143
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
That algorithm performs in O(m*n) though, where m is the size of your avoid list, and n is the size of your word list. Why not just build your avoid set right away?

This algorithm runs in O(n).
 
Mike London
Ranch Hand
Posts: 1505
11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Stephan van Hulst wrote:That algorithm performs in O(m*n) though, where m is the size of your avoid list, and n is the size of your word list. Why not just build your avoid set right away?

This algorithm runs in O(n).


That's cool. How can you tell my posted code runs in O(m*n)? I don't see that.

There are a bazillion ways to put these streams together (which don't work as expected in many cases) and your posted one works fine -- though I don't know how long it would have take me to stumble on to your nice code without you posting it.

I really like the declarative model, but until you're an "expert", it takes more time than just writing the $*)#*$@# (imperative) code.

Thanks,

- mike
 
Stephan van Hulst
Saloon Keeper
Posts: 7991
143
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Your algorithm runs in O(m*n), because for every word in your n-sized word list, the filter rebuilds the entire m-sized avoid set, and then throws it away again. When you build the set only once, the complexity becomes O(m+n), which is the same as O(n) if the avoid set is always shorter than the word list.

you should consider most terminal operations on streams (like collect()) to be equivalent to a for-loop. That means that if you perform a terminal operation inside a stream pipeline that is also terminated, it's essentially equivalent to a nested for-loop. Your method is equivalent to:

It all becomes much easier once you get a lot of practice with all of the stream operations and collectors. After a while all these operations become very natural to use, and piping them together will feel as straightforward as writing procedural code.
 
Mike London
Ranch Hand
Posts: 1505
11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Stephan van Hulst wrote:Your algorithm runs in O(m*n), because for every word in your n-sized word list, the filter rebuilds the entire m-sized avoid set, and then throws it away again. When you build the set only once, the complexity becomes O(m+n), which is the same as O(n) if the avoid set is always shorter than the word list.

You should consider most terminal operations on streams (like collect()) to be equivalent to a for-loop. That means that if you perform a terminal operation inside a stream pipeline that is also terminated, it's essentially equivalent to a nested for-loop. Your method is equivalent to:

It all becomes much easier once you get a lot of practice with all of the stream operations and collectors. After a while all these operations become very natural to use, and piping them together will feel as straightforward as writing procedural code.


Stephan,

That's a very helpful explanation - especially helping me better understand terminal operations -- and at the right level for where I am.

Thank you very much for your patience and terrific help!

- mike
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!