• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Jeanne Boyarsky
  • Ron McLeod
  • Paul Clapham
  • Liutauras Vilda
Sheriffs:
  • paul wheaton
  • Rob Spoor
  • Devaka Cooray
Saloon Keepers:
  • Stephan van Hulst
  • Tim Holloway
  • Carey Brown
  • Frits Walraven
  • Tim Moores
Bartenders:
  • Mikalai Zaikin

Select record containing duplicate word values disregarding order/position?

 
Ranch Hand
Posts: 74
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Hi guys,

Is this possible to select in SQL: I have a column words_list (contains thousands of records). The data in that column contains words separated by whitespace. I need to select records in that column that occurred twice or more, disregarding the order of words per row.

For example,
ball blue clock
clock blue ball
computer
computer macbook
radio

If I run a select statement, it will retrieve either the "ball blue clock" or "clock blue ball" because they match each words per row, disregarding the order of words. "computer" and "computer macbook" are not retrieved because there's no "macbook" on the other data.


I'm not sure if is this doable in SQL.
 
Rancher
Posts: 1093
29
Netbeans IDE Oracle MySQL Database Tomcat Server C++ Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
It is doable, but not as easily as in some places. I would start by segregating the words into their own table and breaking them down into a single column. You can then do a query based on the original column containing the word from the list and then group by the word field in the new table. It would be a screwy join, but it should work.
 
Sheriff
Posts: 11604
178
Hibernate jQuery Eclipse IDE Spring MySQL Database AngularJS Tomcat Server Chrome Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I wonder if SQL is the best tool for this job...
 
Bartender
Posts: 2407
36
Scala Python Oracle Postgres Database Linux
  • Likes 2
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
As Les and Roel pointed out, this is a bit tricky with plain SQL.

One work-around would be to create an extra column sorted_words on your table, and populate that with the sorted list of words from the words_list column. You might also want to make sure all the words are converted to the same case e.g. lower-case. You would need to populate this column whenever you insert/update values in the words_list column, and you'd need to run a one-time conversion script to populate the sorted_words column for the data that's already in your table. How you do that is up to you - "an exercise for the reader".

Alternatively, if you have the opportunity, you could simply make sure the existing words_list column always contains the sorted list of words in the first place.

Once you've done this, your data will look like this:

words_listsorted_words
"ball blue clock""ball blue clock"
"clock blue ball""ball blue clock"
"computer""computer"
"computer macbook""computer macbook"
"radio""radio"


Now you've got the sorted words for each record, you can run a simple SQL query to find the duplicate entries:


This will return the sorted_words that exist in multiple records:

sorted_wordsCOUNT(*)
"ball blue clock"2


You can probably wrap this basic query in some more fancy SQL to fetch the original words_list values for each record that has this sorted_words value, or just do it as a second query using the results of the GROUP BY query. For example, the following query works on PostgreSQL, but might not work on other databases:

words_list
"ball blue clock"
"clock blue ball"

You'll also want to index the words_list and sorted_words columns to improve query performance.
 
Bartender
Posts: 598
26
Oracle Notepad Linux
  • Likes 2
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

chris webster wrote:As Les and Roel pointed out, this is a bit tricky with plain SQL.

One work-around would be to create an extra column sorted_words on your table, and populate that with the sorted list of words from the words_list column. You might also want to make sure all the words are converted to the same case e.g. lower-case.



Great idea! Although, no reason for a new column when a query can do it dynamically. Though, that query is a bit complicated. The basics are, split the words, order them, and put 'em back together. There are many ways to do this, and it depends on the RDBMS amongst other things, so here's one solution, using recursive SQL and analytic functions. I wrote it for Oracle, but can be switched to SQL Server, for example, relatively easily. I differed in the duplicate finding, changing your idea of a join into an analytical function with a comparison, which i think is easier to understand, especially given the rest of the query.

Note to moderators: Because of the complexity inherent in recursive queries and analytical functions, especially here where there are more than one, and they work together, i do not think i can offer simple suggestions. Therefore, i am providing a solution. If this is not allowed, please edit the post accordingly, or send me suggestions. I will be happy to comply.

 
Roel De Nijs
Sheriff
Posts: 11604
178
Hibernate jQuery Eclipse IDE Spring MySQL Database AngularJS Tomcat Server Chrome Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Brian Tkatch wrote:Note to moderators: Because of the complexity inherent in recursive queries and analytical functions, especially here where there are more than one, and they work together, i do not think i can offer simple suggestions. Therefore, i am providing a solution. If this is not allowed, please edit the post accordingly, or send me suggestions. I will be happy to comply.


The CodeRanch policy is against providing a fully functional solution because that's rarely very helpful as you can copy/paste the code without exactly knowing what this code actually does. This nice quote illustrates perfectly what we are trying to do here at CodeRanch: Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime. But I agree such queries are very complex and providing simple suggestions to end up with such a complex query is probably very hard and might even be mission impossible.

But it might be very helpful to the OP (and other ranchers) if you could provide some explanations about the different parts of this complex query. That will probably make it easier to digest and understand such an enormous query So I'm happy to let this one pass if some additional explanations about different parts are added as well. Then the OP (and other ranchers) have a much better idea about what's going on and if it's needed they can comment and ask some follow-up questions (if needed). This way it will still be a very instructive experience for everyone :thumbup (I did something similar with a much more simple query here )
 
Brian Tkatch
Bartender
Posts: 598
26
Oracle Notepad Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Roel De Nijs wrote:But it might be very helpful to the OP (and other ranchers) if you could provide some explanations about the different parts of this complex query. That will probably make it easier to digest and understand such an enormous query



I'll be happy to. Could you give me some pointers on what to explain? Chris Webster provided the idea for the solution, which is what i implemented.

As the post says "The basics are, split the words, order them, and put 'em back together." To that end, the CTE are named by what they do: Separate_The_Words, Order_The_Words_With_An_Id, Concatenate_The_Words_By_Id, and Identify_Duplicates. Would it help to provide more detail on how recursive queries or analytical functions work? I would be happy to help, though, i'd think it would be more helpful in a separate thread dedicated for that purpose. They're only easy after they "click".
 
Roel De Nijs
Sheriff
Posts: 11604
178
Hibernate jQuery Eclipse IDE Spring MySQL Database AngularJS Tomcat Server Chrome Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Brian Tkatch wrote:I'll be happy to. Could you give me some pointers on what to explain? Chris Webster provided the idea for the solution, which is what i implemented.


After having a closer look at this massive query myself and trying to understand it myself I think it might be useful to explain a bit about what's the purpose of ROW_NUMBER and PARTITION. I think other functions like INSTR and SUBSTR are self-explanatory based on their names.

And because not everyone has an Oracle environment available to run this query (like me ), it might be useful (if possible) to show (a part of) the result table after executing the different parts of this query. The result of the first step (Data(Words)) is quite obvious. But what's the result if you execute the next step (Separate_The_Words)? And similar for the next steps. I think that might be useful for 2 reasons. First because a concrete example with actual data makes it easier to understand the code. And secondly because if a rancher wants to convert this into a similar query for MySQL or SQL Server, he/she knows the different temporary results which probably makes it easier to troubleshoot as you can verify if your temporary result is similar or completely different.
 
Brian Tkatch
Bartender
Posts: 598
26
Oracle Notepad Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Roel De Nijs wrote:After having a closer look at this massive query myself and trying to understand it myself I think it might be useful to explain a bit about what's the purpose of ROW_NUMBER and PARTITION. I think other functions like INSTR and SUBSTR are self-explanatory based on their names.

And because not everyone has an Oracle environment available to run this query (like me ), it might be useful (if possible) to show (a part of) the result table after executing the different parts of this query. The result of the first step (Data(Words)) is quite obvious. But what's the result if you execute the next step (Separate_The_Words)? And similar for the next steps. I think that might be useful for 2 reasons. First because a concrete example with actual data makes it easier to understand the code. And secondly because if a rancher wants to convert this into a similar query for MySQL or SQL Server, he/she knows the different temporary results which probably makes it easier to troubleshoot as you can verify if your temporary result is similar or completely different.



I can show the data tomorrow. Right now i will explain what those do, and how to convert this to run on SQL Server.

To run on SQL Server, remove the "FROM Dual"s. Dual is an Oracle table with one record in it. And, unlike SQL Server, all queries in Oracle must use a table. SQL Server does not have that requirement, and it can be removed. Alternatively, just add a table called Dual with one record. Oracle's Dual has a column called Dummy with a value of 'x'.

ROW_NUMBER() adds row numbers. However, as an analytical function, it must be told what data to work on. PARTITION BY is optional, and it specifies what group to add to. ORDER BY is required, so it knows in which order to add the numbers. A simple example which i hope will be self explanatory:



In analytical functions, PARTITION BY acts GROUP BY without affecting the rest of the query. (As if it were a sub-query with a GROUP BY. ) The purpose in this query was to identify the ordered words so they could be put back together in the next query.





 
chris webster
Bartender
Posts: 2407
36
Scala Python Oracle Postgres Database Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Brian's given a clever solution here, without introducing extra columns to the underlying table.

My concern would be around indexing and query performance. If you are doing all the list-parsing, sorting and grouping etc in memory at runtime, then you may lose the benefits of any indexing on the underlying data, because you are no longer working on the stored data. I think GROUP BY performs a sort, which is more efficient on indexed data, so it may be better to be able to GROUP BY (or PARTITION BY) a stored indexed value. There is probably a trade-off here e.g. adding an extra table column for the ordered list of words, or simply storing the data as a sorted list in the first place when it is created, introduces extra processing at write-time. But it makes your indexing and query performance much easier to manage at read-time. No free lunches, either way!
 
Brian Tkatch
Bartender
Posts: 598
26
Oracle Notepad Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

chris webster wrote:My concern would be around indexing and query performance.



The OP mentioned thousands of records. That's nothing, especially if the words are few. If it gets into millions, or many more words, there might be concern, which a test would show. Even so, adding a column means the code must maintain its integrity, and that is something prone to failure. I would strongly recommend against it. Nonetheless, if the column is wanted, but managed by the database, there are many ways to do that, based on the RDBMS, of course. Oracle can use a materialized view, SQL Server, an indexed view, or a stored procedure can maintain in by being the only method of insert or update, Ultimately, a trigger could be used, but i believe triggers to be an even worse problem.

Ultimately, the query can probably be sped up in a few ways anyway. In Oracle, a hierarchical query might work better, and perhaps even some regex. There are ways, but first it must be identified if it actually is an issue.

FWIW, Chris, this is your solution. I just wrote the query. :P At first i wondered if it was easier to use md5 or something. Luckily, i saw the folly of that idea quickly.
 
Brian Tkatch
Bartender
Posts: 598
26
Oracle Notepad Linux
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
As requested, i will run the query above piece by piece, and show the data. I will do each part, for completeness:

The first CTE (Common Table Expression) supplies the data. Since CTEs are supported by nearly every database, it is likely the best way to supply data when asking questions.


It is named Data here, since, well, it's data. The column is named Words, because that seemed to be the simplest name for the subject data. As expected, the query itself just returns the data:

Words
ball blue clock
clock blue ball
computer
computer macbook
radio

Note, as is always true, the order of the data is not guaranteed without an ORDER BY. It just happens to be convenient for the database to return the data in the same order, but it is not to be relied upon!

Next comes:


It is a recursive CTE, and the the first part lowers the text per Chris's suggestion, as most databases are case sensitive by default. Efficiency would dictate that the the first part start the first step. That was rejected for clarity, leaving all the logic in the second part. This introduces a redundant record with a NULL. Not a problem, as it is ignored later.

Consistent with its name, it splits the words into separate records, keeps the Words column as the id for the group. The_Rest is a working column, holding what hasn't been split yet. One_Word has one (new) word of The_Rest in each record.

This produces the following:

WORDSTHE_RESTONE_WORD
ball blue clockball blue clock(NULL)
ball blue clockblue clockball
ball blue clockclockblue
ball blue clock(NULL)clock
clock blue ballclock blue ball(NULL)
clock blue ballblue ballclock
clock blue ballballblue
clock blue ball(NULL)ball
computercomputer(NULL)
computer(NULL)computer
computer macbookcomputer macbook(NULL)
computer macbookmacbookcomputer
computer macbook(NULL)macbook
radioradio(NULL)
radio(NULL)radio

The order is completely different, but that doesn't matter, i reordered it here for clarity, as well as adding in (NULL) to help with the forum formatting. Anyway, we now have all the words split into separate records but connected to each other by their common Words column. The_Rest can be completely ignored at this point, as it was only a working column.

Connecting the words must be done, but in what order? Slapping words together is a bit tricky, and something many people want a function for. Oracle (finally) added one recently called LISTAGG(), PostegreSQL has STRING_AGG, and MySQL has GROUP_CONCAT, but others may not have a method. Because of this, i did it the "old fashioned" way. To make it really easy to understand, this step is bifurcated. The first step supplies id within each group based on an ORDER BY. The second step simply connects them based on the ids.

Without further ado, i present step one:

Words is the grouping since way back in the first CTE. One_Word is the newly created one-word-per-record column, Id is the ordinal (hmm... should have used Ordinal as the column name), Total is how many records there are, which eases the logic later.

WordsOne_WordIdTotal
ball blue clockball14
ball blue clockblue24
ball blue clockclock34
ball blue clock44
clock blue ballball14
clock blue ballblue24
clock blue ballclock34
clock blue ball44
computercomputer12
computer22
computer macbookcomputer13
computer macbookmacbook23
computer macbook33
radioradio12
radio22

ROW_NUMBER() gives us row numbers. OVER tells the analytical function what it is working on (over what group) PARTITION BY tells it the grouping, and ORDER BY tells it the order. It's that simple.
COUNT(*) is the function we all know and love, however OVER tells it not to act like the normal aggregate function (which would require a GROUP BY), but instead to PARTITION BY Words. Neat stuff.

So, we have split words, we know the order, guess what comes next:


We've already met Words, Id, and Total. (And we already lamented that Id should have be named Ordinal.) Concatenation is the newcomer and purpose of this (second) recursive query, slapping those words back together. The first part just pulls in the data, and starts the action by supplying the first word. (Pay no attention to the 'A' behind the curtain. It is A relic of A failed attempt at something else.) The second part just concatenated the rest. "This" refers to the query itself, "Pool" refers to the earlier query, that holds the rest of the words. The "word pool," if you will.

WordsIdTotalConcatenation
ball blue clock14ball
ball blue clock24ball blue
ball blue clock34ball blue clock
ball blue clock44ball blue clock
clock blue ball14ball
clock blue ball24ball blue
clock blue ball34ball blue clock
clock blue ball44ball blue clock
computer12computer
computer22computer
computer macbook13computer
computer macbook23computer macbook
computer macbook33computer macbook
radio12radio
radio22radio

The order as supplied by the database was different, but i supplied an ORDER BY clause to reorder it for this display. The last record is repeated due to that NULL added in Separate_The_Words. We'll remove it later when it is most convenient. However, this may be Oracle only. Oracle treat NULLs as zero-length strings, allowing them to be concatenated. Hmm.. That last record would likely be NULL in every other RDBMS (like it should be). But it still isn't an issue, as it will be removed later. Don't worry, be happy.

Well, the words have been reordered, like Chris wanted. However, we need to identify duplicates. Chris provided one method, but i couldn't resist providing another:


We have Words, the recently formed Concatenation, and, using that analytical version of COUNT() again, the duplicates. The WHERE clause makes sure we grab the final record. This is where all other records are ignored. To be perfect, and to work in other RDBMS's, that ought to be "Id = Total - 1". Oops. Um, i mean, this demonstrates why code review is a Good Thing(tm).

WordsConcatenationDuplicates
ball blue clockball blue clock2
clock blue ballball blue clock2
computercomputer1
computer macbookcomputer macbook1
radioradio1

Okay, Duplicates isn't actually duplicates. It's how many matches there are for that record's concatenation. (And, again, should have been named accordingly.) Nonetheless, in a good database, it's okay to play with matches.

With the duplicates identified, we just need to pull them out, and that is the purpose of the main query:


It has some Words for you, where there are duplicates. The ORDER BY will keep the pairs together, by ordering the column that identifies them as doubles.

Words
clock blue ball
ball blue clock


--

FWIW, the first draft of this reply was saved by my browser at 9:30. It's now 11:20. That's almost 2 hours. You better appreciate this!
 
Roel De Nijs
Sheriff
Posts: 11604
178
Hibernate jQuery Eclipse IDE Spring MySQL Database AngularJS Tomcat Server Chrome Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Brian Tkatch wrote:FWIW, the first draft of this reply was saved by my browser at 9:30. It's now 11:20. That's almost 2 hours. You better appreciate this!


That's an awesome post! And exactly what I was looking for. It definitely made me understand that Brobdingnagian query much better Although the PARTITION BY stuff is still a bit unclear, but that's probably just because I have never used it before. From your excellent explanation, I think it's very similar to GROUP BY but where GROUP BY will (normally) reduce the number of returned records, PARTITION BY doesn't. Using this data setIf I would write a query with a GROUP BY clause on description, the query would result 3 lines with the corresponding count (ELASTHANNE=3, COTTON=2 and POLYESTER=1). If I would rewrite the same query using PARTITION BY, I'll still get 6 rows and for each row I have the corresponding count. Does this make some sense?

I (and many other ranchers) definitely appreciate your effort. And here on CodeRanch we have (besides ) cow and pie to acknowledge an awesome post and show gratitude. So I awarded you with 5 cows and 2 pies. Take good care of your new cattle and enjoy the pie! And this topic might appear in the February edition of the CodeRanch journal as well.

Kind regards,
Roel
 
Brian Tkatch
Bartender
Posts: 598
26
Oracle Notepad Linux
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Roel De Nijs wrote:it's very similar to GROUP BY but where GROUP BY will (normally) reduce the number of returned records, PARTITION BY doesn't.



Yes. Except, that's a terrible way to look at it. You're example is correct, but let me give a quick explanation.

When a query includes an aggregate function, the entire query becomes an aggregate query. It must. How can you return both a single record's field and an aggregate on the same line? That would make no sense. You must return either an aggregate (for each group), or a column that is inherently similar throughout the entire group. Anything listed in the GROUP BY clause is by definition the same for the entire group (as it defines the group), so it can be returned without an aggregate. This, so far, is GROUP BY 101. If you don't know this, you don't know GROUP BY.

Nonetheless, i can cheat. I can use a subquery with an aggregate:
This would give me each Moo and the count of all similar Moos. However, this can become terribly inefficient. So, the analytical version can be written: Same thing, but more efficient. When queries get complex, analytical query really shine. Especially for reports that wants all sorts of groupings or correlations, which would be nigh impossible otherwise. And this doesn't even touch windowing functions like LEAD() or FIRST_VALUE(), but those are a bit more advanced. (And fun! )

Roel De Nijs wrote:So I awarded you with 5 cows and 2 pies. Take good care of your new cattle and enjoy the pie! And this topic might appear in the February edition of the CodeRanch journal as well.


Holy cow, Batman! Thank you. Now i can edit that earlier post with the extra "UNION ALL".
 
Rancher
Posts: 4801
50
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
You are Tom Kyte and I claim my 5 pounds!
 
Brian Tkatch
Bartender
Posts: 598
26
Oracle Notepad Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Dave Tolls wrote:You are Tom Kyte



Compared to Java experts who don't spend all day in SQL, i may look like i know something. Tom Kyte, however, is a SQL hero and an Oracle god. I've learnt quite a bit from him, but it's a drop in the bucket. And, he likely would have written the query otherwise, as a hierarchical query would probably have been more straightforward with some possible regex.

Dave Tolls wrote:and I claim my 5 pounds!



Wikipedia is indeed useful at times. Lobby_Lud
 
chris webster
Bartender
Posts: 2407
36
Scala Python Oracle Postgres Database Linux
  • Likes 2
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Brian's solution and explanation are a great example of exploiting the real power of SQL to deliver sophisticated data-processing functionality entirely within the database i.e. without shipping data back and forth or breaking into different languages/tools. This exactly the kind of thing that Tom Kyte was referring to in one of his Oracle books a few years ago, where he basically recommended the following approach to solving data problems on Oracle:

1. Do it in SQL if you can.
2. If you can't do it in SQL, do it in PL/SQL if you can.
3. If you really can't do it in SQL or PL/SQL, then try C or Java etc.

I've hardly ever needed to break out into C or Java since PL/SQL came along, and Brian shows just how much you can achieve with SQL alone if you dig into the advanced functionality provided by analytical functions.

Kudos, Brian!
 
Brian Tkatch
Bartender
Posts: 598
26
Oracle Notepad Linux
  • Likes 2
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

chris webster wrote:This exactly the kind of thing that Tom Kyte was referring to in one of his Oracle books a few years ago, where he basically recommended the following approach to solving data problems on Oracle:

1. Do it in SQL if you can.
2. If you can't do it in SQL, do it in PL/SQL if you can.
3. If you really can't do it in SQL or PL/SQL, then try C or Java etc.



I love that mantra, and there's a linkable source for it too. In 2007 Tom Oracle published On Cursors, SQL, and Analytics, where Tom writes:

  • You should do it in a single SQL statement if at all possible.
  • If you cannot do it in a single SQL statement, do it in PL/SQL.
  • If you cannot do it in PL/SQL, try a Java stored procedure.
  • If you cannot do it in Java, do it in a C external procedure.
  • If you cannot do it in a C external procedure, you might want to seriously think about why it is you need to do it.

  • The mention of Java is for a stored procedure. C is next, because the procedure is external and thus less efficient. And my favorite part, if you still can't do it, you ought to take a second look at what you're trying to do.

    Chris, you are so right to post this. As the database is where to data is stored, and SQL is such a rich, data-centric language, it seems silly to do manipulation elsewhere.
     
    Roel De Nijs
    Sheriff
    Posts: 11604
    178
    Hibernate jQuery Eclipse IDE Spring MySQL Database AngularJS Tomcat Server Chrome Java
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator

    Brian Tkatch wrote:Yes. Except, that's a terrible way to look at it. You're example is correct, but let me give a quick explanation.


    Don't be too hard on this java developer :p Thanks for the additional explanation! It really helps to understand the (basic) PARTITION BY clause much better
     
    Greenhorn
    Posts: 3
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    I am going to go with Postgres on this one:

    First, let's set up the data as you presented it:

    This provides that table that you described in the original post.

    There is a bit to do here, so let's break it down. First we should probably break down the original lists of words, so that they are more amenable to processing.

    This breaks apart the original strings with multiple words, and presents them one per line using the regexp_split_to_table function. Note that these results are sorted in a way that will make our later comparison easier.
    original_wordsinvidual_words
    ball blue clockball
    ball blue clockblue
    ball blue clockclock
    clock blue ballball
    clock blue ballblue
    clock blue ballclock
    computercomputer
    computer macbookcomputer
    computer macbookmacbook
    radioradio


    With the results being sorted, we can then use another built-in function of postgres to concatenate them back together.


    By using the string_agg function on the sorted list of words, we are able to combine those together again (grouping by the original unsorted list of words) so that we have a normalized view of those lists of words, each sorted in the same way.
    original_wordssorted_words
    ball blue clockball blue clock
    clock blue ballball blue clock
    computercomputer
    computer macbookcomputer macbook
    radioradio


    Given that representation of the data, it is a simple query to get the results you wanted. Here we can use a common table expression, with the above query, to achieve our results:
    Many database vendors provide the ability to use common table expressions (CTEs). This is the "with" clause below.


    original_wordssorted_words
    ball blue clockball blue clock
    clock blue ballball blue clock
     
    Don't get me started about those stupid light bulbs.
    reply
      Bookmark Topic Watch Topic
    • New Topic