Win a copy of Murach's Python Programming this week in the Jython/Python forum!
  • Post Reply Bookmark Topic Watch Topic
  • New Topic

Remove duplication without affecting the performance and the issues i am facing in description  RSS feed

 
Punit Jain
Ranch Hand
Posts: 1028
2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi All,

Going to be a big description for my problem, but need to explain the situation.

Below is the current situation -
------------------------------------

In our product we show search histories on the UI, means what ever user searches we make a history for that search in the database and show it on the UI.
At present we show duplicate histories also. for ex- if a user do a search with some criteria (say, abc), now if the user do search again with same criteria (abc), we show it twice on our history page.

What we need to do -
--------------------------

we need to remove the duplication, and only to show the search histories once if they have same criteria. also we need to save it only once in the database. means we neither display the same search histories nor store them in the database. ofcourse if we store it only once, it will be displayed only once. so we need logic in our storing part here.

Approaches we are trying -
---------------------------------

Approach 1 - We thought to remove duplicate search histories by comparing their criteria's, but here the problem is that single search history itself can have so many criteria's and we store around 30 search histories in the database, so if we compare criteria of all of them one by one. we will loose the performance. so we dropped this approach.

Approach 2 - The second approach we are thinking is, at the time of creation of search histories, we will generate checksum based on their criteria, so if the criteria is same for two search histories the checksum generated would be same, and before creating (storing into database) the search history we first check if the checksum is already exist in the search history table for any search history or not, if it is already there, don't create the search history, as it is already there.
The problems with this approach is -

First Problem - we will need to create one more column in the Table which will impact our upgrade process, and we don't want to have an upgrade impact.

Second Problem - if user changed the sequence of criteria of search, for e.g. name criteria first and then number criteria but the values are same. and in the another search number criteria first then name criteria and values are same as first search, here the search results would be same as the criteria are same but only the sequence is different, so in this case two histories will get created (as we generate checksum based on criteria so if sequence gets changed checksum would also be different), however we want only one search history in this situation. we don't want to consider the sequence.

PS :- we store the search histories in XML format and we convert this xml to blob to store it into database, i thought to first sort them then generate checksum, but as they are in xml format, so i can't even sort them because the tags are similar to all the criteria's.

Please let me know the other alternative which solves the problems above two approaches has?

Thanks in advance.
Punit
 
Jayesh A Lalwani
Rancher
Posts: 2762
32
Eclipse IDE Spring Tomcat Server
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Ok, so first thing to remember that, if you have lots and lots of data that you need to search for, you need to make sure you don't do any of the convert to XML and store in the BLOB business. You want search to perform well, make it easy to search. THe first rule of writing any kind of search is to make search easier. If you want to convert to XML, so you can send it to someone else, either convert to XML when you are going to send it, or add another column that stores the XML representation. Hard disk is cheap. CPU/IO is costly. Don't make CPU and IO do more work so you can save few bytes on the hard disk. Depending on how large your table is, the easiest solution might be to just store the raw search history in a column that is indexed. Just try that out first, and you may realize that you don't need no checksums

If indexing doesn;t work, then I think you are on the right path with the checksums. You just want to remember that there can be collisions in checksums. You saw some of these collisions when you reversed the search terms. Collision can happen in other cases too. You should check the search terms if the collisions match. If you are using a database query to do this, put an index on the checksum column. Really, the index does do some of this magic for you. Really if you have a hash index, it's like the database will create a checksum internally. I don;t think you will need to do this

Lastly, if you are talking about billions of records, you need to either look at using something like Solr.. or maybe some other NoSQL database. They are designed to distribute the data over several shards, and run a search like this over multiple threads/processes/machines
 
Punit Jain
Ranch Hand
Posts: 1028
2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thank you for your response Jayesh.

We convert criteria's to XML because we have import/export of searches feature in our product, that is the reason we use xml and convert it to blob to store in separate column in database.

We use Solr for indexing, but we allow user to do both the searches Index as well as DB, so if the user does not have indexing module, the search would be plain db search.

If you are using a database query to do this, put an index on the checksum column. Really, the index does do some of this magic for you. Really if you have a hash index, it's like the database will create a checksum internally. I don;t think you will need to do this.

I am not very sure about how do i do this but i am going to try this. did you mean, if i set index on the BLOB column, it will generate the checksum automatically?
or did you mean i will only create one database column and "checksum", and i won't generate checksum by my logic, and i will just put an has index on this column "checksum", and checksum will automatically gets generated in this column? if yes, then on what criteria?

Thank you for all your help.
 
Jayesh A Lalwani
Rancher
Posts: 2762
32
Eclipse IDE Spring Tomcat Server
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
1) add another column to store the search term as a string without the XML tags
2) index that column
3) in your code, search the indexed column when you search for duplicate terms. Don't search in the blob.

This is the simplest solution. You will probably see performance issues even if you have millions of rows in your search terms table, it depends on what your performance criteria and load.
 
Winston Gutkowski
Bartender
Posts: 10573
65
Eclipse IDE Hibernate Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Punit Jain wrote:We convert criteria's to XML because we have import/export of searches feature in our product, that is the reason we use xml and convert it to blob to store in separate column in database.

Given your current setup, I presume that you can be pretty sure that you currently have duplicate searches in your database, which suggests to me that even if you add a checksum field, there will be a 1:many relationship between 'checksum' and 'search'. In addition, no checksum can ever guarantee uniqueness, it can only make it likely.

Also: How do you currently key your searches? Date/time? Or some combination of "fields used"? Does the table you store them in even have a unique symbolic key (as opposed to an ID)?

It strikes me that you want to adopt a new approach, but aren't willing to make many changes to the database. If I were looking at something like this, I think I'd probably look at a new table that ties a checksum to a search (or searches). If you keyed that new table by, say checksum + XML-length, I'd say you'd have to be pretty darn unlucky to run into a collision; and if you add those columns to your existing search table, you effectively have a "foreign key".
Alternatively, if your existing table has an ID as a primary or unique key, create a new table with checksum, XML-length and searchID columns, and you wouldn't have to change your existing table at all.

Simply go through each row in your current table and calculate the checksum and XML-length for your existing searches, then add those to your new table, along with the ID of the search table row that generated them.

You could then use it as a basis for reducing your existing table to only "distinct" searches, and then possibly add some referential integrity.

HIH

Winston
 
Punit Jain
Ranch Hand
Posts: 1028
2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Ups, somehow i didn't get email update for your response Winston.
my apologies for replaying late.

Winston Gutkowski wrote:Given your current setup, I presume that you can be pretty sure that you currently have duplicate searches in your database, which suggests to me that even if you add a checksum field, there will be a 1:many relationship between 'checksum' and 'search'. In addition, no checksum can ever guarantee uniqueness, it can only make it likely.


Yes i can be pretty sure that we have duplicate searches in the database, but we store only 30 histories in the database, we remove them based on FIFO. since checksum has 2^32 possible combinations, i believe there are very less or no chances for duplication.

Winston Gutkowski wrote:Also: How do you currently key your searches? Date/time? Or some combination of "fields used"? Does the table you store them in even have a unique symbolic key (as opposed to an ID)?


we key them by combination of fields. we store a unique ID for each search history.

Winston Gutkowski wrote:It strikes me that you want to adopt a new approach, but aren't willing to make many changes to the database.


yes, actually since these changes are going with a minor release, we are avoiding db changes.

that is the reason we are mostly going with simple XML/String comparison.

 
Punit Jain
Ranch Hand
Posts: 1028
2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
and yes of course, when we would be merging this to the major release, we would be mostly be using checksum.

however i came up one doubt of using checksum after reading it's definition on Wikipedia which says -
"A checksum or hash sum is a small-size datum from an arbitrary block of digital data for the purpose of detecting errors which may have been introduced during its transmission or storage". however here our purpose is not to detecting errors in the data, our purpose is to compare the data.

Am i using checksum for different purpose however they are made for some different purpose?

Thanks,
Punit
 
Jayesh A Lalwani
Rancher
Posts: 2762
32
Eclipse IDE Spring Tomcat Server
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
No, using a checksum as an index is a good use of a checksum. You just need to make sure that your checksum algorithm minimizes collision.

Checksum is basically a quick way to determine the 2 pieces of data are not the same without comparing them byte for byte.
 
It is sorta covered in the JavaRanch Style Guide.
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!