
Should I Use Solr?

 
Pratik Thaker
Greenhorn
Posts: 5
Hi,

I am using Oracle 11gR2, and our schema has a few tables with more than 100 million rows each (some columns are VARCHAR2(100)). We frequently have to run LIKE-based searches on those tables, and sometimes join them as well. Other applications also insert into and update these tables very frequently (around 1000 inserts / updates per second).

So my question is: for my user interface, should I use Apache Solr to let users search these tables instead of SQL queries? I have tried SQL, and it is really slow given the amount of data in my database.

My requirements are:

Results should come back fast, and they should be accurate.
They should reflect the latest data.
Can you suggest whether I should go with Apache Solr, or with some other solution to my problem?

Regards,
Pratik Thaker
 
Karthik Shiraly
Bartender
Posts: 1210
[First of all, sorry for this long reply - it ended up much longer than I wanted it to be. If you want the TL;DR version, scroll down to the 3 options below "Phew" before going for Solr]

In my opinion, you should consider Solr only if it has features that you need which are simply not available in Oracle, and which are too complex or effort-consuming to implement in the application layer.

Solr has some nice features that are very useful for certain projects, such as:
  • stemming, for language and grammar aware searching
  • faceting, for supporting filtered search interfaces
  • searching documents like PDFs and office formats
  • distributed and high availability searches using sharding and replication


On the other hand, it is a standalone, specialized server, which means the application layer is responsible for smoothing out all the integration headaches with the master database. Issues will crop up, such as:
  • Justifying data consistency requirements:
    I've found that a lot of software engineers assume and overestimate the need for strong consistency among users.
    Most users rarely notice and often don't really care if there are some results that shouldn't be shown.
    But for logical people like developers, strong consistency seems like a neat and clean thing, while eventual consistency looks sloppy.
    You'll have to think and discuss which approach a majority of your users will find acceptable.
    Strong consistency is usually a pain to implement and often not worth the effort, especially when you need it between multiple data stores, like an Oracle DB and a Solr index.
    On the other hand, if you sacrifice strong consistency, there will always be a couple of fault reports from users who complain that a search result could not be opened, and they will have to be tackled.

    So ask yourself, how critical is strong consistency?
    If it's absolutely critical (think Air Traffic Control system), you'll have to re-architect most of your application's DAOs and business logic to implement transactional consistency between DB and Solr.
    If not so critical, Solr integration is orders of magnitude easier, because you can configure Solr to autonomously pull data from the DB.


  • Strong consistency usually requires a central broker component for routing all database operations:
    You mention "1000 insert / updates per second by other applications". If multiple applications are autonomously updating your tables, then you'll have a tough time keeping the Solr index strongly consistent with DB. Does each application update the Solr index? Multiple applications updating a single Solr index will itself result in eventual consistency issues between the applications, because Solr keeps delta changes between index updates and reconciles them once in a while.
    So this again brings up the same question of strong or eventual consistency.


  • Data transfer strategy:
    Deciding whether the application should push database updates to Solr, or Solr should pull them in. Related to the above point on consistency.
    If you want strong, transaction-level consistency between the database and Solr, you will have to re-architect your application to push the data, since pulling it into Solr via the DIH (Data Import Handler) is always asynchronous, which affords you only eventual consistency. This will manifest as search results which, when clicked, give "database record not found" errors. Is that acceptable to your users?


  • Network latencies
    Since all commands and queries from the application to Solr go as HTTP requests over the network, you are introducing additional network latency here. Keep that in mind, especially because your problem seems to be more about performance than missing features.


  • Keeping DB schema and Solr schema in lock step
    Solr has its own schema, described in XML configuration files. Whenever you change DB schema, you'll have to update the Solr schema too and then reindex all affected records.


  • Handling JOIN tables
    This is an interesting problem. Since JOINs are dynamically created views over multiple tables but Solr schemas are static configurations, you will likely have to create a Solr schema for every JOIN query you write and index the JOIN results beforehand.
    Generally, I don't follow this approach. I implement search-result joining in the application layer, by combining two different Solr search results into one single result set. But then again, all those applications were being written from scratch, so I had the opportunity to pre-plan. Changing an existing system this way would be prohibitively effort-consuming, I suspect.
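A minimal sketch of that application-layer joining, assuming each search query returns its hits as plain field maps sharing a key field (all field names and data below are invented for illustration):

```java
import java.util.*;
import java.util.stream.*;

public class ResultSetJoin {

    /** Inner-joins two search result lists on a shared key field:
     *  only hits whose key appears in both lists are kept. */
    static List<Map<String, Object>> innerJoin(
            List<Map<String, Object>> left,
            List<Map<String, Object>> right,
            String keyField) {

        // Index the right-hand results by key for O(1) lookups.
        Map<Object, Map<String, Object>> rightByKey = right.stream()
                .collect(Collectors.toMap(m -> m.get(keyField), m -> m));

        List<Map<String, Object>> joined = new ArrayList<>();
        for (Map<String, Object> l : left) {
            Map<String, Object> r = rightByKey.get(l.get(keyField));
            if (r != null) {
                Map<String, Object> merged = new LinkedHashMap<>(l);
                merged.putAll(r); // right-hand fields win on name collisions
                joined.add(merged);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> orders = List.of(
                Map.of("custId", 1, "orderTotal", 250),
                Map.of("custId", 2, "orderTotal", 90));
        List<Map<String, Object>> customers = List.of(
                Map.of("custId", 1, "name", "Acme Corp"));

        // Only custId 1 exists in both result sets.
        System.out.println(innerJoin(orders, customers, "custId"));
    }
}
```

A hash join like this is linear in the sizes of the two result pages, which is usually fine because you only ever join the page of hits you're about to display, never whole tables.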


  • Phew! So what next?

    In my opinion, it's better to first explore and exhaust ways to use the system of record (Oracle db in your case) itself to solve your problem, before introducing a secondary specialized data store like Solr and all the additional integration problems it brings with it.
    Some possibilities I can think of:

    1. Basic check: Are all your searchable columns indexed? Use explain plans to find the bottlenecks. Invariably, a lot of performance can be squeezed out of the DB this way.

    2. Use Oracle's full-text search capabilities: Are you aware of Oracle Text? It supports features like stemming, and because the text index lives inside the database, it sidesteps the cross-store consistency problems described above.

    3. Use replication to separate your transactional workload from querying workload:
    This is a common technique used in data warehousing (OLTP vs OLAP). Basically, you replicate your database.
    One replica is used as the primary system of record, and receives all your live inserts, updates and deletes.
    The other replica has additional views where tables are joined, and text or SQL queries are executed.
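To make the "database record not found" symptom mentioned earlier concrete, here is a toy sketch, with plain in-memory collections standing in for the Oracle tables and the search index, of what a user sees when the index lags a delete:

```java
import java.util.*;

public class StaleIndexDemo {

    public static void main(String[] args) {
        // The system of record: id -> row.
        Map<Integer, String> db = new HashMap<>();
        // A separate search index, refreshed asynchronously.
        Set<Integer> searchIndex = new HashSet<>();

        db.put(42, "some document");
        searchIndex.add(42);   // indexed during the last refresh

        db.remove(42);         // deleted from the DB...
        // ...but the next index refresh hasn't run yet.

        for (int hit : searchIndex) {
            String row = db.get(hit);
            // The classic eventual-consistency symptom:
            System.out.println(row == null
                    ? "hit " + hit + ": database record not found"
                    : "hit " + hit + ": " + row);
        }
        // prints: hit 42: database record not found
    }
}
```

Whether that stale hit is a shrug or a showstopper is exactly the stakeholder question raised above.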


    Some additional thoughts on your requirements:
  • 100+ million rows
    Not a problem by itself for Solr, neither for indexing nor querying. JOINs may be the bigger problem.


  • 1000 insert / updates per second by other applications

  • I've already described above the eventual-consistency issues when a single Solr index is updated by multiple autonomous applications. Alternatively, you'll need a central broker to do it, but that means re-architecture.

  • Results should come faster

  • Explore possibilities of doing this with Oracle itself, before deciding on Solr.

  • Results should be accurate

  • Accuracy is pretty subjective. Solr supports things like stemming and wildcard searches, but whether they are judged as more or less accurate compared to DB searches depends on your users. Use A/B user testing to decide that.

  • Results should have the latest data (ie, strong consistency with database)

  • This is a minefield, for reasons described above. First decide with all stakeholders whether you want to enter this minefield at all, before searching for a safe route through it. A manager / developer / tester will almost always say "hurrah for strong consistency", but your users may not even care.

     
    Pratik Thaker
    Greenhorn
    Posts: 5
    Thanks Karthik for the reply.

    So let me answer why I can't use Oracle for searching.
    1. Since there are a lot of inserts / updates, indexing the columns will slow them down. Inserts / updates have higher priority in my case.
    2. I am not sure Oracle Text search will provide faster results than any other tool. Also, will it allow joins and LIKE- and range-based searches?
    3. I am not sure what the infrastructure cost of replicating the current database would be. Right now the database VM has more than 5 TB of disk space (more than 3.5 TB is already occupied by tables), 98 GB RAM and 8 CPUs.

    Should I also consider evaluating Elasticsearch, Cassandra, etc.?
     
    Karthik Shiraly
    Bartender
    Posts: 1210
    1. Since there are a lot of inserts / updates, indexing the columns will slow them down. Inserts / updates have higher priority in my case.

    This is the simplest possible solution and I think worth pursuing. It's a good idea to experiment and quantify that slowing down before dismissing it altogether.
    As I understand it, only some of your tables have varchar2(100) fields. So assuming that all or most inserts/updates will slow down is probably an exaggeration.
    Perhaps you can create a smaller test db with a million records from the production db and check from a user experience point of view if it's conspicuously slow.
    The reason to check from a user experience pov - and not just comparing DB query times - is that no matter what solution you go for, your strong consistency constraint means an insert/update will always be in a distributed transaction and will become slower than it is now.
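Only measurements on your own schema can settle this, but as a rough analogy for the experiment you'd run, here is a toy comparison of appending rows with and without maintaining an ordered index structure alongside (a TreeMap standing in for a B-tree index; the absolute numbers say nothing about Oracle, the shape of the test is the point):

```java
import java.util.*;

public class IndexCostSketch {

    /** Appends n rows to a table, optionally maintaining an
     *  ordered index alongside; returns elapsed nanoseconds. */
    static long timeInserts(boolean maintainIndex, int n) {
        List<String> table = new ArrayList<>();
        TreeMap<String, Integer> index = new TreeMap<>(); // B-tree stand-in
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            String row = "row-" + i;
            table.add(row);
            if (maintainIndex) {
                index.put(row, i); // extra ordered-structure work per insert
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        timeInserts(true, n); // warm-up
        long plain   = timeInserts(false, n);
        long indexed = timeInserts(true, n);
        System.out.printf("plain: %d ms, with index: %d ms%n",
                plain / 1_000_000, indexed / 1_000_000);
    }
}
```

On the real system, the equivalent experiment is loading a copy of one hot table with and without the candidate index, at roughly your 1000 inserts / updates per second, and watching both insert latency and query times.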

    2. I am not sure Oracle Text search will provide faster results than any other tool. Also, will it allow joins and LIKE- and range-based searches?

    The best way to be sure is to measure it against your current query times on a smaller test DB. They have good documentation on tuning for response times.
    As for capabilities, these are basically SQL clauses which are included in SELECT queries like any other, including for joined tables. They have their own syntax but are functionally similar to LIKE queries. I suggest you read the text querying chapter.

    3. I am not sure what the infrastructure cost of replicating the current database would be. Right now the database VM has more than 5 TB of disk space (more than 3.5 TB is already occupied by tables), 98 GB RAM and 8 CPUs.

    Do you need all 3.5 TB of data for those search queries? Perhaps you can explore Oracle materialized views to replicate only the joined data and the subset of columns you actually query.

    Should I also consider evaluating Elasticsearch, Cassandra, etc.?

    As I see it, the strong consistency requirement is the one which limits your options. If weak consistency is fine, then all these including Solr are potential solutions worth evaluating.
    Some problems - like redesigning the application layer - are common to all these approaches, mainly because you are integrating two different and diverse data stores.
    I'm not familiar with Elasticsearch, but like Solr, it uses the Lucene library for all its core indexing and searching capabilities, so functionally it can do most things that Solr can do. I do know its configuration workflow is different from Solr's. Since you now know the kinds of problems you might face with Solr, I suggest you ask specific questions in the Elasticsearch forum or this forum, so that people who are familiar with it can give you better information than I can.
    As for Cassandra, I'm not aware of its text search capabilities or performance.
     