I'm looking for suggestions: is there an ETL framework that can help us publish data from our OLTP database to our OLAP database?
We have an application that deals with a lot of data. There is an OLTP database, which stores the data that the UI uses, and an OLAP database, where the same data is stored for reporting. The user uses the UI to load data into the OLTP database. After the data is loaded, the user kicks off a "run". The run is executed in the background on the grid. It's a typical MapReduce application. As part of the background execution, the master first has to "publish" all the input data into OLAP, and then batch the work out to the workers. The workers execute the processing and push the results directly to OLAP. Once the workers finish, the master does some aggregation as part of the reduce function and puts the aggregated data into OLAP.
The challenge is that we are dealing with lots of data, and to publish input data from OLTP to OLAP we do straight Insert...Selects from the OLTP database into the OLAP database. The OLTP schema and the OLAP schema are always on the same database server. Essentially, our entire ETL process is a bunch of Insert...Select statements that go across 2 schemas. We don't want to do the traditional "extract-transform-load" process, because extracting such large amounts of data to somewhere else would cause a lot of IO bottlenecks. As far as possible, we would like the entire ETL process to execute on the database server itself. From a performance POV, we are very happy with the Insert...Select approach.
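To make the current approach concrete, here is a minimal sketch of a publish step as a single Insert...Select across two schemas on the same server. It uses Python with SQLite only so that it runs anywhere (the real system is Oracle), and every name in it (`loan`, `loan_fact`, `publish_run`, the columns) is made up for illustration:

```python
import sqlite3

# Assumption: two attached in-memory SQLite databases stand in for the
# OLTP schema ("main") and the OLAP schema ("olap") on one database server.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS olap")

# Hypothetical tables; the real data dictionary is not shown in this post.
conn.execute("CREATE TABLE main.loan (loan_id INTEGER, run_id INTEGER, balance REAL)")
conn.execute("CREATE TABLE olap.loan_fact (loan_key INTEGER, run_key INTEGER, balance REAL)")
conn.executemany(
    "INSERT INTO main.loan VALUES (?, ?, ?)",
    [(1, 10, 100.0), (2, 10, 250.0), (3, 11, 75.0)],
)


def publish_run(connection, run_id):
    """Copy one run's loans from OLTP to OLAP without the rows leaving the server."""
    connection.execute(
        "INSERT INTO olap.loan_fact (loan_key, run_key, balance) "
        "SELECT loan_id, run_id, balance FROM main.loan WHERE run_id = ?",
        (run_id,),
    )
    connection.commit()


publish_run(conn, 10)
published = conn.execute("SELECT COUNT(*) FROM olap.loan_fact").fetchone()[0]
```

The point of the sketch is only that the data moves inside the database engine; the client sends one statement and never sees the rows.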
The problem is that this is becoming very unmaintainable.
1) Tables in each schema have relationships between them (for example, a run is associated with a portfolio, which has a bunch of loans). Then there is an inherent relationship between the tables on OLAP and OLTP. OLTP has a Run table; OLAP has a Run dimension. OLTP has Loan; OLAP has a Loan fact and a bunch of dimension tables, and so on and so forth. Everything is mirrored between OLTP and OLAP. So anyone writing those publishing queries has to be familiar with the data dictionary on both sides, and also with the relationships between the tables. It's getting kind of hard to track. Usually when we find a bug, it's because the developer forgot to add a join clause, or because we changed the schema and forgot to update a join clause in some queries.
2) As we evolve, we keep coming up with best practices, and it's getting kind of hard to change all our queries to follow them. For example, we partitioned the tables by the key that 95% of our queries use. We found that in Oracle we reduce the load a lot if we query by that key. For example
is 3 times cheaper than
simply because the loan table is partitioned by runid. In a non-partitioned table, there is no cost difference. The first version, however, requires us to make 2 database calls, which increases the amount of Java code.
Now, we have a gazillion queries that we implemented before the table was partitioned. Since both kinds of query were equally costly back then, and the second one requires less code, we used the second one. So it is going to be a huge effort to change all of those queries to the first version.
So, I was wondering if there is an ETL framework that can generate Insert...Select queries that go across databases. I want to put the relationships between tables in a configuration file, and just tell the ETL framework that I want to put data from tables A, B and C in OLTP into tables X, Y and Z in OLAP. It should be able to figure out from the configuration file how to construct the queries. I also want to be able to say whether, when relating 2 tables, it should use 2 queries or one query with a join.
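To illustrate the kind of thing I mean, here is a rough sketch (in Python for brevity) of a generator that builds an Insert...Select from a declarative mapping and lets a switch pick between filtering directly on the partition key and joining to a parent table. This is not an existing framework; `Mapping`, `build_insert_select`, the two strategy names, and all table and column names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Declarative description of one publish step (all names are made up)."""
    source: str          # OLTP table
    target: str          # OLAP table
    columns: dict        # target column -> source column
    key: str             # column the source table is partitioned by
    parent: str          # OLTP table related to the source through `key`
    parent_filter: str   # column on the parent that the caller filters by


def build_insert_select(m, strategy):
    """Return one INSERT ... SELECT statement with a named bind placeholder.

    "direct": filter on the source table's own key; the caller is expected to
              look the key value up first with a separate query.
    "join":   join to the parent table and filter on the parent instead.
    """
    targets = ", ".join(m.columns)
    sources = ", ".join(f"s.{col}" for col in m.columns.values())
    base = f"INSERT INTO {m.target} ({targets}) SELECT {sources} FROM {m.source} s"
    if strategy == "direct":
        return f"{base} WHERE s.{m.key} = :{m.key}"
    if strategy == "join":
        return (
            f"{base} JOIN {m.parent} p ON p.{m.key} = s.{m.key} "
            f"WHERE p.{m.parent_filter} = :{m.parent_filter}"
        )
    raise ValueError(f"unknown strategy: {strategy}")


# In practice this would be loaded from the configuration file.
LOAN = Mapping(
    source="oltp.loan",
    target="olap.loan_fact",
    columns={"loan_key": "loan_id", "run_key": "run_id", "balance": "balance"},
    key="run_id",
    parent="oltp.run",
    parent_filter="run_name",
)

direct_sql = build_insert_select(LOAN, "direct")
join_sql = build_insert_select(LOAN, "join")
```

With something like this, a schema change or a new best practice would mean editing the mapping or the generator once, instead of hunting through every hand-written query.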
I guess the most concise way to describe it is that I'm looking for something like Hibernate, but one that works across database schemas. :lol: