
Architecture support for about 1.5-5 crore transactions

 
Ravi Kommuri
Greenhorn
Posts: 18
Hi,

Today my VP suddenly called me and a colleague and assigned us a challenging task:

1. When a user clicks a button in the UI,
2. a process or job should start and create about 1.5 crore (15 million) invoice transactions; in future this count could reach 5 crore (50 million) invoices.

This is the first time I am working on such huge data processing. We started researching/googling possible approaches.

So, can you please help with the best option to process such a huge volume of data?

For information: multiple requests of this kind can be triggered by multiple users from the front-end application.
 
Jayesh A Lalwani
Rancher
Posts: 2762
When you say 1.5 crore "Transaction", do you mean 1.5 crore records inserted? or updated? or is it a bunch of operations that involve transformation of data? What exactly do you do in a transaction? Are all your transactions in the database? or can you bring things into memory, compute the results in memory and then insert the results in database? Your VP probably doesn't mean database transaction. He probably means a business transaction. First thing you need to find out is what exactly is in the transaction.

The reason why this is important is that if each "transaction" is independent of the others, then you can run them in parallel. Let's say 1 "transaction" takes 10ms; then doing 1.5 crore of them on a single thread will take 15 x 10^6 x 10 x 10^-3 = 150,000 seconds ≈ 41.7 hours, i.e. more than 1.7 days. I'm pretty sure your user is not going to sit there for 2 days waiting for 1.5 crore "transactions" to get over.

So, the only way you can reduce this is by running it in parallel. If you can run it on 250 parallel threads, you can get this down to 10 minutes. Of course, if you run the transactions on 250 threads and do everything in the database, then your database will have to be scaled up to handle this massive load.
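As a rough sketch of that idea in Java (everything here is illustrative only: the pool size, the counts, and processInvoice are placeholders, and a real job would bound the work queue rather than submit 1.5 crore tasks up front):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelInvoiceRunner {

    public static void main(String[] args) throws InterruptedException {
        // 250 workers, per the back-of-the-envelope numbers above
        ExecutorService pool = Executors.newFixedThreadPool(250);

        for (long i = 1; i <= 15_000_000L; i++) {          // 1.5 crore independent "transactions"
            final long invoiceNo = i;
            pool.submit(() -> processInvoice(invoiceNo));
        }

        pool.shutdown();                                    // accept no new tasks
        pool.awaitTermination(2, TimeUnit.DAYS);            // wait for the queue to drain
    }

    private static void processInvoice(long invoiceNo) {
        // placeholder: read master data, calculate, insert into the database
    }
}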


Most applications that do this kind of massive processing use a framework that provides them the ability to run a Map Reduce job in parallel on many machines. There are many available now:- Apache Hadoop, Apache Spark, Apache Storm. You might want to look at these frameworks first.
 
Joanne Neal
Rancher
Posts: 3742
Don't start the same discussion in two different threads.
 
Ravi Kommuri
Greenhorn
Posts: 18
Thank you Jayesh A Lalwani.

All 1.5 crore transactions are individual. A single transaction does some calculation on data (which is taken from other master tables and transformed), inserts one or more records into 4-5 tables, and there will be some updates too.
For each transaction a unique, configurable id will be generated, and none should be missed: say one transaction id is abc1, then the next one should be abc2 ... abcxxx. I guess even with parallel processing this is not a problem.
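(As an aside: within a single JVM, a gap-free sequential id like abc1, abc2, ... could be handed out safely to many threads along the lines of the sketch below; across several machines a shared source such as a database sequence would be needed. The class name and prefix are only illustrative.)

import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch: thread-safe, gap-free sequential ids like abc1, abc2, abc3 ...
public class TransactionIdGenerator {

    private final String prefix;
    private final AtomicLong counter;

    public TransactionIdGenerator(String prefix, long startAfter) {
        this.prefix = prefix;
        this.counter = new AtomicLong(startAfter);
    }

    public String nextId() {
        return prefix + counter.incrementAndGet();   // safe to call from many threads
    }
}

// e.g. new TransactionIdGenerator("abc", 0).nextId() returns "abc1"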


Thanks again for suggesting some frameworks. We will first go through Apache Hadoop so we can get some information on how to start.
 
Ravi Kommuri
Greenhorn
Posts: 18
Apologies, Joanne Neal.

Initially I opened the discussion in the Java-related subgroup, but actually my concern is related to the database; I thought each subgroup is watched only by specific experts, so I posted again in the Oracle-related subgroup.

Thank you .
 
Ulf Dittmer
Rancher
Posts: 42970
I'd be curious to know what the timing expectation/requirements are. If the expectation is that it gets done in an hour, you may need some serious hardware for this. If it's OK to be done in 24 or 48 hours, maybe not, and maybe not even special software.
 
chris webster
Bartender
Posts: 2407
Ravi Kommuri wrote: Thank you Jayesh A Lalwani.

All 1.5 crore transactions are individual. A single transaction does some calculation on data (which is taken from other master tables and transformed), inserts one or more records into 4-5 tables, and there will be some updates too.
For each transaction a unique, configurable id will be generated, and none should be missed: say one transaction id is abc1, then the next one should be abc2 ... abcxxx. I guess even with parallel processing this is not a problem.

Thanks again for suggesting some frameworks. We will first go through Apache Hadoop so we can get some information on how to start.

Updates are problematic in Hadoop, because they require you to search for an arbitrary record somewhere in a huge data store, then do something to it before writing it back (to the same place in storage?). Some Hadoop-based tools like HBase will allow you to execute update operations, although I'm not sure how they're implemented underneath. Other tools like Hive are basically intended to be write-once i.e. no updates on existing records. You might want to think carefully about your processing and data requirements in order to choose the right tool here.

NoSQL databases might be another option, as these mostly allow you to scale out fairly easily, but you'd need to look carefully at your transactional requirements. For example, if you are doing multiple inserts/updates as part of a single transaction, then you need to make sure those transactions are handled consistently by your chosen database. Also think about how to partition and index your data, and whether the database supports this.

Or you could look at a powerful relational database, which will provide full ACID transactions but may struggle to scale up to your anticipated workload (unless you spend a lot of money). But it's not clear what your workload really is. 1.5 crore = 15 million transactions (right?), but in what period? How much data do you expect to be keeping (and searching through for updates) in your system? If it's 15 million transactions a year, then a conventional database would probably be fine.

In terms of concurrent processing, you'll need to look at the appropriate programming tools/frameworks for your needs. You might want to think about using a higher-level language like Scala with Akka for concurrency, rather than coding a lot of low-level thread coordination yourselves. But it will depend on your requirements and the skills you have available.
 
Ravi Kommuri
Greenhorn
Posts: 18
Ulf Dittmer,

The customer's expectation is to complete the whole billing process in 24 to 36 hours, not more than 48 hours.
 
Jayesh A Lalwani
Rancher
Posts: 2762
Yes, updates will be problematic no matter what technology you use, because when you say "updates", there is a possibility that 2 transactions will need to update the same record at the same time, which means they will lock the record when they update it. Once you start talking about locks, you are talking about adding bottlenecks to the system. You can use the biggest, most powerful database to reduce update contention, but you don't eliminate the contention. When you talk about Big Data, you need to design your processing to eliminate contention, which includes update contention.

For example, let's say you were designing a process that ingests the addresses of 150 million people. Let's say, on average, each person has 10 addresses, which means you are looking at 1.5 billion records being ingested. Now let's say, in your results, you wanted to keep a count of how many addresses each person has. So, you have a transaction that takes an address from the input, does some processing on it, inserts an address record in your database, and then increments a count on the person record. You spread this processing over 20 computers. Each computer takes one record and performs the transaction. Now, if 2 computers happen to have 1 record each from the same person, both will try to increment the count. One of them will lock the Person record. The other will wait until the first one commits. So, your second computer is sitting there doing nothing but waiting. Boo hoo! Contention. You have wasted precious time that could have been spent processing another record.

One strategy could be: distribute the load in such a manner that no 2 parallel processes will ever update the same record. So, in the example above, you could distribute the records in such a way that all the records from the same person always go to the same computer. Then no 2 computers will update the same record. No contention. The person here is your load distribution key. This works when your load distribution key distributes the load evenly; if one person happens to have 20 million addresses, the computer that is processing that person will lag behind the rest.
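A tiny sketch of that routing idea (the queue list and record type are hypothetical; the point is only that the same person id always maps to the same worker):

import java.util.List;
import java.util.Queue;

public class LoadRouter {

    // Every record for the same person goes to the same worker,
    // so no two workers ever touch the same Person row.
    static void route(String personId, Object addressRecord,
                      List<Queue<Object>> workerQueues) {
        int worker = Math.floorMod(personId.hashCode(), workerQueues.size());
        workerQueues.get(worker).add(addressRecord);
    }
}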

Another strategy is: don't update the database on every update, do it at the end. Each computer can keep a count in memory of how many addresses it saw per person. After all the computers are done, they send these counts back to a master. The master takes all the counts, adds them up, and then updates the records in the Person table. This is what is called Map-Reduce. This works when your updates are much smaller than your processing, and when the results can be merged together at the end.
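In code, that second strategy might look roughly like this (a minimal in-memory sketch; the class and method names are made up for illustration):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AddressCountMapReduce {

    // "Map" side: one worker tallies addresses per person locally, no database updates.
    static Map<String, Long> countLocally(List<String> personIdsSeen) {
        Map<String, Long> counts = new HashMap<>();
        for (String personId : personIdsSeen) {
            counts.merge(personId, 1L, Long::sum);
        }
        return counts;
    }

    // "Reduce" side: the master merges all partial tallies,
    // then issues one update per person from the merged result.
    static Map<String, Long> mergeCounts(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((person, count) -> totals.merge(person, count, Long::sum));
        }
        return totals;
    }
}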
 
Ravi Kommuri
Greenhorn
Posts: 18
chris webster,

Thanks for writing.

To give the big picture of our exact requirement: it is simply postpaid billing, like Airtel or any telecom operator generating our postpaid bill at the end of every bill cycle.

There are 15 million orders per month from one region, say Mumbai. Once bills have been generated for end customers by the system from the processed orders, we archive the old orders every 2-3 months, keep them for 2 years, and after that delete them completely, because our customer order life cycle is very short, at most 15-20 days.

Currently, on average, our customer receives 4 lakh (400,000) orders each day from a region. As these are contracted customers they service, they bill their customers on a monthly basis.

And coming to the updates:
 
Ravi Kommuri
Greenhorn
Posts: 18
In more detail:

Let's say there are 15 million orders, belonging to about 20 thousand different end customers.

So for each customer we need to generate one bill based on that customer's orders (i.e. about 20 thousand bills).

For each order, the applicable charges will already have been calculated and will be sitting in an XYZ_charges table. From the master tables we need to get the tax information and any discounts as per the contracts, process each order, sum up all the orders of an individual customer, and insert a row into the invoice_header table; the charges and taxes need to be inserted into the invoice_details table. Once an invoice is ready, we then need to move it to an external finance application (like Coda/Navision).

Actually, I am thinking of dividing the orders based on customer and moving one customer's orders to one machine, so that conflicts will not arise.
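(To make that concrete, grouping the orders by customer before handing them out could look something like this in plain Java; the Order type and its fields are hypothetical.)

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class OrderPartitioner {

    // Hypothetical order type: just enough fields to group on.
    static class Order {
        final String customerId;
        final long orderId;
        Order(String customerId, long orderId) {
            this.customerId = customerId;
            this.orderId = orderId;
        }
    }

    // All orders of one customer land in the same bucket, which can then be
    // handed to a single machine/worker to build that customer's invoice.
    static Map<String, List<Order>> groupByCustomer(List<Order> orders) {
        return orders.stream()
                     .collect(Collectors.groupingBy(o -> o.customerId));
    }
}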
 
Jayesh A Lalwani
Rancher
Posts: 2762
Yes, it makes sense to divide the execution by customer. Your use case is a perfect fit for a Map Reduce application, and Apache Hadoop is a good fit for it. Actually, you might want to look at Apache Spark. This page has some videos on it. Apache Spark runs on top of Hadoop and does everything in memory, which makes it easier to write a map reduce application.
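To give a feel for it, a heavily simplified sketch using the Spark SQL DataFrame API in Java might read the pre-calculated charges, group them by customer, and write invoice headers back. The JDBC URL, credentials, table and column names below are all made up, and the tax/discount joins are left out:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.Properties;

import static org.apache.spark.sql.functions.sum;

public class InvoiceGenerationJob {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("InvoiceGeneration")
                .getOrCreate();

        String url = "jdbc:oracle:thin:@//dbhost:1521/BILLING";   // hypothetical connection
        Properties props = new Properties();
        props.put("user", "app_user");                            // hypothetical credentials
        props.put("password", "secret");

        // Read the pre-calculated charges, partitioned by a numeric customer id
        // so one customer's orders tend to land on the same executor.
        Dataset<Row> charges = spark.read()
                .jdbc(url, "XYZ_CHARGES", "CUSTOMER_ID", 1L, 20000L, 50, props);

        // One invoice header per customer: the sum of that customer's order charges.
        // (Tax and discount lookups from the master tables would be joined in here.)
        Dataset<Row> invoiceHeaders = charges.groupBy("CUSTOMER_ID")
                .agg(sum("CHARGE_AMOUNT").alias("TOTAL_AMOUNT"));

        invoiceHeaders.write().mode("append").jdbc(url, "INVOICE_HEADER", props);

        spark.stop();
    }
}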
 
Ravi Kommuri
Greenhorn
Posts: 18
Thank you, Jayesh A Lalwani.

Can you share any URL where I can find an example program which gets data from Oracle/any database to the file system, processes it, and re-inserts the data back into the database in other tables, using Apache Spark or Hadoop?

 