
File Processing Queries

 
Vaibhav Gargs
Ranch Hand
Posts: 113
Hi All,

We have the following requirements:

1. We will receive files at a common path from third-party systems
2. We will read the files placed at that common path
3. Then we will perform some processing logic on the data
4. Finally, we will dump the results into the database

Now, the main problem area is performance and scalability. The file sizes will vary from 10 GB to 60 GB, and each file can contain millions of transactions.

So, my queries are:

1. What is the optimal approach to designing the solution from a performance & scalability perspective?
2. We need to process the files in the minimum possible time, and the file sizes are expected to double in the next 4-5 years, so the application should keep up without any performance issues.

Tech stack: Java 7, Spring, Oracle DB 11g, IBM WebSphere Application Server

Please share your experiences/thoughts on the best way to achieve this. Thanks in advance.

-VGarg
 
salvin francis
Saloon Keeper
Posts: 1644
The implementation can be a simple daemon thread running continuously in the background checking for a fresh file.
When a file is found, it can trigger an event in your code to process the file.

Assuming the files have no relation to each other, a separate thread can be spawned to process each file. You can use a thread pool to reuse the threads. Your routine for reading the file will depend on what constitutes a single transaction: one line per transaction, or some kind of separator between transactions. A file's read speed will depend on how fast the disk or solid-state device supports reading.
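That design can be sketched roughly as below, in Java 7 style to match the stated stack. This is only a sketch: the polling interval, pool size, and the `processFile` body are placeholders, not anything from the original posts.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FilePoller {
    private final Set<Path> seen = new HashSet<Path>();

    /** Returns regular files in dir that have not been seen in a previous scan. */
    public List<Path> scanNewFiles(Path dir) throws IOException {
        List<Path> fresh = new ArrayList<Path>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                if (Files.isRegularFile(p) && seen.add(p)) {
                    fresh.add(p);
                }
            }
        }
        return fresh;
    }

    /** Daemon loop: hand each fresh file to a pooled worker thread. */
    public void pollForever(final Path dir, long intervalMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        while (true) {
            for (final Path file : scanNewFiles(dir)) {
                pool.submit(new Runnable() {
                    @Override public void run() {
                        processFile(file);
                    }
                });
            }
            Thread.sleep(intervalMillis);
        }
    }

    static void processFile(Path file) {
        // Placeholder: parsing, business logic and DB writes go here.
        // A real poller should also confirm the 3rd party has finished
        // writing the file (e.g. a rename or ".done" marker) before reading it.
        System.out.println("Processing " + file);
    }
}
```

Java 7's `WatchService` would be an event-driven alternative to the polling loop.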
 
salvin francis
Saloon Keeper
On a side note, if this system is being designed right now, don't you think it's better to use the latest Java version? Java 7 was released in 2011 and its last public update was in 2015.
 
Vaibhav Gargs
Ranch Hand
salvin francis wrote:The implementation can be a simple daemon thread running continuously in the background checking for a fresh file.
When a file is found, it can trigger an event in your code to process the file.


Yes Salvin, I have created a poller which keeps polling the common directory for new files; once a file is found, it triggers an event.

salvin francis wrote:
Assuming that the files have no relation with each other, a separate thread can be spawned to process each file. You can use a thread pool to reuse the threads. Your routine to read the file can depend on what constitutes a single transaction, whether it's one line per transaction or some kind of separator that separates one transaction from another. A file's read speed will depend on how fast the IO supports reading from the disk or solid state device.


Each file has a header record and a trailer record, and in between there can be any number of transactions. So I am not sure how we can ensure that we read a complete record from the buffer and not a partial one. The file sizes are currently around 40-50 GB and are expected to double in future.
 
Vaibhav Gargs
Ranch Hand
salvin francis wrote:On a side note, if this system is being designed right now, don't you think it's better to use the latest Java version? Java 7 was released in 2011 and its last public update was in 2015.


Unfortunately, we don't have the liberty to upgrade Java. All the systems are running on JRE 7, so we are bound to use that.
BTW, do we have some better feature in JDK 8 for such problems?
 
salvin francis
Saloon Keeper
Vaibhav Gargs wrote:BTW, do we have some better feature in JDK8 for such problems?

Yes, you can read about it here: https://docs.oracle.com/javase/tutorial/essential/io/fileio.html
Specifically, you can look at: https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#newBufferedReader-java.nio.file.Path-java.nio.charset.Charset-
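For the record, `Files.newBufferedReader(Path, Charset)` streams the file one line at a time, so heap usage stays flat regardless of file size. A minimal sketch; the line counting here is just a stand-in for real per-transaction processing:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LineReaderDemo {
    /** Counts lines without ever holding more than one line in memory. */
    public static long countLines(Path file) throws IOException {
        long count = 0;
        try (BufferedReader reader =
                     Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }
}
```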
 
salvin francis
Saloon Keeper
Vaibhav Gargs wrote: ...Each record has a header & trailer record and in between, there can be N number of txns. So, not sure how can we solve this problem that we will read complete record using buffer and not an incomplete one....


So, if I understand correctly, you can read the file sequentially line by line, and when you encounter a specific set of characters, the lines read so far can be converted into a Record object and processed. That's not too difficult, right?
The file size does not matter; you are just reading one line at a time.
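That sequential approach could look something like this, assuming hypothetical `HDR`/`TRL` marker prefixes (the real files will define their own header/trailer format):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class RecordParser {
    // Hypothetical markers; substitute whatever the real file format uses.
    static final String HEADER = "HDR";
    static final String TRAILER = "TRL";

    /** Streams the input line by line, emitting one record per HDR..TRL block. */
    public static List<List<String>> parse(BufferedReader reader) throws IOException {
        List<List<String>> records = new ArrayList<List<String>>();
        List<String> current = null;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith(HEADER)) {
                current = new ArrayList<String>();       // open a new record
            } else if (line.startsWith(TRAILER)) {
                if (current != null) {
                    records.add(current);                 // record complete: process or batch-insert here
                    current = null;
                }
            } else if (current != null) {
                current.add(line);                        // transaction line inside an open record
            }
        }
        // Lines left in 'current' belong to a record whose trailer never arrived:
        // a real implementation should reject them rather than buffer indefinitely.
        return records;
    }
}
```

In production you would hand each completed record off for processing as soon as its trailer arrives, rather than accumulating all records in a list.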
 
salvin francis
Saloon Keeper
What if a 60 GB file does not have the trailer record? Will you load the complete file into memory? You probably need some guard condition against these types of scenarios.
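One possible guard is a cap on how many lines a single record may buffer before its trailer appears, failing fast instead of eating the heap. The limit here is a made-up tuning knob: set it just above the largest legitimate record you expect.

```java
import java.util.List;

public class RecordGuard {
    /**
     * Adds a transaction line to the currently open record, failing fast
     * if the trailer has not appeared within maxLines.
     */
    public static void addLine(List<String> openRecord, String line, int maxLines) {
        if (openRecord.size() >= maxLines) {
            throw new IllegalStateException(
                    "No trailer after " + maxLines
                    + " lines; aborting instead of buffering the whole file");
        }
        openRecord.add(line);
    }
}
```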
 
salvin francis
Saloon Keeper
salvin francis wrote:...yes, you can read about it here: https://docs.oracle.com/javase/tutorial/essential/io/fileio.html..

I stand corrected here: the new NIO.2 API was actually part of Java 7.

The lines method, which returns a Stream&lt;String&gt;, is part of Java 8:
https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#lines-java.nio.file.Path-
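A small Java 8 sketch of `Files.lines` for comparison. The stream is lazy, so the whole file is never held in memory; counting non-blank lines here is just a placeholder for real processing. Note the try-with-resources: the stream must be closed to release the underlying file handle.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class StreamLinesDemo {
    /** Counts non-blank lines lazily; memory use is independent of file size. */
    public static long countNonBlank(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.filter(l -> !l.trim().isEmpty()).count();
        }
    }
}
```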
 
Vaibhav Gargs
Ranch Hand
Experts, please share your views & experiences...
 
Paul Clapham
Sheriff
Posts: 22719
I'd suggest your best bet, if your current solution doesn't work quickly enough, is to get faster hardware. And make sure that the network connection between your processing machine and the database machine is fast and reliable.
 
Campbell Ritchie
Marshal
Posts: 56223
I merged your stuff with the following thread. I hope that is okay with you.
 
Vaibhav Gargs
Ranch Hand
We are working on proposing a solution for the following system:

1. The system will receive files at a shared path on a daily basis. The file sizes will be really huge, ranging from 10-50 GB. File formats can be text or CSV as of now.
2. The files need to be read & dumped into the corresponding database tables after applying some business logic.
3. Once the data is persisted in the database, other systems can invoke our services (SOAP, REST, MQ) with appropriate requests, and our system will respond to them.

We are looking for a solution that is performance oriented, with no bottlenecks as the system scales going forward.

Please share your thoughts on appropriate design, tech stack etc.
 
Campbell Ritchie
Marshal
This post seems so similar to your old post that it is worth merging the two discussions.
 
Campbell Ritchie
Marshal
Vaibhav Gargs,
I have merged your topic into this topic. I hope that helps.
 
With a little knowledge, a cast iron skillet is non-stick and lasts a lifetime.