Best Approach to read huge files utilizing multithreading

 
Vaibhav Gargs
Ranch Hand
If we have to read huge files (gigabytes in size), what is the optimal way to read such a file using multithreading?
 
Stephan van Hulst
Bartender
Using more than one thread to read a file is usually a really bad idea. However, if the contents of the file were designed for it, you can do it.

The file requires an index of sorts that says which records can be found in what position in the file. Read the index to find out what records the file contains, and divide them up over separate tasks that are responsible for processing a portion of the records.
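A rough sketch of that idea (the index format, RecordLocation, and processRecord() are all made up for illustration):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class IndexedFileProcessor {

    // Hypothetical index entry: where a record starts and how long it is.
    record RecordLocation(long offset, int length) {}

    static void processInParallel(Path dataFile, List<RecordLocation> index, int taskCount)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(taskCount);
        int sliceSize = Math.max(1, (index.size() + taskCount - 1) / taskCount);
        for (int start = 0; start < index.size(); start += sliceSize) {
            List<RecordLocation> slice =
                    index.subList(start, Math.min(start + sliceSize, index.size()));
            pool.submit(() -> {
                // Each task opens its own channel, so positioned reads don't interfere.
                try (FileChannel channel = FileChannel.open(dataFile, StandardOpenOption.READ)) {
                    for (RecordLocation loc : slice) {
                        ByteBuffer buffer = ByteBuffer.allocate(loc.length());
                        channel.read(buffer, loc.offset()); // a robust version would loop until full
                        buffer.flip();
                        processRecord(buffer);              // made-up record handler
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void processRecord(ByteBuffer record) {
        // application-specific processing goes here
    }
}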

Why do you want to do this?
 
Vaibhav Gargs
Ranch Hand
Thank you Stephan. It was an interview question, so I was just thinking about how optimally we can read and process large files.
 
Campbell Ritchie
Marshal

Stephan van Hulst wrote:Using more than one thread to read a file is usually a really bad idea. . . .

Won't the file be locked by the OS, preventing several threads from accessing it in the first place?

Consider reading line by line and passing the results to a parallel stream.
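A minimal sketch of that (assuming one record per line; handleLine() is a made-up handler):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class ParallelLineProcessor {

    public static void main(String[] args) throws IOException {
        // Read the file lazily, line by line, and hand the lines to a parallel stream.
        try (Stream<String> lines = Files.lines(Path.of("huge-file.txt"))) {
            lines.parallel().forEach(ParallelLineProcessor::handleLine);
        }
    }

    static void handleLine(String line) {
        // application-specific processing goes here
    }
}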
 
Stephan van Hulst
Bartender
Operating systems (at least Windows) lock files at the process level. The process that holds the file open can still access it with multiple threads. You will first want to map the file to memory.

Using lines() only works if each record is represented by a single line, and each line can be processed without any other context.

I haven't tested it, but I think you could do something like this:
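A rough sketch, assuming fixed-size records so each thread can work on its own slice of the mapping (RECORD_SIZE and processRecord() are made up):

import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.stream.IntStream;

public class MappedFileProcessor {

    static final int RECORD_SIZE = 128;   // hypothetical fixed record length

    public static void main(String[] args) throws Exception {
        Path file = Path.of("huge-file.dat");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Note: a single mapping is limited to 2 GB; a really huge file
            // would have to be mapped in chunks.
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            int recordCount = (int) (channel.size() / RECORD_SIZE);

            IntStream.range(0, recordCount)
                     .parallel()
                     .forEach(i -> {
                         // slice() gives each task an independent view of its record.
                         ByteBuffer record = map.slice(i * RECORD_SIZE, RECORD_SIZE);
                         processRecord(record);
                     });
        }
    }

    static void processRecord(ByteBuffer record) {
        // application-specific processing goes here
    }
}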
 
Campbell Ritchie
Marshal
Thank you
 
Vaibhav Gargs
Ranch Hand
In an interview, I was asked: if we have, say, 100 files and each file is around 100 MB, what is the optimal way to process these files so that we don't get an out-of-memory error and I/O operations are kept to a minimum? I suggested using 10 threads, each thread reading one file at a time line by line and processing it, but the interviewer was not convinced at all.

What other options could be feasible in such scenarios?
 
Campbell Ritchie
Marshal
What about getting a buffered reader with the Files#newBufferedReader() method, then getting a Stream<String> with its lines() method, and then turning the stream parallel?
Obviously a buffered reader only works on text files.
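For example, a sketch of that idea applied to the 100-file scenario, one file at a time (handleLine() is a made-up handler):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class BufferedLineProcessor {

    public static void main(String[] args) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Path.of("data"), "*.txt")) {
            for (Path file : files) {
                // One file at a time keeps memory use low; the parallel stream
                // spreads the per-line work over the available cores.
                try (BufferedReader reader = Files.newBufferedReader(file);
                     Stream<String> lines = reader.lines()) {
                    lines.parallel().forEach(BufferedLineProcessor::handleLine);
                }
            }
        }
    }

    static void handleLine(String line) {
        // application-specific processing goes here
    }
}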
 
Ron McLeod
Saloon Keeper

Vaibhav Gargs wrote:In an interview, I was asked: if we have, say, 100 files and each file is around 100 MB, what is the optimal way to process these files so that we don't get an out-of-memory error and I/O operations are kept to a minimum?

Vaibhav Gargs wrote:I suggested using 10 threads, each thread reading one file at a time line by line and processing it, but the interviewer was not convinced at all.

Wouldn't that increase the chances of running out of memory compared to reading sequentially with a single thread??
 
Ron McLeod
Saloon Keeper
Plus, if all the files are located in the same file system, it may become I/O bound, with no speed gain from working with multiple files in parallel.
 
Stephan van Hulst
Bartender
Using more threads doesn't minimize I/O operations. You still have to perform the same number of operations, you're just doing more of them at the same time.

Reading large chunks of data into memory minimizes I/O. This is diametrically opposed to keeping memory use low. You're always going to have a trade-off between the number of I/O operations and the amount of memory used.

100 MB is not a crazy lot, as long as you're not working with some sort of embedded device. I would just read the files one at a time and map them to memory completely (either using FileChannel.map(), or FileChannel.read() with a ByteBuffer that's large enough to contain the entire file, or by wrapping a FileInputStream in a BufferedInputStream with a buffer size that's at least as large as the file). That way you have 100 I/O operations (one mapping/buffering operation per file), without a high risk of running out of memory.
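A sketch of the middle option (one bulk read per file into a buffer the size of the whole file; parseRecords() is a made-up stand-in for the actual processing):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WholeFileProcessor {

    public static void main(String[] args) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Path.of("data"))) {
            for (Path file : files) {
                try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                    // One buffer per file (~100 MB), filled with as few reads as possible.
                    ByteBuffer buffer = ByteBuffer.allocate((int) channel.size());
                    while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                        // keep reading until the whole file is in memory
                    }
                    buffer.flip();
                    parseRecords(buffer);
                }
            }
        }
    }

    static void parseRecords(ByteBuffer fileContents) {
        // application-specific processing of the in-memory contents
    }
}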
 
Vaibhav Gargs
Ranch Hand

Ron McLeod wrote:Plus, if all the files are located in the same file system, it may become I/O bound, with no speed gain from working with multiple files in parallel.



If we run them in parallel using multiple threads, then I believe we will see some speed gain, though it will consume more memory.
 
Vaibhav Gargs
Ranch Hand

Campbell Ritchie wrote:What about getting a buffered reader with the Files#newBufferedReader() method, then getting a Stream<String> with its lines() method, and then turning the stream parallel?
Obviously a buffered reader only works on text files.



Yes, I suggested reading chunks of data into buffers. How will the stream help in this case?

As per my understanding, we have the following options:

1. Read each file line by line sequentially
2. Read files in parallel in different threads, line by line
3. Read each file using buffers, sequentially
4. Read files in parallel using buffers in different threads

Is there any other option in the latest versions of Java that I am missing?
 
Ron McLeod
Saloon Keeper

Vaibhav Gargs wrote:If we run them in parallel using multiple threads, then I believe we will see some speed gain, though it will consume more memory.


My point was that the file system does not have infinite throughput/bandwidth. Once you reach its throughput capacity, you may find that overall performance decreases.

Connect your phone or tablet to a desktop using USB and try to transfer a collection of files. You will find that copying the files sequentially takes less time than trying to copy them all in parallel.
 