
better option than TarInputStream to untar a tar file in terms of performance....anyone?

 
ruth abraham
Greenhorn
Posts: 9
Hi guys...
My app currently makes use of the TarInputStream to untar a tar file of around 10k. All that needs to be done after is to take the content files and place them in a separate directory.
Can anyone help me figure out a better way (in terms of performance) to do this untar part?


Thanks!
Ruth
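
For reference, the kind of untar loop being described might look roughly like this. This is a minimal sketch assuming Apache Commons Compress's TarArchiveInputStream (the post doesn't say which TarInputStream implementation is actually in use), with hypothetical paths:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public class Untar {
    // Extracts every entry of tarFile into destDir (paths are hypothetical).
    public static void untar(File tarFile, File destDir) throws IOException {
        try (TarArchiveInputStream tarIn =
                 new TarArchiveInputStream(new FileInputStream(tarFile))) {
            TarArchiveEntry entry;
            while ((entry = tarIn.getNextTarEntry()) != null) {
                File out = new File(destDir, entry.getName());
                if (entry.isDirectory()) {
                    out.mkdirs();
                    continue;
                }
                out.getParentFile().mkdirs();
                try (OutputStream os = new FileOutputStream(out)) {
                    byte[] buf = new byte[8192];
                    int n;
                    // Copy the current entry's bytes into the target file.
                    while ((n = tarIn.read(buf)) != -1) {
                        os.write(buf, 0, n);
                    }
                }
            }
        }
    }
}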
 
Ulf Dittmer
Rancher
Posts: 42968
10K meaning 10000 bytes? That's so small it should hardly take any time. What timings have you done?
 
fred rosenberger
lowercase baba
Bartender
Posts: 12196
What are your documented performance requirements?

Without a well-defined target, how do you know when you're done? It's probably always possible to 'improve performance', but there is a law of diminishing returns here.
 
Winston Gutkowski
Bartender
Posts: 10527
ruth abraham wrote:My app currently makes use of the TarInputStream to untar a tar file of around 10k. All that needs to be done after is to take the content files and place them in a separate directory.

OK. Use the native command.
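
If you go that route, a minimal sketch of shelling out to the system tar might look like this (assuming a Unix-style tar on the PATH; the paths are hypothetical):

import java.io.File;
import java.io.IOException;

public class NativeUntar {
    // Delegates extraction to the operating system's tar command.
    public static void untar(File tarFile, File destDir) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "tar", "-xf", tarFile.getAbsolutePath(),
                "-C", destDir.getAbsolutePath())
            .inheritIO()   // let any tar output/errors go to this process's console
            .start();
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("tar exited with status " + exit);
        }
    }
}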

Can anyone help me figure out a better way (in terms of performance) to do this untar part?

Don't worry about performance until you know it's an issue. Worry about getting it right.

And to that end: Why are you actually writing these files out at all? Unless they are actually needed by other, unrelated, applications, it seems to me that this task is likely to be I/O-bound. And you ain't going to solve that unless you rethink your strategy.

Winston
 
ruth abraham
Greenhorn
Posts: 9
Yes, 10 KB is rather small. But the problem here is that this action is done for 7,500 files of 10 KB each, every 15 minutes. We place the untarred files in another path for consumption by other, unrelated apps. The problem is that the number of files to be untarred every 15 minutes is going to increase from 7,500 to 11,000, and I was hoping there would be a better way than my current approach.
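
(Back-of-the-envelope: 11,000 files of roughly 10 KB is only about 110 MB per 15-minute window, i.e. well under 1 MB/s of sustained throughput, so the per-file overhead of creating thousands of small files is more likely to hurt than the raw data volume.)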
 
Jeff Verdegan
Bartender
Posts: 6109
Again, though, without concrete requirements, and concrete measurements showing how far you are from meeting those requirements, you're stumbling around in the dark.

Is it okay to take the full 15 minutes to handle each batch? If not, how much time are you allowed?

Is it okay for a batch to occasionally take more than 15 minutes to process, or does one batch have to complete before the next one starts?

How long will your current approach take to process 11,000 files? If you need to be done in 15 minutes and it's taking 16, the solution will probably be very different than if it's taking 60 minutes.

How do you know that the TarInputStream is the bottleneck, rather than one of your own classes or some third-party library? Have you used a profiler to measure it, or are you just guessing?

There are many possible ways to speed up the process. It's impossible to say at this point which ones are most appropriate for your case.

  • Get a faster disk.
  • Get a faster CPU.
  • Get more RAM.
  • Use multiple computers in parallel.
  • Put the source and destination on the same physical drive/controller.
  • Put the source and destination on separate physical drives/controllers.
  • Don't use tar.
  • Find a 3rd party library that's faster than the TarInputStream you're currently using.
  • Get hold of the tar spec and write your own TarInputStream.
  • Find the bug in your code that's the real culprit and fix that.
  • Add a BufferedInputStream around your TarInputStream and read chunks at a time rather than individual bytes (see the sketch after this list).
  • Don't do anything, because you hadn't actually measured previously, but now that you did, you find that it's running plenty fast enough.


Some of those are likely to be of little or no value, but without more details, it's impossible to say which ones will be appropriate and which will not.
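
For the buffering suggestion in the list above, the change is small; a sketch, again assuming a Commons Compress-style TarArchiveInputStream:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public class BufferedTar {
    // Wrapping the raw file stream in a BufferedInputStream means each disk read
    // pulls in a large chunk instead of a few bytes at a time.
    public static TarArchiveInputStream open(File tarFile) throws IOException {
        return new TarArchiveInputStream(
                new BufferedInputStream(new FileInputStream(tarFile), 64 * 1024));
    }
}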
     
Winston Gutkowski
Bartender
Posts: 10527
ruth abraham wrote:Yes, 10 KB is rather small. But the problem here is that this action is done for 7,500 files of 10 KB each, every 15 minutes. We place the untarred files in another path for consumption by other, unrelated apps. The problem is that the number of files to be untarred every 15 minutes is going to increase from 7,500 to 11,000, and I was hoping there would be a better way than my current approach.

OK, so it sounds like you're treating the file system like a database - which is not what it was designed for - and I suspect that most of the I/O will be taken up with the "write" side of this task (creating directories, adding nodes and chains, writing files etc.).

In addition to all the things that Jeff listed, there is another possibility if this is a Unix/Linux system (it may also be possible on Windows, but I don't know how):
  • Tune the target filesystem(s) for a small number of bytes per node.
Unix fs's are configured for most general use, but it sounds to me like you're creating tons of very small files inside directory structures; and for that, the default isn't so great.

Whatever you come up with, it strikes me that your current methodology may not be very scalable, so you may actually want to think about completely different strategies, eg:
  • Don't untar at all, and have all applications read the tar files as streams (see the sketch after this list).
  • Create a function that untars files on the fly so that they can be read/scanned by a normal program.
  • Put the data in a database.
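
As an illustration of the "read the tar files as streams" idea, here is a minimal sketch (again assuming Commons Compress; the EntryConsumer callback is hypothetical). Entries are handed to the consumer in memory and never written to disk:

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public class TarScanner {
    // Hypothetical callback: consumes one entry's bytes without touching the disk.
    public interface EntryConsumer {
        void accept(String name, byte[] content) throws IOException;
    }

    public static void scan(File tarFile, EntryConsumer consumer) throws IOException {
        try (TarArchiveInputStream tarIn = new TarArchiveInputStream(
                new BufferedInputStream(new FileInputStream(tarFile)))) {
            TarArchiveEntry entry;
            while ((entry = tarIn.getNextTarEntry()) != null) {
                if (entry.isDirectory()) continue;
                // Buffer the current entry's bytes and pass them to the consumer.
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = tarIn.read(buf)) != -1) {
                    bos.write(buf, 0, n);
                }
                consumer.accept(entry.getName(), bos.toByteArray());
            }
        }
    }
}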

Winston
     