I have some byte offsets in a file. These are the byte positions of some data in another file. I am reading the positions from file1 and using the seek() method to look them up in file2. When I reach the required place in file2, I just call readLine() and print that line.
I am using BufferedReader to read file1 and RandomAccessFile to read file2.
But this whole process is very slow. I need some performance improvement. How does seek() perform vis-a-vis skipBytes() or just skip()? Which is the fastest way to do what I am doing?
I doubt your seek() calls are much of a performance hog unless your files are extremely large and you are doing little else in your code. You should profile your app and see where the real bottleneck is. Have a look at Using Hprof To Tune Performance to get started. Be aware that other factors can drag your app's performance down. Make sure you have enough memory. If your OS runs out of RAM, it will start swapping to disk rather than working on your problem.
Doing 10 million of anything will take a lot of time. Again, make sure your hardware is up to the task. You can't expect high performance out of a desktop computer with a generic IDE hard drive when you are parsing 10 gigs. Lots of memory (use Task Manager to monitor memory use), a fast processor (2.6 GHz Pentium 4s are common now), and a fat IO pipe (ATA100 IDE, SCSI, RAID, in order of increasing performance) are essential. You can justify spending money on hardware because spending time tweaking code costs money too. Not to mention that slow app performance costs the whole company time and money. As for buffering RandomAccessFile, I did a quick search for it and came up with this article. It's for JDK version 1.0.2. That's ancient history. However, looking at the source code for RandomAccessFile (look in your SDK root dir for a src.zip file) you can see that readLine() reads one byte at a time until it hits a line terminator such as '\n'. That's just plain slow, so this approach may still be relevant. I've seen this technique described as "custom buffering", for example, in Java Platform Performance. That book also has good recommendations on how to measure performance and performance improvements. You should use Hprof to get a baseline for your app's performance, then try out these alternatives, using Hprof to see which is the best solution. Good luck! [ July 29, 2004: Message edited by: Joe Ess ]
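To illustrate the "custom buffering" idea mentioned above, here is a minimal sketch, not the code from the article or the book. It seeks to an offset and then pulls bytes in chunks with read(byte[], int, int) instead of one read() call per byte. The class name OffsetLineReader, the method readLineAt and the caller-supplied buffer are all hypothetical, and the sketch assumes single-byte text (decoded as ISO-8859-1, which matches how readLine() widens bytes to chars) with lines ending in '\n' or "\r\n".

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

final class OffsetLineReader {

    /**
     * Hypothetical helper: returns the line that starts at byte offset pos,
     * or null if pos is at or past the end of the file.
     * buf is a reusable scratch buffer supplied by the caller.
     */
    static String readLineAt(RandomAccessFile raf, long pos, byte[] buf) throws IOException {
        raf.seek(pos); // moves the file pointer; no data is read here
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int n;
        // One bulk read per chunk instead of one read() call per byte.
        while ((n = raf.read(buf, 0, buf.length)) > 0) {
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n') {
                    line.write(buf, 0, i); // keep everything before the newline
                    return decode(line);
                }
            }
            line.write(buf, 0, n); // no newline yet, keep the whole chunk
        }
        // End of file reached without a newline.
        return line.size() == 0 ? null : decode(line);
    }

    // Assumption: single-byte text; a trailing '\r' (from "\r\n") is dropped.
    private static String decode(ByteArrayOutputStream line) {
        byte[] b = line.toByteArray();
        int len = b.length;
        if (len > 0 && b[len - 1] == '\r') {
            len--;
        }
        return new String(b, 0, len, StandardCharsets.ISO_8859_1);
    }
}
```

A buffer a little larger than a typical line (a few hundred bytes to a few KB) is probably a reasonable starting point; as with everything else here, measure before and after to see whether it actually helps.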
I am currently working with both RandomAccessFile and DataInputStream (cheers Joe!) on very large files (GBs of data in each). I have found that seek() in RandomAccessFile is much quicker than skipBytes() for moving about in the file. The data I have is a grid of values, and I have two versions: 1) a straight binary file which I access using RandomAccessFile, where I use one seek() to jump into the file and then loop through doing a further seek() for each row of grid data I need to extract (this is a cookie-cut operation for small parts of the main grid); 2) a copy of this data placed in a ZIP archive, because we want to see if some disk space can be saved without compromising performance. This version uses DataInputStream and its skipBytes() method to move between rows in the loop. It is much slower than the RandomAccessFile seek() method, roughly 5 to 10 times slower in fact.
Originally posted by Ben Wood: I am currently working with both RandomAccessFile and DataInputStream (cheers Joe!)
Glad to be of service.
I have found that seek() in RandomAccessFile is much quicker than skipBytes for moving about in the file.
I would expect this. seek() moves the position that the file is being read from, the conceptual "file pointer". skipBytes() reads from the stream and discards the results, so it is still reading data from the disk, the slow part of IO.
Old thread, but just for the record: It's not really fair to compare skipping with DataInputStream vs. RandomAccessFile. The RandomAccessFile has knowledge of its underlying stream... It knows it's talking to a filesystem, so it can take the shortcut of doing a "seek" with the underlying filesystem API. FileInputStream's skip() has the same advantage. However, DataInputStream, and any other general-purpose input stream, doesn't know that its source is a file, so it can't take any low-level shortcuts. Its only way to implement skip() is to read the bytes, which is certainly slower than telling a filesystem to "seek" to a new file position.
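A small sketch of that difference, under stated assumptions: CountingStream is a hypothetical wrapper that only implements read(), standing in for any general-purpose stream (such as a ZIP-backed one) that inherits the default InputStream.skip(). It counts how many bytes get pulled from the source while DataInputStream.skipBytes() is "skipping", and compares that with RandomAccessFile.seek(), which just repositions the file pointer.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;

final class SkipVsSeekDemo {

    /** Hypothetical wrapper: counts bytes actually read from the source. */
    static final class CountingStream extends InputStream {
        private final InputStream in;
        long bytesRead;

        CountingStream(InputStream in) { this.in = in; }

        @Override
        public int read() throws IOException {
            int b = in.read();
            if (b >= 0) {
                bytesRead++;
            }
            return b;
        }

        // skip() is deliberately NOT overridden, so the default
        // InputStream.skip() is used, which reads and discards bytes.

        @Override
        public void close() throws IOException { in.close(); }
    }

    /** seek() repositions the file pointer without reading any data. */
    static long positionAfterSeek(File f, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(offset);
            return raf.getFilePointer();
        }
    }

    /** Returns how many bytes were read from the source while skipping. */
    static long bytesReadWhileSkipping(File f, int offset) throws IOException {
        CountingStream cs =
            new CountingStream(new BufferedInputStream(new FileInputStream(f)));
        try (DataInputStream dis = new DataInputStream(cs)) {
            int remaining = offset;
            while (remaining > 0) {
                int skipped = dis.skipBytes(remaining);
                if (skipped <= 0) {
                    break; // end of stream
                }
                remaining -= skipped;
            }
            return cs.bytesRead;
        }
    }
}
```

With a stream like this, every skipped byte is counted as read, whereas the seek() path touches no data at all. A real ZIP stream additionally has to decompress what it skips, which may go some way toward explaining the 5 to 10 times slowdown reported earlier in the thread.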