I'm encountering something I find very strange when dealing with ZipInputStream.
I compress several files into a zip and store its checksum. When the zip is read back, I verify that checksum.
The thing is, depending on which files I compress, the checksum sometimes does not match on decompression. I found the cause for this, but I don't understand it.
I set up my input streams like so:
Then I just iterate through all zip entries and decompress them all. The thing is that sometimes the ZipInputStream will have no further entries to read, the available() method will return '0' (indicating EOF reached), but the underlying CheckedInputStream will still have some bytes that haven't been read! At this point the input's checksum differs from the original, but if I just read these remaining bytes directly from the CheckedInputStream then the checksums do match.
To be clear on this: when I say 'sometimes' it's not that it's random. It somehow depends on which files I zip; the same files will always yield the same result. Furthermore, the extra bytes would seem to be of no use at all; all compressed files seem to be there on decompression without damage (though I still have to check this last statement more thoroughly).
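For anyone who wants to reproduce this, here is a minimal sketch of the situation described above. It assumes the archive is written through a CheckedOutputStream with a CRC32 and read back through a CheckedInputStream with no buffering in between; the class name, entry names and contents are made up for illustration and are not the original code. My understanding is that the leftover bytes are the central directory that ZipOutputStream appends at the end of the archive, which ZipInputStream never needs to read, but treat that as an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;
import java.util.zip.CheckedOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not the original poster's code.
class ZipChecksumDemo {

    /** Returns {crcWritten, crcAfterLastEntry, leftoverBytes, crcAfterDrain}. */
    static long[] run(int entryCount) {
        try {
            // Write the archive through a checked stream (assumed setup).
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            CheckedOutputStream checkedOut = new CheckedOutputStream(sink, new CRC32());
            try (ZipOutputStream zos = new ZipOutputStream(checkedOut)) {
                for (int i = 0; i < entryCount; i++) {
                    zos.putNextEntry(new ZipEntry("file-" + i + ".txt"));
                    zos.write(("contents of file " + i).getBytes(StandardCharsets.UTF_8));
                    zos.closeEntry();
                }
            }
            long crcWritten = checkedOut.getChecksum().getValue();

            // Read it back entry by entry, with nothing between the two streams.
            CheckedInputStream checkedIn = new CheckedInputStream(
                    new ByteArrayInputStream(sink.toByteArray()), new CRC32());
            ZipInputStream zis = new ZipInputStream(checkedIn);
            byte[] buf = new byte[4096];
            while (zis.getNextEntry() != null) {
                while (zis.read(buf) != -1) {
                    // discard the decompressed data
                }
            }
            long crcAfterLastEntry = checkedIn.getChecksum().getValue();

            // Drain whatever ZipInputStream left unread in the checked stream.
            long leftover = 0;
            int n;
            while ((n = checkedIn.read(buf)) != -1) {
                leftover += n;
            }
            long crcAfterDrain = checkedIn.getChecksum().getValue();
            return new long[] {crcWritten, crcAfterLastEntry, leftover, crcAfterDrain};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] r = run(100);
        System.out.println("CRC of written archive:  " + Long.toHexString(r[0]));
        System.out.println("CRC after last entry:    " + Long.toHexString(r[1]));
        System.out.println("leftover bytes:          " + r[2]);
        System.out.println("CRC after draining:      " + Long.toHexString(r[3]));
    }
}
```

With enough entries there should be leftover bytes after the last entry, and the checksum should only agree with the written one once they have been drained.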
I've been looking around, but found no answer so far and am at a loss on how to solve this.
Any info/tips/suggestions would be highly appreciated. Many thanks already for taking the time to read this.
Best regards. [ August 20, 2008: Message edited by: everton landio ]
everton: The thing is that sometimes the ZipInputStream will have no further entries to read, the available() method will return '0' (indicating EOF reached) but the underlying CheckedInputStream will still have some bytes that haven't been read!
Many thanks Nitesh for your prompt reply. I had read the article you linked, but ZipInputStream redefines the available() method. Taken from the javadoc:
Returns 0 after EOF has reached for the current entry data, otherwise always return 1.
Programs should not count on this method to return the actual number of bytes that could be read without blocking.
My unzipping code is quite similar to yours: I iterate through all entries and extract them. The thing is, I get to the point where read() on the ZipInputStream returns -1, which is consistent with available() returning 0 (EOF for current entry) and getNextEntry() returns null, but there are still six thousand something bytes to be read from the underlying CheckedInputStream. aarrrrggggggghhhhhhhhh!!!
I found this comment:
ZipOutputStream produces a slightly non-standard format. ZipOutputStream puts the compressed and uncompressed size and CRC after the data, instead of in the local header just in front of it.
Could this have anything to do with my problem? Can the remaining bytes be this extra data? I doubt this is the case, since then these leftover bytes would appear when I extracted any zip, instead of just sometimes.
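One way to look into this is to check the flags of the first local file header. As far as I know, bit 3 of the general-purpose flags means that the sizes and CRC follow the entry's data in a data descriptor. That descriptor sits right after each entry rather than at the end of the archive, so it would not explain bytes left over after the last entry. The sketch below assumes the standard local header layout (4-byte signature, 2-byte version, 2-byte little-endian flags); the class and entry names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class for inspecting the first local file header.
class DataDescriptorFlagDemo {

    /** Returns the general-purpose flags of the first entry's local header. */
    static int firstEntryFlags() {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (ZipOutputStream zos = new ZipOutputStream(sink)) {
                // Size and CRC are not set up front, so they cannot go in the header.
                zos.putNextEntry(new ZipEntry("a.txt"));
                zos.write("hello".getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
            byte[] zip = sink.toByteArray();
            // Offsets 6 and 7 hold the flags, least significant byte first.
            return (zip[6] & 0xFF) | ((zip[7] & 0xFF) << 8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int flags = firstEntryFlags();
        System.out.println("data descriptor bit set: " + ((flags & 0x08) != 0));
    }
}
```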
As a matter of fact, I don't use the available() method, I use read(...) until it returns -1. I just also checked that available() returns 0 while debugging, but I can see how my first post could have been confusing in that regard.
Bottom line is: I have a ZipInputStream with no remaining zip entries, no bytes to read from the last of the read entries, but some leftover bytes remaining in the underlying input stream.
I'll keep looking into it and post back here if I find the reason.
Meanwhile, any pointers you guys could give me would be very helpful.
posted 10 years ago
Ok, I have this kinda figured out so I'm posting (quite belatedly) my current solution.
I checked this thoroughly and am confident that those extra bytes are metadata created by the ZipOutputStream.
If the extra bytes are read, the checksums match, and I've used WinMerge to compare several sets of original files against their uncompressed counterparts and did not find a single difference.
I mentioned before that the checksums differed only in some cases. I found out that this was because I had the CheckedInputStream wrapped in a BufferedInputStream. When this metadata was small enough, it was pulled entirely into the buffer while reading the final portion of the last ZipEntry, so it had already passed through the checked stream and the checksums coincided. When the BufferedInputStream was removed, all checksums showed differences, which was what I expected.
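The buffering effect can be sketched like this. It assumes the layout was new ZipInputStream(new BufferedInputStream(checkedStream)) with the default buffer size, and it uses a small in-memory archive so the whole thing fits in one buffer fill; names and contents are invented for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class comparing buffered and unbuffered reads.
class BufferedChecksumDemo {

    static byte[] buildZip(int entryCount) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (ZipOutputStream zos = new ZipOutputStream(sink)) {
                for (int i = 0; i < entryCount; i++) {
                    zos.putNextEntry(new ZipEntry("file-" + i + ".txt"));
                    zos.write(("contents of file " + i).getBytes(StandardCharsets.UTF_8));
                    zos.closeEntry();
                }
            }
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** CRC seen by the checked stream once ZipInputStream reports no more entries. */
    static long crcAfterEntries(byte[] zip, boolean buffered) {
        try {
            CheckedInputStream checkedIn =
                    new CheckedInputStream(new ByteArrayInputStream(zip), new CRC32());
            // The buffer sits between ZipInputStream and the checked stream.
            InputStream source = buffered ? new BufferedInputStream(checkedIn) : checkedIn;
            ZipInputStream zis = new ZipInputStream(source);
            byte[] buf = new byte[4096];
            while (zis.getNextEntry() != null) {
                while (zis.read(buf) != -1) {
                    // discard the decompressed data
                }
            }
            return checkedIn.getChecksum().getValue();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] zip = buildZip(20);
        System.out.println("whole archive:  " + Long.toHexString(crcOf(zip)));
        System.out.println("buffered read:  " + Long.toHexString(crcAfterEntries(zip, true)));
        System.out.println("unbuffered:     " + Long.toHexString(crcAfterEntries(zip, false)));
    }
}
```

For an archive smaller than the buffer, the buffered read should already report the checksum of the whole archive, because the BufferedInputStream pulled the trailing bytes through the checked stream on its own.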
It's kind of weird though that info on this is not readily available... Makes me feel like I'm missing something or that maybe I'm not using the checked streams correctly.
Anyway, there you have it: at the moment I'm reading all remaining bytes directly from the checked stream, and if the checksums match I assume all is well.
As an extra comment, part of the metadata seems to be a timestamp or some other variable thingy: if I zip the exact same set of files on different occasions, the resulting zips have different hashes, which didn't seem to be the case when using other compression tools.
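On the varying hashes: one likely source, as far as I know, is the per-entry modification time, which ZipOutputStream fills in with the current time when the entry has none. A minimal sketch, assuming that is the only varying field, pins the time with ZipEntry.setTime so that two runs on the same machine produce identical bytes. The class name, entry names and the fixed timestamp are arbitrary choices for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class: same input plus fixed entry times gives same bytes.
class DeterministicZipDemo {

    static byte[] buildZip(long fixedTimeMillis) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (ZipOutputStream zos = new ZipOutputStream(sink)) {
                for (int i = 0; i < 3; i++) {
                    ZipEntry entry = new ZipEntry("file-" + i + ".txt");
                    // Pin the modification time instead of letting it default to "now".
                    entry.setTime(fixedTimeMillis);
                    zos.putNextEntry(entry);
                    zos.write(("contents of file " + i).getBytes(StandardCharsets.UTF_8));
                    zos.closeEntry();
                }
            }
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long fixedTime = 1_000_000_000_000L; // arbitrary instant in 2001
        byte[] first = buildZip(fixedTime);
        byte[] second = buildZip(fixedTime);
        System.out.println("identical archives: " + Arrays.equals(first, second));
    }
}
```

When zipping real files, the files' own last-modified times could be used instead of a constant. The stored time is converted using the local time zone, so archives built on machines in different zones may still differ.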