
Compression/Decompression encoding in unix.

 
pawan chopra
Hi All,

I would like to know what encoding is used by compression/decompression algorithms in Unix/Linux. On Windows, for example, Cp437 is used.

 
Jesper de Jong
Which compression/decompression algorithms? Are you talking about gzip, for example? It can compress and decompress any kind of data (text or binary); it doesn't have anything to do with character encodings.

If you mean something else, then please explain in more detail what your question is exactly.
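To illustrate the point about gzip, here is a minimal sketch using `java.util.zip`. The class and method names (`GzipBytesDemo`, `gzip`, `gunzip`) are made up for this example. It compresses the same text once as UTF-8 bytes and once as ISO-8859-1 bytes, and in both cases gets back exactly the bytes that went in, because the compressor only ever sees bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: gzip works on bytes and knows nothing about charsets.
final class GzipBytesDemo {

    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00e4 and \u00f6"; // "ä and ö", written with escapes
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        // Each round trip returns the original bytes, whatever encoding produced them.
        System.out.println(Arrays.equals(gunzip(gzip(utf8)), utf8));
        System.out.println(Arrays.equals(gunzip(gzip(latin1)), latin1));
    }
}
```

The encoding only matters when the bytes are turned back into a `String`, which happens outside the compression step.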
 
pawan chopra
Hi Jesper,

Actually, I am using the Java zip utility to compress some files. The file names contain some Scandinavian letters (like ä and ö). ZipOutputStream uses UTF-8 to write the file names, so after compression I don't get the correct names back. I saw that there is a bug in Java about this. I am trying to change the implementation of the ZipOutputStream class to accept an encoding in its constructor. I tried this on Windows with the encoding Cp437 and it worked fine for me. When I open the same file on Unix, however, it does not work. So I am looking for the encoding that Unix/Linux uses for file names when compressing and decompressing.

Let me know if I have not made myself clear; in that case I will explain further. You can also refer to Corrupt File name.
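For readers on Java 7 or later, patching the class should not be necessary: `ZipOutputStream` and `ZipInputStream` each have a constructor that takes a `Charset` for entry names. The sketch below assumes such a JDK; the class and method names (`ZipNameCharsetDemo`, `zipWithName`, `firstEntryName`) are invented for the example, and the fallback to ISO-8859-1 is only there in case the runtime does not ship the IBM437 (Cp437) charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class: choose the charset used for ZIP entry names.
final class ZipNameCharsetDemo {

    /** Writes a one-entry archive whose entry name is encoded with nameCharset. */
    static byte[] zipWithName(String entryName, Charset nameCharset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, nameCharset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("content".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the first entry name, decoding it with nameCharset. */
    static String firstEntryName(byte[] archive, Charset nameCharset) {
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(archive), nameCharset)) {
            ZipEntry entry = zip.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "\u00e4\u00f6.txt"; // "äö.txt"
        // Cp437 is an alias of IBM437; fall back if this runtime lacks it.
        Charset cs = Charset.isSupported("IBM437")
                ? Charset.forName("IBM437")
                : StandardCharsets.ISO_8859_1;

        byte[] archive = zipWithName(name, cs);
        // Reading with the same charset recovers the original name.
        System.out.println(name.equals(firstEntryName(archive, cs)));
    }
}
```

The name only survives if the writer and the reader agree on the charset. A tool that assumes a different one will not recover the same name, which matches the behaviour described above when the archive is moved between systems.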
 
Marco Ehrentreich
I'm not sure if you have already figured out whether the question was about compression or encoding.

But to convert between different encodings, there's a handy utility called "recode", which makes it easy to convert between encodings such as UTF-8 and latin1.

Marco
 
Tim Holloway
I think you're confusing the compression with the facilities that create ZIP/JAR files in the Java compression classes.

Most of the popular compression algorithms are bit-level (binary) algorithms, so they don't care about code pages. There are, in fact, about 5 different algorithms used in ZIP files, and the normal course of events causes the most effective one to be used. In some cases, that's the "store" algorithm, which doesn't compress at all.

I never really paid attention to the limitations on code pages in a ZIP file directory. The first thing I'd do, however, is check the documentation for ZIP files themselves, since ZIP format was intended to be something that was portable even to the extent that you could move them between ASCII and EBCDIC (IBM mainframe) systems.
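As a small illustration of the "store" method mentioned above: in `java.util.zip` the method is chosen by the caller for each entry, and only `STORED` and `DEFLATED` are available. The sketch below (class and method names are invented) writes the same highly repetitive data both ways and compares the archive sizes. A `STORED` entry needs its size and CRC-32 set before it is written.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class: the same data written as STORED and as DEFLATED.
final class ZipMethodDemo {

    static byte[] zipOneEntry(byte[] data, int method) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            ZipEntry entry = new ZipEntry("data.bin");
            entry.setMethod(method);
            if (method == ZipEntry.STORED) {
                // STORED entries must declare size and checksum up front.
                CRC32 crc = new CRC32();
                crc.update(data);
                entry.setSize(data.length);
                entry.setCompressedSize(data.length);
                entry.setCrc(crc.getValue());
            }
            zip.putNextEntry(entry);
            zip.write(data);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] zeros = new byte[10_000]; // highly compressible input
        int stored = zipOneEntry(zeros, ZipEntry.STORED).length;
        int deflated = zipOneEntry(zeros, ZipEntry.DEFLATED).length;
        System.out.println("stored:   " + stored + " bytes");
        System.out.println("deflated: " + deflated + " bytes");
    }
}
```

Neither method looks at the entry name's encoding; both only process the bytes of the entry's content.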
 
Stefan Wagner
I would recommend iconv, not recode, to convert text files.

UTF-8 should be fine for storing your file names, and it should be understood by Linux as well as Windows, so that's the way to go to avoid conversion trouble.
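For comparison, the kind of conversion that `iconv -f ISO-8859-1 -t UTF-8` performs can be sketched in Java as decode-then-encode. The class and method names here (`TranscodeDemo`, `transcode`, `hex`) are made up for the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: re-encode bytes from one charset to another.
final class TranscodeDemo {

    /** Decodes input using 'from', then encodes the resulting text using 'to'. */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    /** Formats bytes as lowercase hex, two digits per byte. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "äö" in ISO-8859-1 is one byte per letter: E4 F6.
        byte[] latin1 = {(byte) 0xE4, (byte) 0xF6};
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        System.out.println(hex(latin1)); // e4f6
        System.out.println(hex(utf8));   // c3a4c3b6
    }
}
```

The same two letters take two bytes in ISO-8859-1 and four in UTF-8, which is why a file name written in one encoding looks corrupted when a tool decodes it with another.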
 
pawan chopra
I have executed the following experiment:
- created several text files with English, Hungarian, Chinese, Japanese and
Korean names
- attempted to compress them using FilZip, WinZip and PKZip
- attempted to uncompress them using the above tools
My findings are:
- FilZip and WinZip cannot add files with non-English-only names (not even
Hungarian which uses Latin characters); they cannot list files
- PKZip can add files with any names, but names are transformed: all
non-Western European accented Latin characters are converted to similar
character without accent (e.g. ű->u, ő->o) and all non-Latin characters are
converted to question marks; NOTE: Accented Western European characters are
preserved (e.g. áéíóöúüñ), thus Spanish is supported
- WinZip cannot list non-Western European file names, but can extract the
files when "Extract all" is selected; but non-Latin characters are replaced
with underscore (_); since all non-Western European Latin characters are
converted to non-accented Western European ones during compression, these files
are listed and extracted but without accents.
- FilZip and PKZip can display and extract all files but with transformation;
see above

Summary: The ZIP format does not support Unicode in filenames. It might be possible
to pick one specific code page/character set that would be usable for a
specific language, but it is not known how, as the tested tools do not provide
control for this.

Solution: No real solution. As a workaround, Spanish text should be used with all
accented characters replaced with their non-accented relatives (ú->u, ó->o, etc.), or
files should be compressed using the ISO8859P1 character set for filenames.

Note: PKZip is one of the first zip utilities for Windows; WinZip is the market
leader. If they cannot support Unicode, how could we?
 
Tim Holloway
Just to repeat myself: although PKZIP defined the ZIP file standard, the file format standard long ago became independent of whether you used PKWare, Info-ZIP, WinZip or whatever. There are variants, and there have been recent struggles to work around original limitations like 2.2GB/contained file, but there is a published standard, and that's what should be consulted to determine what's allowable for a file contained in a ZIP archive and what options are available.
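On the question of what the published standard allows: later revisions of PKWARE's APPNOTE define bit 11 of the general purpose bit flag as a "language encoding flag", meaning the entry's name and comment are UTF-8. The sketch below reads that flag from the first local file header of an archive. The class and method names are invented, and the remark about when `java.util.zip` sets the bit reflects my understanding of Java 7 and later rather than anything stated in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class: inspect the UTF-8 name flag in a ZIP local header.
final class ZipUtf8FlagDemo {

    static final int LOCAL_HEADER_SIGNATURE = 0x04034b50; // bytes "PK\3\4"
    static final int LANGUAGE_ENCODING_FLAG = 1 << 11;    // bit 11

    /** Returns the general purpose bit flag of the first local file header. */
    static int firstLocalHeaderFlags(byte[] archive) {
        ByteBuffer header = ByteBuffer.wrap(archive).order(ByteOrder.LITTLE_ENDIAN);
        if (header.getInt(0) != LOCAL_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("no local file header at offset 0");
        }
        // Layout: signature (4 bytes), version needed (2 bytes), flags (2 bytes).
        return header.getShort(6) & 0xFFFF;
    }

    static boolean namesMarkedUtf8(byte[] archive) {
        return (firstLocalHeaderFlags(archive) & LANGUAGE_ENCODING_FLAG) != 0;
    }

    static byte[] zipWithName(String entryName, Charset nameCharset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, nameCharset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String name = "\u00e4\u00f6.txt"; // "äö.txt"
        // Assumption: the JDK marks entries as UTF-8 only when UTF-8 is the charset.
        System.out.println(namesMarkedUtf8(zipWithName(name, StandardCharsets.UTF_8)));
        System.out.println(namesMarkedUtf8(zipWithName(name, StandardCharsets.ISO_8859_1)));
    }
}
```

When the bit is not set, the archive itself does not say which code page the names use, so a reader has to guess or be told.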
 
Stefan Wagner
Well - I made 3 empty files with these names:

and packed them into a zip file:
and it doesn't surprise me to find those names inside the file:


Note that we are not talking about the files' contents, but about the filenames.

The displayed filenames may depend on the font used by your programs, so an extraction might be correct even though the preview seems to show corrupted filenames.

I'm sorry the ranch doesn't allow zipfiles (or Jars) to be uploaded.

Update: I put it on my website: http://home.arcor.de/hirnstrom/tmp/suspicious.zip
 