
How to find out the encoding of a text file in Unix Solaris

Hello,

How can I find out the encoding of a text file on Unix Solaris, that is, which encoding was used when the file was created?

Regards,
Mike
Ask the person who was responsible for creating it.
Thanks, Paul, for taking the time to respond to my question.

But that is not what I am looking for. I am asking a technical question: does Unix Solaris provide a utility or command to do this job, or are there any scripts that do it?

Regards
You normally can't tell if there is no meta-information, and simple text files normally don't have any.
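One piece of meta-information a text file sometimes does carry is a Unicode byte-order mark (BOM) at the very start. The following is only a minimal sketch of checking for one; the function name `sniff_bom` is made up for this illustration, and a missing BOM tells you nothing about the encoding.

```python
import codecs

# Checked in this order because the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
    print(sniff_bom(b"plain bytes, no BOM"))
```

Most plain text files, especially those in legacy 8-bit code pages, have no BOM at all, so this only settles the easy cases.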

I once wrote a script that relies on the reader knowing at least one word of the text, which may help decide what encoding it is.

It depends on sed, grep and iconv, which may all be available on Solaris as well.


For example, you take a quick look at the file and see "Begr??ung", and from the context you work out that it has to be "Begrüßung" (you will have very different examples in your language, I guess). Then you start the tool with that word.


Contrary to the usage message, it needn't be an umlaut (diphthong? mutated vowel).
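The original script is not reproduced in this thread. As a rough sketch of the same idea, and not the author's sed/grep/iconv tool, one could decode the bytes under a list of candidate encodings and keep those in which the known word appears. The function name, the candidate list and the sample bytes below are all assumptions made for illustration.

```python
def encodings_matching(data: bytes, known_word: str, candidates):
    """Return the candidate encodings under which known_word appears in data."""
    matches = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue  # bytes are invalid in this encoding, or the codec is unknown
        if known_word in text:
            matches.append(enc)
    return matches


# Hypothetical candidate list; extend it with whatever encodings are plausible.
CANDIDATES = ["utf-8", "iso-8859-1", "cp1252", "cp437", "cp850"]

if __name__ == "__main__":
    # Stand-in for the file contents: the word stored as ISO-8859-1 bytes.
    sample = "Zur Begrüßung".encode("iso-8859-1")
    print(encodings_matching(sample, "Begrüßung", CANDIDATES))
```

More than one encoding can survive the test, because several code pages place these particular letters at the same byte values.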
In other words, you cannot determine the encoding, only deduce it. That's because there's considerable overlap on most code pages, so the only way to figure out if a specific encoding was used is to find a usage that only makes sense for that encoding.
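To make the overlap concrete, here is a small sketch that lists the byte offsets at which two single-byte encodings disagree. The helper name and the sample text are assumptions for illustration; ISO-8859-1 and ISO-8859-15 are used because they differ in only a handful of positions, one of them the euro sign.

```python
def differing_offsets(data: bytes, enc_a: str, enc_b: str):
    """Offsets whose byte decodes differently under two single-byte encodings."""
    offsets = []
    for i in range(len(data)):
        one_byte = data[i:i + 1]
        a = one_byte.decode(enc_a, errors="replace")
        b = one_byte.decode(enc_b, errors="replace")
        if a != b:
            offsets.append(i)
    return offsets


if __name__ == "__main__":
    raw = "Preis: 5 €, Größe".encode("iso-8859-15")
    # Only the euro sign tells the two encodings apart in this text;
    # the umlaut and the sharp s decode identically in both.
    print(differing_offsets(raw, "iso-8859-1", "iso-8859-15"))
```

If the file happens to contain none of the bytes on which the candidates differ, nothing in the data can separate them.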

Note that we're using the word "encoding" here to mean character set encoding. If what you really meant was that you wanted to determine what type of file you're inspecting, there's a mechanism called "magic" that scans files for signatures and deduces the file type from them. For example, the byte sequence 0xCAFEBABE at the head of a file is an indicator that the file is a Java class file.
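A signature check of that kind can be sketched in a few lines. This is only an illustration of the idea, not how any particular "magic" implementation works, and the function names are invented here.

```python
JAVA_CLASS_MAGIC = bytes.fromhex("CAFEBABE")


def looks_like_java_class(data: bytes) -> bool:
    """True if the data starts with the four-byte Java class file signature."""
    return data[:4] == JAVA_CLASS_MAGIC


def file_looks_like_java_class(path: str) -> bool:
    """Read only the first four bytes of the file at path and test them."""
    with open(path, "rb") as handle:
        return looks_like_java_class(handle.read(4))


if __name__ == "__main__":
    print(looks_like_java_class(bytes.fromhex("CAFEBABE0000003D")))
    print(looks_like_java_class(b"Hello, world"))
```

A matching signature is an indicator rather than proof, and plain text files have no such signature, which is why this approach does not answer the character encoding question.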

Tim Holloway wrote:... there's considerable overlap on most code pages,


Yes, but that may not be a problem in practice.

When I get Windows text files, there are normally several source encodings that would produce the desired output, but if it doesn't make a difference, it doesn't make a difference (a tautological proof).

Tim Holloway wrote:
Note that we're using the word "encoding" here to mean character set encoding.


From the first question, I don't have much doubt we're talking about character-encoding.

Stefan Wagner wrote:
When I get Windows text files, there are normally several source encodings that would produce the desired output, but if it doesn't make a difference, it doesn't make a difference (a tautological proof).



Well, not necessarily. I used to work with a system that had originally used IBM mainframe data terminals (green screen) for data entry. The application was COBOL-based, and there was an inherent assumption that the data being entered and stored was going to be US-EBCDIC.

Over the years, the green screens got replaced with IRMA (Windows 3270 emulation software) as PCs replaced the old mainframe "dumb terminals". And people started typing in characters that weren't available on the US 3270 terminal models (where even lower case was often an extra-cost option). Names like Alberto Peña, for example.

Then they started shipping the mainframe data to Java apps running on servers. Due to code page mismatches, the foreign character codes got translated into multi-character sequences by the Java converters, and we ended up with things like "Alberto Pen~a". Which was just the start of our troubles, because this stuff was coming down in rows of fixed-length columns without delimiters, and suddenly some of the fields were no longer the expected size. Nor were the records themselves.

So suddenly it did make a difference, and a significant one at that.
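The effect on fixed-length records can be shown with a toy example. The field layout, the record contents and the use of `str.replace` as a stand-in for the faulty conversion are all assumptions for illustration; the real converters and record formats are not described in this thread.

```python
FIELD_WIDTHS = (15, 5)  # hypothetical layout: a 15-character name, then a 5-character code


def split_fixed(record: str, widths=FIELD_WIDTHS):
    """Cut a record into fields purely by position, as a fixed-width reader would."""
    fields, pos = [], 0
    for width in widths:
        fields.append(record[pos:pos + width])
        pos += width
    return fields


original = "Alberto Peña".ljust(15) + "00042"
# Stand-in for the bad conversion: one character becomes two.
mangled = original.replace("ñ", "n~")

if __name__ == "__main__":
    print(split_fixed(original))
    print(split_fixed(mangled))
    print(len(original), len(mangled))
```

Every field after the expanded character is read from the wrong position, and the record itself no longer has the expected length.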

Tim Holloway wrote:... things like "Alberto Pen~a" ... rows of fixed-length columns


Ouch! I can feel the pain! ;)
