
How to find out the encoding of a text file in Unix Solaris

 
Mike Yu
Ranch Hand
Hello,

How can I find out the encoding of a text file in Unix Solaris, that is, what encoding was used when the file was created?

Regards,
Mike
 
Paul Clapham
Sheriff
Ask the person who was responsible for creating it.
 
Mike Yu
Ranch Hand
Thanks, Paul, for taking the time to respond to my question.

But that is not what I am looking for; I am asking a technical question. I want to know whether Unix Solaris provides a utility/command to do this job, or whether there are any scripts to do it.

Regards
 
Stefan Wagner
Ranch Hand
You normally can't tell if there is no meta-information, and plain text files normally don't have any.

I once wrote a script which relies on the reader knowing at least one word of the text; that can help decide which encoding it is.

It depends on sed, grep and iconv, which are probably all available on Solaris as well.


For example, you take a quick look into the file, see "Begr??ung", and from the context reconstruct that it has to be "Begrüßung" (you will have very different examples in your language, I guess); then you start the tool with that word.

Contrary to the usage message, it needn't be an umlaut (diphthong? mutated vowel).
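The original script wasn't preserved in this thread, but the approach it describes can be sketched as a small POSIX shell function. The candidate-encoding list below is an assumption; extend it with whatever encodings are plausible for your locale.

```shell
# Sketch of the approach described above (not the original script):
# convert FILE from each candidate encoding to UTF-8 with iconv and
# grep the result for a word the reader already knows.
guess_encoding() {
    file=$1
    word=$2
    # Candidate list is an assumption; add encodings relevant to you.
    for enc in ISO-8859-1 ISO-8859-15 CP1252 UTF-8 UTF-16; do
        if iconv -f "$enc" -t UTF-8 "$file" 2>/dev/null | grep -q "$word"; then
            printf '%s\n' "$enc"
        fi
    done
}

# Example: guess_encoding mystery.txt 'Begrüßung'
```

Every encoding that prints is a candidate; as noted above, several often fit, and you can only narrow it down, not prove it.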
 
Tim Holloway
Saloon Keeper
In other words, you cannot determine the encoding, only deduce it. That's because there's considerable overlap on most code pages, so the only way to figure out if a specific encoding was used is to find a usage that only makes sense for that encoding.

Note that we're using the word "encoding" here to mean character-set encoding. If what you really meant was that you want to determine what type of file you're inspecting, there's a mechanism called "magic" that can be used to scan files for signatures and deduce the file type from them. For example, the hex sequence 0xCAFEBABE at the head of a file is an indicator that the file is a Java class file.
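On Solaris (and most Unixes) the file(1) command applies exactly these magic tests, driven by a database of signatures. As a sketch of the idea, here is a check for the 0xCAFEBABE signature mentioned above, using only POSIX od; the function name is made up for illustration.

```shell
# Sketch: detect the 0xCAFEBABE magic at the head of a file, the
# signature of a Java class file. Uses only POSIX od and tr.
is_java_class() {
    # Read the first 4 bytes as lowercase hex and strip spacing.
    sig=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    [ "$sig" = "cafebabe" ]
}

# Example: is_java_class Foo.class && echo "looks like a Java class file"
```

This is how magic works in general: a known byte pattern at a known offset, which is why it identifies file types but cannot identify the character encoding of arbitrary plain text.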
 
Stefan Wagner
Ranch Hand
Tim Holloway wrote:... there's considerable overlap on most code pages,

Yes, but maybe that isn't the problem then.

When I get Windows text files, there is normally a bunch of source encodings which would all fit the desired output; but if it doesn't make a difference, it doesn't make a difference (a tautological proof).
Tim Holloway wrote:
Note that we're using the word "encoding" here to mean character set encoding.

From the first question, I don't have much doubt that we're talking about character encoding.
 
Tim Holloway
Saloon Keeper
Stefan Wagner wrote:
When I get windows-text-files, there is normally a bunch of source-encodings, which would fit to the desired output, but if it doesn't make a difference, it doesn't make a difference - (tautological proof ).


Well, not necessarily. I used to work with a system that had originally used IBM mainframe data terminals (green screens) for data entry. The application was COBOL-based, and there was an inherent assumption that the data being entered and stored would be US-EBCDIC. Over the years, the green screens were replaced with IRMA (Windows 3270 emulation software) as PCs replaced the old mainframe "dumb terminals", and people started typing in characters that weren't available on the US 3270 terminal models (where even lower case was often an extra-cost option): names like Alberto Peña, for example. Then they started shipping the mainframe data to Java apps running on servers. Due to code-page mismatches, the foreign character codes got translated into multi-character sequences by the Java converters, and we ended up with things like "Alberto Pen~a". Which was just the start of our troubles, because this stuff was coming down in rows of fixed-length columns without delimiters, but suddenly some of the fields were no longer the expected size. As were the records themselves.

So suddenly it did make a difference, and a significant one at that.
 
Stefan Wagner
Ranch Hand
Tim Holloway wrote:... things like "Alberto Pen~a" ... rows of fixed-length columns

Ouch! I can feel the pain! ;)

 