How to get CharsetDecoder?

Sep 10, 2012 22:29:22

Hi All,

I have a set of .txt files that use different encodings. For example,

1) a.txt ---> Encoding version is ANSI
2) b.txt ---> Encoding version is UNICODE
3) c.txt ---> Encoding version is UTF-8

Now,

how can I read these files in a single class?

that is,

If the file path I entered in the console is related to a.txt...
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(dataInputStream));

If the file path I entered in the console is related to b.txt...
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(dataInputStream, "UNICODE"));

If the file path I entered in the console is related to c.txt...
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(dataInputStream, "UTF-8"));

How can I find out the encoding of whichever .txt file was entered?

That way, instead of hardcoding UNICODE/UTF-8/ANSI, I can get the encoding in a variable and use that variable as the second parameter to InputStreamReader.

Please help me out.

Sep 11, 2012 02:25:40

lee chan wrote:
How can I find out the encoding of whichever .txt file was entered?

You can't with any certainty. As an indication of this - how could one tell the difference between bytes from the ISO-8859-x family? They all use one byte per character and very very frequently use the same byte values for different characters.

Sep 11, 2012 03:28:34

Hi Tookey,

Thanks for your reply. When saving a file, we can select an encoding option. Please find attached snapshots.

So, how can I get the encodings of those two files in Java?

Sep 11, 2012 03:54:56

Look at this, where you will find that there is no such thing as a "Unicode" encoding. Whoever wrote that save dialogue made a mistake there. You cannot get the encoding from a file unless you recorded it somewhere.

Sep 11, 2012 04:04:12

lee chan wrote:
So, how can I get the encodings of those two files in Java?

If you are only trying to discriminate between text files generated by Notepad with either ANSI encoding or UNICODE encoding, then the first two bytes of the UNICODE encoded file will be (0xff,0xfe), which is a 'Byte Order Mark' or BOM. I stress that this only works when you want to discriminate between those two encodings and will not work for just any old encoding.

Note - Java will not always cleanly handle UNICODE encoded files that have the BOM. Depending on the charset name you decode with, the (0xff,0xfe) BOM pair can come through as a character (U+FEFF) at the start of the text: "UTF-16LE" behaves this way, while "UTF-16" consumes the BOM. The easiest way to deal with this is to just strip the first two bytes when reading the file. See http://code.google.com/p/train-graph/source/browse/trunk/src/org/paradise/etrc/data/BOMStripperInputStream.java?r=31 .
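
A rough sketch of that BOM check, in case it helps. The class and method names (BomSniffer, charsetFromBom, open) are made up for illustration and are not taken from the linked BOMStripperInputStream. It assumes only the UTF-8, UTF-16LE and UTF-16BE BOMs matter, and that the caller supplies a fallback charset for files with no BOM.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: picks a charset from a leading BOM, if there is one.
final class BomSniffer {
    private BomSniffer() {}

    /** Returns the charset implied by a BOM in the first len bytes of head, or null if none. */
    static Charset charsetFromBom(byte[] head, int len) {
        if (len >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE; // what Notepad calls "Unicode"
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    /** Opens a reader, skipping the BOM if present, else decoding with the fallback. */
    static BufferedReader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) { // read up to 3 bytes, tolerating short reads
            int r = pin.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        Charset fromBom = charsetFromBom(head, n);
        int bomLength = fromBom == null ? 0 : (fromBom.equals(StandardCharsets.UTF_8) ? 3 : 2);
        if (n > bomLength) {
            pin.unread(head, bomLength, n - bomLength); // give back the non-BOM bytes
        }
        return new BufferedReader(new InputStreamReader(pin, fromBom != null ? fromBom : fallback));
    }
}
```

You would call it as something like BomSniffer.open(new FileInputStream(path), Charset.forName("windows-1252")). As said above, a missing BOM tells you nothing, so the fallback is still a guess.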

Sep 11, 2012 05:18:13

lee chan wrote: Thanks for your reply. When saving a file, we can select an encoding option. Please find attached snapshots.

Just because you can select an encoding while saving the file does not mean that you can find the encoding when reading the file. Text files do not explicitly store the encoding. As Richard Tookey says, there are different encodings which look a lot like each other, and there is no way to distinguish between them automatically.

There are libraries to guess the encoding, for example juniversalchardet. But these will not always guess the encoding correctly, because that's not possible in principle.
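
To give a feel for what "guessing" means, here is a hand-rolled heuristic. It is not juniversalchardet's API and is much cruder than a real detector: it only checks whether the bytes are well-formed UTF-8 by decoding strictly with a CharsetDecoder, and otherwise returns whatever fallback the caller chose. The class name Utf8Guess and its methods are invented for this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: "is this well-formed UTF-8?" as a guess, not a proof.
final class Utf8Guess {
    private Utf8Guess() {}

    /** True if the whole byte array decodes as UTF-8 without any malformed sequence. */
    static boolean decodesAsUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns UTF-8 if the bytes are well-formed UTF-8, else the caller's fallback. */
    static Charset guess(byte[] bytes, Charset fallback) {
        return decodesAsUtf8(bytes) ? StandardCharsets.UTF_8 : fallback;
    }
}
```

Text in a single-byte encoding that contains accented characters will usually fail this check, but "usually" is the operative word. Passing the check does not prove the file was written as UTF-8, and the fallback is still something you have to decide yourself.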

Sep 11, 2012 06:47:05

lee chan wrote: How can I find out the encoding of whichever .txt file was entered?

So, having read all the good advice so far, the next question is: Are you in control of the text files you're reading in?

If you are, the simplest thing to do would be to first change all the places that write those files to use a standardized file suffix documented by your system. Eg:
.ansi.txt (by which, I assume you mean Windows-1252)
.utf8.txt
.ucs2.txt (your 'Unicode' format, I suspect)
There may even be an existing suffix standard that you could use; but if not, my suggestion would be to keep it as simple and visual as possible.
Also, because UTF-8 and 7-bit ASCII can both be read as "UTF-8", you could simply use it as the "default" (.txt), and use a suffix as above for anything that isn't UTF-8 or 7-bit ASCII.
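
The suffix scheme above could be sketched roughly like this. The class name SuffixCharsets is invented, and the mapping rests on the same assumptions as the list: "ANSI" means Windows-1252, ".ucs2.txt" files are UTF-16 (read with Java's "UTF-16" charset, which uses the BOM to pick the byte order), and anything else is treated as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: maps the agreed file-name suffix to a charset.
final class SuffixCharsets {
    private SuffixCharsets() {}

    static Charset charsetFor(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".ansi.txt")) {
            return Charset.forName("windows-1252"); // assumption: "ANSI" == Windows-1252
        }
        if (name.endsWith(".ucs2.txt")) {
            return StandardCharsets.UTF_16; // BOM decides byte order when decoding
        }
        // ".utf8.txt" and plain ".txt" both fall through to the default.
        return StandardCharsets.UTF_8;
    }
}
```

The result can go straight into the reader, e.g. new InputStreamReader(dataInputStream, SuffixCharsets.charsetFor(path)), which gives you the "variable as second parameter" you asked about.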

Which brings up a final point: There is ONE format that can be distinguished, but only by reading it in its entirety: 7-bit ASCII.
If no byte in the file has a value > 127, then it must be 7-bit ASCII. It may sound crude, but if you have thousands of existing files to "determine", you may find that it culls a large proportion of them, leaving you with only a few to worry about.
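
A minimal sketch of that scan, with made-up names (AsciiCheck, isSevenBitAscii). Java bytes are signed, so any byte with a value above 127 shows up as a negative number.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: checks that no byte has its high bit set.
final class AsciiCheck {
    private AsciiCheck() {}

    /** True if none of the first len bytes of buf is greater than 127. */
    static boolean isSevenBitAscii(byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            if (buf[i] < 0) { // signed byte < 0 means unsigned value > 127
                return false;
            }
        }
        return true;
    }

    /** Reads the stream to the end (or to the first offending byte) in chunks. */
    static boolean isSevenBitAscii(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            if (!isSevenBitAscii(buf, n)) {
                return false;
            }
        }
        return true;
    }
}
```

One caveat on the sketch: a UTF-16 file with no BOM that holds only ASCII-range characters would also pass, because its extra bytes are all 0x00, so it is worth pairing this with the BOM and suffix conventions discussed in this thread.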

If you aren't in control of the files you receive, my suggestion would be to talk with your suppliers about instituting such a system. Alternatively, you could make BOMs (Byte Order Marks) mandatory; but I don't know whether they would cover all the styles you need.

Winston

Sep 11, 2012 07:54:45

Winston Gutkowski wrote: . . . talk with your suppliers . . .

That is a good point. It is the responsibility of the supplier of a file to make sure it is legible, not the responsibility of the recipient to work out how to read it.

Sep 12, 2012 07:59:10

Thanks to all.

I started work as you guys suggested. My work is going smoothly.

Sep 12, 2012 08:08:29

Well done
