lee chan wrote:
How can I know the the entered any .txt file encoding version?
You can't with any certainty. As an illustration: how could one tell apart the encodings in the ISO-8859-x family? They all use one byte per character and very frequently use the same byte values for different characters.
Look at this, where you find there is no such thing as a 'Unicode' encoding: Unicode is a character set, and UTF-8, UTF-16 and so on are encodings of it. Whoever wrote that save dialog made a mistake there. You cannot get the encoding from a file unless you recorded it somewhere.
lee chan wrote:
So, How can I get the encoding formats of those two files in java.
If you are only trying to discriminate between text files generated by Notepad with either ANSI or 'Unicode' encoding, then the first two bytes of the 'Unicode' (UTF-16LE) encoded file will be (0xFF, 0xFE), which is a 'Byte Order Mark' or BOM. I stress that this only works when you want to discriminate between those two encodings; it will not work for just any old encoding.
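A minimal sketch of that check (the class name BomCheck, and reading the whole file rather than just the first two bytes, are just for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomCheck {

    // Returns true if the buffer starts with 0xFF 0xFE, the BOM that
    // Notepad writes at the front of its "Unicode" (UTF-16LE) files.
    public static boolean hasUtf16LeBom(byte[] bytes) {
        return bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFF
                && (bytes[1] & 0xFF) == 0xFE;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(hasUtf16LeBom(bytes)
                ? "Probably Notepad 'Unicode' (UTF-16LE)"
                : "No UTF-16LE BOM - could be ANSI (or almost anything else)");
    }
}
```

Note the `& 0xFF` masking: bytes are signed in Java, so 0xFF read from a file compares equal to -1, not 255, unless you mask it first.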
lee chan wrote:Thanks for your reply. While a file is saving, we can select an option called encode. Please find attached snapshots.
Just because you can select an encoding while saving the file does not mean that you can find the encoding when reading the file. Text files do not explicitly store their encoding. As Richard Tookey says, there are different encodings which look a lot like each other, and there is no way to distinguish between them automatically.
There are libraries to guess the encoding, for example juniversalchardet. But these will not always guess the encoding correctly, because that's not possible in principle.
lee chan wrote:How can I know the the entered any .txt file encoding version?
So, having read all the good advice so far, the next question is: Are you in control of the text files you're reading in?
If you are, the simplest thing to do would be to first change all the places that write those files to use a standardized file suffix documented by your system, e.g.:
.ansi.txt (by which I assume you mean Windows-1252)
.ucs2.txt (your 'Unicode' format, I suspect)
There may even be an existing suffix standard that you could use; but if not, my suggestion would be to keep it as simple and visual as possible.
Also, because 7-bit ASCII is a subset of UTF-8, files in either can be read as "UTF-8", so you could simply use that as the "default" (.txt) and use a suffix as above for anything that isn't UTF-8 or 7-bit ASCII.
Which brings up a final point: There is ONE format that can be distinguished, but only by reading it in its entirety: 7-bit ASCII.
If no byte in the file has a value > 127, then it must be 7-bit ASCII. It may sound crude, but if you have thousands of existing files to "determine", you may find that it culls a large proportion of them, leaving you with only a few to worry about.
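The cull above can be sketched as follows (the class name AsciiCheck is just for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AsciiCheck {

    // A file is 7-bit ASCII only if every byte has a value <= 127,
    // which is why the whole file has to be read to be sure.
    public static boolean isSevenBitAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {   // bytes are signed in Java: values > 127 show up as negatives
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isSevenBitAscii(bytes)
                ? "7-bit ASCII (also valid UTF-8)"
                : "Contains bytes > 127 - not 7-bit ASCII");
    }
}
```

Anything that passes this test is also valid UTF-8 and valid ISO-8859-x, so those files can safely go in your "default" bucket.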
If you aren't in control of the files you receive, my suggestion would be to talk with your suppliers about instituting such a system. Alternatively, you could make BOMs (Byte Order Marks) mandatory; but I don't know whether they would cover all the styles you need.
"Leadership is nature's way of removing morons from the productive flow" - Dogbert
Articles by Winston can be found here