
File input random characters

 
Charlie Lynch
Greenhorn
Posts: 4
Hey, I've got the following program; it compares two values and checks whether they're the same, ignoring case. However, the characters come out as Chinese characters when read from the file, and the program says the files differ after one input.

Thank you!

 
Campbell Ritchie
Marshal
Posts: 54838
Welcome to the Ranch

Don't know, but I do know this is too difficult for the “Beginning” forum, so I shall move you.
Are you using a data input stream to read a text file? Don't. Use things with Reader in their name for text files (or a Scanner). Look in the Java™ Tutorials. Look for the parts about buffered streams and scanning. Note that you get a whole line or a whole word as a String like that, not individual chars, but you can easily get chars out of a String.
Is there any chance that the two files have different encodings? What are the Chinese characters? Have you printed them after reading to see they are the same? I presume they are Unicode supplemental code points (i.e. their Unicode number is > 0xffff), so you have to read them as a code point or two chars. I would have thought however that the same character would return the same code points wherever they are found.
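
A minimal sketch of the Reader approach described above, assuming the text is UTF-8. The class name `ReaderDemo` and the helper `firstLine` are made up for illustration; for a real file, `Files.newBufferedReader(path, charset)` gives the same kind of `BufferedReader`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderDemo {
    // Hypothetical helper: decodes the bytes with the given charset and
    // returns the first line, or null if there is no data at all.
    static String firstLine(byte[] data, Charset charset) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            return in.readLine(); // a whole line as a String, line terminator removed
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory bytes stand in for a file so the sketch runs anywhere.
        byte[] data = "Hello\nWorld\n".getBytes(StandardCharsets.UTF_8);
        String line = firstLine(data, StandardCharsets.UTF_8);
        System.out.println(line);
        // Individual chars are easy to get out of the String afterwards.
        for (char c : line.toCharArray()) {
            System.out.println(c);
        }
    }
}
```

Naming the charset explicitly, rather than relying on the platform default, makes it obvious when the file's encoding and the program's assumption disagree.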
 
Campbell Ritchie
Marshal
Posts: 54838
Do Chinese characters have a concept of capital/small letters?
 
Dave Tolls
Ranch Hand
Posts: 2715
I'm not sure, maybe Charlie could confirm, but it sounds to me like the characters are coming out as Chinese, but that's not what they are in the file?

In any case, that is almost certainly down to reading the file as bytes rather than as a text file (ie using a Stream instead of a Reader, as you said).
 
Paul Clapham
Sheriff
Posts: 22258
Dave Tolls wrote:In any case, that is almost certainly down to reading the file as bytes rather than as a text file (ie using a Stream instead of a Reader, as you said).


I agree. It looks like the file is encoded in one of the ordinary ASCII-like encodings, so it has one byte per character, but then the readChar() method of DataInputStream reads two bytes and tries to interpret them as a Unicode character. Hilarity (or not) ensues.
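
A small sketch of that effect, assuming plain ASCII input. `ReadCharDemo` and `firstChar` are illustrative names, not code from the original program.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadCharDemo {
    // Hypothetical helper: reads one char the way DataInputStream.readChar()
    // does, i.e. two bytes combined with the high byte first.
    static char firstChar(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readChar();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // One byte per character: 'H' is 0x48 and 'e' is 0x65.
        byte[] ascii = "Hello".getBytes(StandardCharsets.US_ASCII);
        char c = firstChar(ascii);
        // The two bytes are fused into the single value 0x4865.
        System.out.printf("U+%04X%n", (int) c);
        // Reports which Unicode script that code point belongs to.
        System.out.println(Character.UnicodeScript.of(c));
    }
}
```

Two ordinary Latin letters end up as one code point in the CJK range, which would explain English text showing up as Chinese characters.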
 
Junilu Lacar
Sheriff
Posts: 10929
@OP

Have a look at the documentation here: https://docs.oracle.com/javase/8/docs/api/java/io/DataInput.html

That defines the behavior of the input stream you're using. If you look at the readChar() documentation, you'll see that it throws an EOFException, so I don't think that your catch (IOException) is actually giving an appropriate message in all cases, and your do-while condition might be pointless.
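
A sketch of what that means in practice, not the original poster's code: `readChar()` never returns -1, so the loop can only end through the exception. `EofDemo` and `countChars` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

class EofDemo {
    // Hypothetical helper: counts how many two-byte chars can be read.
    static int countChars(byte[] data) {
        int count = 0;
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (true) {
                in.readChar(); // throws EOFException when fewer than two bytes remain
                count++;
            }
        } catch (EOFException e) {
            // Running out of data is the expected way for this loop to end.
            return count;
        } catch (IOException e) {
            // Any other I/O failure is a genuine error.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countChars(new byte[] {0, 72, 0, 105}));
    }
}
```

EOFException is a subclass of IOException, so it needs its own catch clause placed first; a lone catch (IOException) would treat a normal end of file the same as a real read error.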
 
Knute Snortum
Sheriff
Posts: 3812
I don't think readChar() is the correct method to use, since a Chinese character can't be represented as a char, can it? 
 
Junilu Lacar
Sheriff
Posts: 10929
Knute Snortum wrote:I don't think readChar() is the correct method to use, since a Chinese character can't be represented as a char, can it? 

This

prints out
台臺颱

Assuming that the files have Chinese characters represented as unicode, then yes, they can be read in as char values.
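
The snippets from this post are not preserved here. As a stand-in, here is a minimal sketch (with made-up names `BmpCharDemo`, `fromChars` and `allFitInOneChar`) showing that these three characters each fit in a single char. They are written as Unicode escapes so the source file's encoding does not matter.

```java
class BmpCharDemo {
    // 台 (U+53F0), 臺 (U+81FA), 颱 (U+98B1) as individual char values.
    static String fromChars() {
        char[] chars = {'\u53f0', '\u81fa', '\u98b1'};
        return new String(chars);
    }

    // True when every code point in s occupies exactly one char,
    // i.e. the string contains no surrogate pairs.
    static boolean allFitInOneChar(String s) {
        return s.codePointCount(0, s.length()) == s.length();
    }

    public static void main(String[] args) {
        String s = fromChars();
        // Whether the glyphs display correctly depends on the console's encoding.
        System.out.println(s);
        System.out.println(allFitInOneChar(s));
    }
}
```

Only code points above 0xFFFF need two chars; the characters in this example are well below that limit.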

And in case you're wondering, this produces the same output:
 
Charlie Lynch
Greenhorn
Posts: 4
Everything is working; just one question. Will the files opened by the two programs below be closed automatically? That's what I've been led to believe, but if someone with a good understanding could give me an explanation, that would be nice.

Campbell Ritchie wrote:Welcome to the Ranch

Don't know, but I do know this is too difficult for the “Beginning” forum, so I shall move you.
Are you using a data input stream to read a text file? Don't. Use things with Reader in their name for text files (or a Scanner). Look in the Java™ Tutorials. Look for the parts about buffered streams and scanning. Note that you get a whole line or a whole word as a String like that, not individual chars, but you can easily get chars out of a String.
Is there any chance that the two files have different encodings? What are the Chinese characters? Have you printed them after reading to see they are the same? I presume they are Unicode supplemental code points (i.e. their Unicode number is > 0xffff), so you have to read them as a code point or two chars. I would have thought however that the same character would return the same code points wherever they are found.

Campbell Ritchie wrote:Do Chinese characters have a concept of capital/small letters?


Sorry, I forgot to clarify that my text files are in English, but the input was read in as Chinese characters. This was because I was using UTF-8 instead of UTF-16, and it was reading two characters as one. I rewrote the program using Readers (below) and also fixed the original program. I now understand that Readers are for text files and input streams are for binary data. Thanks!


Dave Tolls wrote:I'm not sure, maybe Charlie could confirm, but it sounds to me like the characters are coming out as Chinese, but that's not what they are in the file?

In any case, that is almost certainly down to reading the file as bytes rather than as a text file (ie using a Stream instead of a Reader, as you said).


Yes, that is the case; the characters weren't Chinese. I was using UTF-8 encoding instead of UTF-16, and changing that fixed the reading part of the program.

Paul Clapham wrote:
Dave Tolls wrote:In any case, that is almost certainly down to reading the file as bytes rather than as a text file (ie using a Stream instead of a Reader, as you said).


I agree. It looks like the file is encoded in one of the ordinary ASCII-like encodings, so it has one byte per character, but then the readChar() method of DataInputStream reads two bytes and tries to interpret them as a Unicode character. Hilarity (or not) ensues.


Yup, I was using UTF-8 encoding instead of UTF-16 and had not realised that two bytes were read. Thanks!

Junilu Lacar wrote:@OP

Have a look at the documentation here: https://docs.oracle.com/javase/8/docs/api/java/io/DataInput.html

That defines the behavior of the input stream you're using. If you look at the readChar() documentation, you'll see that it throws an EOFException, so I don't think that your catch (IOException) is actually giving an appropriate message in all cases, and your do-while condition might be pointless.


You're right! That do-while statement was useless; I thought -1 was returned when EOF was reached. Thank you.


So I took everyone's suggestions, and it works as intended, thank you! My first problem was encoding my input files as UTF-8 instead of UTF-16, as a few of you pointed out. UTF-8 files can still be compared, but two characters will be read at a time instead of one. Secondly, I thought -1 was returned when EOF was reached, but an exception is actually thrown.

Original, fixed.


Re-wrote using readers
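
The posted programs are not reproduced here. What follows is only a sketch of a Reader-based, case-insensitive comparison, with hypothetical names `CompareFiles` and `sameIgnoringCase`. It also bears on the question about closing: resources declared in a try-with-resources header are closed automatically when the block exits, whether normally or through an exception. Files opened without it are not closed until `close()` is called explicitly.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class CompareFiles {
    // Hypothetical helper: true when both sources have the same lines, ignoring case.
    static boolean sameIgnoringCase(Reader first, Reader second) {
        // Both readers are closed automatically when this block exits;
        // closing a BufferedReader also closes the Reader it wraps.
        try (BufferedReader a = new BufferedReader(first);
             BufferedReader b = new BufferedReader(second)) {
            String lineA = a.readLine();
            String lineB = b.readLine();
            while (lineA != null && lineB != null) {
                if (!lineA.equalsIgnoreCase(lineB)) {
                    return false;
                }
                lineA = a.readLine();
                lineB = b.readLine();
            }
            // Equal only if both sources ran out of lines at the same time.
            return lineA == null && lineB == null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // StringReader stands in for files; for real files one could pass
        // Files.newBufferedReader(path, charset) instead.
        System.out.println(sameIgnoringCase(
                new StringReader("Hello\nWorld"), new StringReader("hello\nWORLD")));
    }
}
```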


Thanks!
 
Campbell Ritchie
Marshal
Posts: 54838
Well done

But avoid while (true) ...
Write this, which appears strange but avoids the break; well, it avoids the first break;

I do not know whether the tests for isReady() are necessary. I haven't seen it before.
I would have had two loops each reading the lines into a List<String>. That will simplify the loop headers because you will only need one null test at a time. Also, you have all the details in memory and you can compare the lines afterwards outwith the confines of that loop.
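
A sketch of the "read each file into a List<String>" idea, assuming a helper called `readLines` (a made-up name) that would be called once per file. The loop header does the read and the null test together, so no while (true) or break is needed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ReadLines {
    // Hypothetical helper: reads every line from the source into a List.
    static List<String> readLines(Reader source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            // readLine() returns null at end of input, which ends the loop.
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // StringReader stands in for a file so the sketch runs anywhere.
        List<String> lines = readLines(new StringReader("one\ntwo\nthree"));
        System.out.println(lines);
    }
}
```

For real files the standard library already offers `Files.readAllLines(path, charset)`, which returns the same kind of list.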

Again, well done. You have sorted out that problem with a little help from us and kept us up to date with what is happening.
 
Paul Clapham
Sheriff
Posts: 22258
Charlie's corrected code is definitely on the right track, but it will crash if the two files don't have the same number of lines. Your suggested modification (I think) won't crash in that case, but it also won't report that the files differ.
 
Charlie Lynch
Greenhorn
Posts: 4
Campbell Ritchie wrote:Well done :)

But avoid while (true) ...
Write this, which appears strange but avoids the break; well, it avoids the first break; :wink:

I do not know whether the tests for isReady() are necessary. I haven't seen it before.
I would have had two loops each reading the lines into a List<String>. That will simplify the loop headers because you will only need one null test at a time. Also, you have all the details in memory and you can compare the lines afterwards outwith the confines of that loop.

Again well done :) You have sorted out that problem with a little help from us and kept us up to date with what is happening.


Thank you! I've re-written the program as you've suggested and it fixes the issue stated below as well. Any suggestions would be helpful!

Paul Clapham wrote:Charlie's corrected code is definitely on the right track, but it will crash if the two files don't have the same number of lines. Your suggested modification (I think) won't crash in that case, but it also won't report that the files differ.

Both of your statements are correct; I just tested. How would you write it?

I've rewritten the program as suggested above by Campbell, and it fixes the file line issue.



Thank you everyone!
 
Paul Clapham
Sheriff
Posts: 22258
That looks better. It seems to me that you don't need to make a special case for empty files, though; two empty files are equal and an empty file is never equal to a non-empty file, and the rest of the code you have there will say so without modification. Unless your requirements say you have to do that, of course.

One more minor thing:



This is meant to process the entries in the list sequentially starting at index 0, and that's what it does. However the time-honoured way to write that in Java, and in other languages too, is like this:



This idiom has probably been used about a billion times to date, so when a Java programmer sees it they will instantly understand what it does. Your version, although it's equally correct, will cause them to stop and say "Huh?" before they realize what it does.
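
Neither loop from this post is preserved here. Assuming the idiom meant is the standard indexed for loop (start at 0, test with `<` against the size, increment with `i++`), a sketch looks like this; `IndexedLoop` and `joinWithIndex` are illustrative names only.

```java
import java.util.List;

class IndexedLoop {
    // Hypothetical helper: labels each line with its index.
    static String joinWithIndex(List<String> lines) {
        StringBuilder out = new StringBuilder();
        // The conventional shape: int i = 0; i < size; i++
        for (int i = 0; i < lines.size(); i++) {
            out.append(i).append(": ").append(lines.get(i)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(joinWithIndex(List.of("first", "second")));
    }
}
```

When the index itself is not needed, the enhanced for loop (`for (String line : lines)`) is equally familiar to Java readers.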

 
Campbell Ritchie
Marshal
Posts: 54838
Agree that is a lot better. I would suggest you reconsider the names of some of those variables, however. Calling a List file1 can cause confusion especially when you are reading from files.
Once you have the two Lists, you can do much more than simply test whether they are equal. You can probably do that with the equals() method of the Lists. You can find the lines which differ; you can even find whether a line has been added or removed, like a diff program.
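
A rough sketch of both ideas, using made-up names `ListDiff` and `firstDifference`. `List.equals` compares sizes and then each pair of elements with their own `equals()`, so Strings are compared by content rather than by reference.

```java
import java.util.List;

class ListDiff {
    // Hypothetical helper: index of the first line that differs, or -1 if
    // the lists are equal. If one list is a prefix of the other, the index
    // just past the shorter list is returned.
    static int firstDifference(List<String> left, List<String> right) {
        int shared = Math.min(left.size(), right.size());
        for (int i = 0; i < shared; i++) {
            if (!left.get(i).equals(right.get(i))) {
                return i;
            }
        }
        return left.size() == right.size() ? -1 : shared;
    }

    public static void main(String[] args) {
        List<String> lines1 = List.of("alpha", "beta", "gamma");
        List<String> lines2 = List.of("alpha", "BETA", "gamma");
        System.out.println(lines1.equals(lines2));
        System.out.println(firstDifference(lines1, lines2));
    }
}
```

This only finds the first mismatch by position; detecting added or removed lines the way a diff program does needs a proper alignment algorithm, which is beyond this sketch.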
 
Charlie Lynch
Greenhorn
Posts: 4
Paul Clapham wrote:That looks better. It seems to me that you don't need to make a special case for empty files, though; two empty files are equal and an empty file is never equal to a non-empty file, and the rest of the code you have there will say so without modification. Unless your requirements say you have to do that, of course.

One more minor thing:



This is meant to process the entries in the list sequentially starting at index 0, and that's what it does. However the time-honoured way to write that in Java, and in other languages too, is like this:



This idiom has probably been used about a billion times to date, so when a Java programmer sees it they will instantly understand what it does. Your version, although it's equally correct, will cause them to stop and say "Huh?" before they realize what it does.



Yeah, habit from C++. I only started learning Java a few days ago, so I need to get used to some of the idioms. Thanks!

Campbell Ritchie wrote:Agree that is a lot better. I would suggest you reconsider the names of some of those variables, however. Calling a List file1 can cause confusion especially when you are reading from files.
Once you have the two Lists, you can do much more than simply test whether they are equal. You can probably do that with the equals() method of the Lists. You can find the lines which differ; you can even find whether a line has been added or removed, like a diff program.


What names would you suggest? I didn't realise I could use the List's equals() because I thought it would use the == operator on the Strings and compare the references rather than the contents, but I seem to be wrong. Yeah, the plan is to make it work like the diff program as I progress (and get the time). Thanks!
 
Campbell Ritchie
Marshal
Posts: 54838
There is documentation explaining how the equals() method works. There is no need to think how a method works when the details are available like that.
You might call the lists lines1 and lines2, but I am feeling uninspired about variable names just at the moment.
Your code all in the main method is confusing. Paul C is right that a single loop will not work correctly; it will, for example, fail to read anything if one of the files is empty and the other contains text, and will return an incorrect result. You should have several methods: one to read the file into a List (called twice), and some other method to look for differences. If you simply need to see whether the two files contain the same text, List#equals will probably do.

Beware of files created on different systems. A Windows file will have \r\n at the end of each line and one created on Unix/OSX/Linux will have \n there, so they will have a different size and will be technically different, even if the text they contain is the same. This technique will lose all the line end characters.
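
A small sketch of that last point, with in-memory strings standing in for a Windows file and a Unix file. `LineEndings` and `lines` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineEndings {
    // Hypothetical helper: splits text into lines the way readLine() does.
    static List<String> lines(String text) {
        List<String> result = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String line;
            // readLine() accepts \n, \r or \r\n as a terminator and strips it.
            while ((line = in.readLine()) != null) {
                result.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String windows = "one\r\ntwo\r\n";
        String unix = "one\ntwo\n";
        // The raw text differs in length, but the lists of lines are equal.
        System.out.println(windows.length() == unix.length());
        System.out.println(lines(windows).equals(lines(unix)));
    }
}
```

So a line-based comparison treats such files as identical, while a byte-for-byte comparison would not; which answer is right depends on what "the same" is supposed to mean.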
 