Problems Reading UTF-8 File

 
Greenhorn
Posts: 25
Hi guys, I need help clarifying this issue.

I am working with the SunFtpClient class in a project that involves
downloading file contents from an FTP server on a Unix machine. I create some
files in Notepad, write the content, and then save them with UTF-8 encoding.

Next I transfer the file from my machine to the FTP server
in binary mode. Up to here everything is OK. The problem appears when
I execute this piece of code. Let's review
the code, and then I'll describe the problem.



The content of Message.txt is the following:
One Two Three Four Five Six

The content of LocalMessage.txt is the following:
?One Two Three Four Five Six

Somebody could ask: what is the problem?

The problem is that although I pass UTF-8 to InputStreamReader as the
encoding, the BOM bytes are not filtered out, and I suppose the ? character at the start of LocalMessage.txt is the result of those bytes. Why is InputStreamReader converter = new InputStreamReader(ftp.get("Message.txt"), "UTF-8"); not working properly?

I appreciate your comments


















 
Sheriff
Posts: 28408
You are correct. When Notepad writes a file in UTF-8 encoding, it puts the BOM (byte order mark) at the beginning of the file. This is unnecessary since byte ordering is unambiguous in an 8-bit encoding, but it does it anyway. So the BOM is there.

You would think that a Java Reader that is decoding from UTF-8 would notice that there's a BOM at the beginning of the file, since the UTF-8 specification says it may be there. But no, it doesn't. The three BOM bytes (EF BB BF) are decoded into the single character U+FEFF and handed to you like any other character. So it's up to you to read those bytes (or that character) and ignore them.
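One way to do that is sketched below. This is only an illustration, not the one correct approach: the class name BomSkippingReader and its methods are made up for this example, and it assumes the stream really is UTF-8. It reads the first decoded character and pushes it back unless it is U+FEFF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the JDK or of SunFtpClient.
class BomSkippingReader {
    // The UTF-8 BOM bytes EF BB BF decode to this single character.
    private static final int BOM = 0xFEFF;

    // Wraps a UTF-8 stream in a Reader that drops one leading BOM, if present.
    static Reader open(InputStream in) throws IOException {
        PushbackReader reader =
                new PushbackReader(new InputStreamReader(in, StandardCharsets.UTF_8), 1);
        int first = reader.read();
        if (first != -1 && first != BOM) {
            reader.unread(first); // not a BOM, so put the character back
        }
        return reader;
    }

    // Reads everything that remains in the reader into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Convenience wrapper for in-memory data, used by the demo below.
    static String decodeUtf8(byte[] bytes) {
        try (Reader r = open(new ByteArrayInputStream(bytes))) {
            return readAll(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'O', 'n', 'e'};
        System.out.println(decodeUtf8(withBom));
    }
}
```

With the code from the question, that would mean passing the stream through the helper, for example BomSkippingReader.open(ftp.get("Message.txt")), instead of constructing the InputStreamReader directly. Files without a BOM pass through unchanged.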
 