
Reading a UTF-8 Encoded File

 
Oliver Moore
Ranch Hand
Posts: 44
Hi All,

I've written a program which parses several templates, inserting relevant values, and then writes them to a new file.

When I read a file that was created in Notepad and saved as UTF-8 encoded text, a junk (unknown) character gets inserted as the first character of the resulting String built from the incoming stream.

If I write the String containing the junk character out to a file after manipulation, the file is created with no problem and the junk character is not shown.

If I insert the String representing the file into another String and then write a file out containing both Strings, the first character before the inserted String will be junk.

As you can see from my code, I'm cutting the first character off any incoming file to resolve this. However, in some cases I need to re-read and insert Strings into files that have been written using my output code (below). If I read a file which has been created using my own code, the junk character is not present on reading the file.

Is this known behaviour (e.g. Notepad doesn't implement UTF-8 files in the same way as Java) or is there an error in my code that is causing it? I can work around it by noting whether the file I'm reading is a native file or has been created in Windows, but this seems to be a workaround rather than solving the problem directly.

Any help would be much appreciated.

Regards,

Oliver

My code for reading the file:
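(A minimal sketch of a reader along the lines described, assuming a BufferedReader over an InputStreamReader opened with the UTF-8 charset; the names and structure are illustrative, not necessarily the poster's actual code.)

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class TemplateReader {

        // Reads the whole file into a String, decoding the bytes as UTF-8.
        public static String readTemplate(String path) throws IOException {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(path), "UTF-8"));
            StringBuffer contents = new StringBuffer();
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    contents.append(line);
                    contents.append('\n');
                }
            } finally {
                reader.close();
            }
            return contents.toString();
        }
    }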



Code for Writing File
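(Again a minimal sketch only, assuming an OutputStreamWriter opened with the UTF-8 charset; the names are illustrative, not the poster's actual code.)

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    public class TemplateWriter {

        // Writes the String out, encoding it as UTF-8. Note that
        // OutputStreamWriter itself never writes a byte-order mark.
        public static void writeFile(String path, String contents) throws IOException {
            Writer writer = new OutputStreamWriter(new FileOutputStream(path), "UTF-8");
            try {
                writer.write(contents);
            } finally {
                writer.close();
            }
        }
    }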
 
David Harkness
Ranch Hand
Posts: 1646
Can you open the file in a text editor that shows the contents in hex so you can see exactly what the junk character is? And what is it? Very bizarre.

BTW, in scanning your reading code I noticed the String concatenation. Note that javac uses temporary StringBuffers to concatenate Strings, so you're creating an unneeded StringBuffer for each line of the file. Try this:
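(A sketch of the kind of change being suggested: append each line to a single StringBuffer instead of concatenating Strings inside the loop.)

    // Instead of something like:
    //     contents = contents + line + "\n";   // creates a new temporary StringBuffer every pass
    // append to one StringBuffer (or StringBuilder on Java 5+).
    // Assumes 'reader' is the BufferedReader from the reading code above.
    StringBuffer contents = new StringBuffer();
    String line;
    while ((line = reader.readLine()) != null) {
        contents.append(line);
        contents.append('\n');
    }
    String result = contents.toString();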
 
Oliver Moore
Ranch Hand
Posts: 44
Hi David,

Thanks for the coding tip.

I've had a look at the files being created using PSPad and the characters that seem to be present are ï»¿ (hex EF BB BF). If I look at the base file I'm parsing in PSPad, these characters are not visible (in hex or otherwise).

If I look at the first line of any Notepad-created file I parse, by printing it to the console, the first character is always ? (I assume that means an unknown character).

If I modify the String and then write it out to a file, it's fine. It's just when I insert one parsed template into another that I see these characters carried into the file. I assume this is because the file writer discards the junk characters prior to writing if they're at the start of the String to be written, but will carry them if they're present elsewhere in the String.

I hope this information is of help.
 
Oliver Moore
Ranch Hand
Posts: 44
Hi All,

Having looked at this a bit further, I think I'm having a problem with Notepad inserting a UTF-8 byte-order mark of EF BB BF in any files saved as UTF-8.

I'm assuming that Java doesn't add this BOM when it writes a UTF-8 file.

I'm just surprised that Java parses in the BOM when it reads the template, thus causing the problem I'm seeing. Is anyone aware of a way around this? I'm going to try saving the files without the BOM (I think jEdit can do this) but I'd like to know if there's a better way.

I guess I could examine the incoming stream in binary, look for the BOM, and filter it out if present.
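(A rough sketch of that idea, assuming the stream is wrapped in a PushbackInputStream so that anything other than a UTF-8 BOM is pushed back before the InputStreamReader sees it; the class and method names are illustrative.)

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.PushbackInputStream;

    public class BomSkipper {

        // Consumes a leading UTF-8 BOM (EF BB BF) if one is present;
        // any other leading bytes are pushed back untouched.
        public static InputStream skipUtf8Bom(InputStream in) throws IOException {
            PushbackInputStream pushback = new PushbackInputStream(in, 3);
            byte[] bom = new byte[3];
            int read = pushback.read(bom, 0, 3);
            boolean isBom = read == 3
                    && bom[0] == (byte) 0xEF
                    && bom[1] == (byte) 0xBB
                    && bom[2] == (byte) 0xBF;
            if (!isBom && read > 0) {
                pushback.unread(bom, 0, read);
            }
            return pushback;
        }
    }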

Here are a couple of useful links I found whilst looking at this issue:

Sun Forums
Unicode Transformation Formats: UTF-8 & Co.
 
Oliver Moore
Ranch Hand
Posts: 44
Hi All,

Apparently this is a long-standing bug:

UTF-8 encoding does not recognize initial BOM

This class seems like a reasonable solution:

UnicodeReader and UnicodeInputStream
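(Going by its description, the linked UnicodeReader wraps an InputStream, skips any BOM it finds, and otherwise falls back to a default encoding. The constructor arguments shown below are an assumption based on that description, so check the linked source before relying on them.)

    // Assumed usage -- the (InputStream, default encoding) constructor is an
    // assumption, not taken from the linked code itself.
    BufferedReader reader = new BufferedReader(
            new UnicodeReader(new FileInputStream("template.txt"), "UTF-8"));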

Hope this is of use.
 
David Harkness
Ranch Hand
Posts: 1646
Good investigative work on your part! Looks like my work here is done.

Seriously, thank you for following up with the solution in case anyone else comes across the same problem.
 