This week's book giveaway is in the Cloud/Virtualization forum.
We're giving away four copies of Mastering Corda: Blockchain for Java Developers and have Jamiel Sheikh on-line!
See this thread for details.
Win a copy of Mastering Corda: Blockchain for Java Developers this week in the Cloud/Virtualization forum!
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Paul Clapham
  • Ron McLeod
  • Bear Bibeault
  • Liutauras Vilda
Sheriffs:
  • Jeanne Boyarsky
  • Tim Cooke
  • Junilu Lacar
Saloon Keepers:
  • Tim Moores
  • Tim Holloway
  • Stephan van Hulst
  • Jj Roberts
  • Carey Brown
Bartenders:
  • salvin francis
  • Frits Walraven
  • Piet Souris

Reading a UTF-8 file & Writing it to a UTF-8 file

 
Ranch Hand
Posts: 884
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Hi guys,

I'm trying to read an XML file that has Chinese characters in it & then output it to another XML file. But my output displays ??? when I open it with wordpad/IE.

Using WordPad, I pasted some Chinese characters which I got from website & saved the file as an Unicode document (infile.xml). The input file displays the characters correctly when I opened it with wordpad/IE.

I'd specified the encoding type to be UTF-8 for the output. Am I missing something here?

Any help is much appreciated. Thanks.

 
Ranch Hand
Posts: 805
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Hi, Cheng -

Chinese characters fall under the Unicode 16-bit character set. By opening your read file as UTF-8, you are in effect splitting the first 8 bits from the input character thus destroying the format of the original 16-bit character. So, if I understand what's going on here correctly, you're reading a 16-bit Unicode character as two 8-bit characters. Possibly, when you re-write the 8-bit characters, the order or endian-ness of the characters gets reversed.

Hope that helps. (Or even makes sense. I need another cup of coffee...)
 
Chengwei Lee
Ranch Hand
Posts: 884
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Hi folks,

Am still having problems with trying to read a chinese file & output it to another chinese file. This time round, I've used the GB2312 character encoding instead.

The codes remain largely the same, but my input & output files are now htm.



Somehow, when I view the outfile.htm using IE, it still cannot display the characters properly.

My infile.htm simply consists of 2 characters: 密码

What am I doing wrong here or missing?

Any suggestions or help is greatly appreciated.

Thanks!
 
Pay attention! Tiny ad!
Building a Better World in your Backyard by Paul Wheaton and Shawn Klassen-Koop
https://coderanch.com/wiki/718759/books/Building-World-Backyard-Paul-Wheaton
reply
    Bookmark Topic Watch Topic
  • New Topic