Converting unicode chars

 
Jon Krogell
Greenhorn
Posts: 16
I'd be glad if someone could provide me with some code that would:
read a file, convert all unicode chars to their UTF-8 character code (not sure I'm using the correct words here) and write the whole thing to a file
char examples: //without spaces between & and #
' ==> & #039;
ä ==> & #228;

Meaning that a file:
## start file ##
Hyvää päivää means good morning, doesn't it?
## end file ##
would be converted to: //without spaces between & and #
## start file ##
Hyv& #228;& #228; p& #228;iv& #228;& #228; means good morning, doesn& #039;t it?
## end file ##

Thanks in advance for any help.
<added>What's the point of the CODE tag if it converts characters? That's exactly what it shouldn't do.</added>

------------------
mwaf @ http://dmoz.org/
[This message has been edited by Jon Krogell (edited March 06, 2001).]
 
Jim Baiter
Ranch Hand
Posts: 532
I think if, for each line read from a BufferedReader (called reader here), you do:
byte[] b = reader.readLine().getBytes("UTF-8");
then, if you want a String, just do:
String uniStr = new String(b, "UTF-8");
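
A minimal sketch of what that round trip does, assuming the line has already been read into a Java String (RoundTripDemo and its method names are made up for illustration). Encoding to UTF-8 and decoding the same bytes as UTF-8 gives back an equal String, so on its own it cannot produce escapes such as & #228;:

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes the text as UTF-8 bytes and decodes those bytes straight back.
    // The result equals the input for any String made of valid characters.
    static String roundTrip(String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        return new String(b, StandardCharsets.UTF_8);
    }

    // Number of bytes the text occupies in UTF-8; the only place where a
    // non-ASCII character looks any different is this intermediate array.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For example, RoundTripDemo.roundTrip("p\u00e4iv\u00e4\u00e4") returns a String equal to its argument, while utf8Length reports 9 bytes for those 6 characters because each ä takes two bytes in UTF-8.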
 
Jon Krogell
Greenhorn
Posts: 16
Thanks for the response. Although, it really didn't do anything. Meanwhile, I had actually solved this myself (while my internet connection was down). My solution probably isn't the most effective one so I'd be glad to have any comments on it.

Just noticed, it won't write anything to the file (except creating it).
Did a slight change, I still don't see why this doesn't work (I have a feeling it worked earlier today).
OK, it works now; I added file_out.flush().

[This message has been edited by Jon Krogell (edited March 08, 2001).]
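
The code from this post did not survive in the thread, so the following is not Jon's solution. It is only a sketch of the conversion requested in the first post, assuming the input file is UTF-8 and that only apostrophes and non-ASCII characters need escaping; NcrEscaper, escape and convert are hypothetical names. The try-with-resources block closes the writer, which also flushes it, so no separate flush() call is needed:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class NcrEscaper {
    // Replaces apostrophes and every non-ASCII code point with a decimal
    // numeric character reference, zero-padded to at least three digits
    // to match the examples in the first post. Other characters are copied.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp > 127 || cp == '\'') {
                String digits = Integer.toString(cp);
                out.append("&#");
                for (int pad = digits.length(); pad < 3; pad++) {
                    out.append('0');
                }
                out.append(digits).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    // Reads the input as UTF-8 line by line and writes the escaped text.
    // Assumption: normalizing line endings to the platform default is fine.
    static void convert(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.US_ASCII)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(escape(line));
                writer.newLine();
            }
        } // closing the writer here flushes any buffered output
    }
}
```

Ampersands are deliberately left alone in this sketch; whether they should also be escaped depends on what the output file is used for.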
 
Mark Savory
Ranch Hand
Posts: 122
Hey Jon,
I think that maybe you're not using the right words when you say "convert all unicode chars to their UTF-8 character code". When you read in the original file, I believe that your Latin-1 characters, along with the rest of the ASCII text, are already encoded in UTF-8. "&#228;" is some sort of escape sequence. I'd like to know what you're doing with the final file.
 
Jon Krogell
Greenhorn
Posts: 16
Yes, the file I'm reading is encoded in UTF-8, and I want to escape some chars. I need to do this because:

Gives me a "Malformed UTF-8 char -- is an XML encoding declaration missing?" exception (it's a SAXParseException if I recall correctly).
 
Jon Krogell
Greenhorn
Posts: 16
OK, I figured out a much faster way for doing the same as the previous code.

Again, if someone has comments on the code, how to improve it and such, suggestions are welcome.
Did a slight modification; my getBytes trick doesn't work with & chars. I also added a few spaces so that the code would display correctly, except for the spaces, of course.
[This message has been edited by Jon Krogell (edited March 09, 2001).]
Aarrgh, forget this (anyone actually reading this thread). I get a strange bug (it'll print & amp; until the end of time) with the real file (it worked fine with a test file). Also, I have taken into consideration that a line containing UTF-8 chars (or whatever) might also contain & chars.
[This message has been edited by Jon Krogell (edited March 09, 2001).]
Got it working (at last): I split it into two methods and it works now.

[This message has been edited by Jon Krogell (edited March 10, 2001).]
 
Mark Savory
Ranch Hand
Posts: 122
Jon,
We had this same parsing error. We fixed it by one simple change to the prolog, for example:
<?xml version="1.0" encoding="ISO-8859-1"?>
P.S.
I think that maybe your question would have been answered faster had you posted it in the XML topic area.
[This message has been edited by Mark Savory (edited March 12, 2001).]
[This message has been edited by Mark Savory (edited March 12, 2001).]
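
A small sketch of how the encoding declaration changes what the parser does with the same bytes, using the JDK's built-in DOM parser (PrologDemo and rootText are hypothetical names, and the exact exception type and message for malformed input vary by parser). Without a declaration or byte order mark the parser assumes UTF-8, so Latin-1 bytes for ä are rejected; with encoding="ISO-8859-1" they are decoded as intended:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class PrologDemo {
    // Parses raw bytes and returns the text content of the root element.
    // The parser chooses the decoding from the XML declaration (or a BOM)
    // and falls back to UTF-8 when neither is present.
    static String rootText(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Malformed byte sequences surface here; wrapped so callers
            // of this sketch do not need to handle checked exceptions.
            throw new IllegalArgumentException("document could not be parsed", e);
        }
    }
}
```

The declaration has to match the real encoding of the bytes. If the file is actually UTF-8, declaring ISO-8859-1 lets the parse succeed but decodes each ä (bytes C3 A4) as the two characters Ã¤.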
 
Jon Krogell
Greenhorn
Posts: 16
Unfortunately it isn't that easy in my case, as the file is actually encoded in UTF-8, so for example 'ä' will display as '?'.
"I think that maybe your question would have been answered faster had you posted it in the XML topic area."

Well, maybe. I was considering posting there instead, but decided on this forum, as what I wanted to do (without looking at the big picture) is related not to XML but to I/O.

BTW, I'm getting a java.lang.OutOfMemoryError, so I'm completely stuck now. (The XML file I'm parsing is 37 MB.)
[This message has been edited by Jon Krogell (edited March 12, 2001).]
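
The thread does not say which parser or API produced the OutOfMemoryError. Assuming it comes from building the whole document as an in-memory tree, one common alternative is a streaming SAX parse, which hands over elements one at a time. A minimal sketch with hypothetical names, counting elements as a stand-in for real per-element work:

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ElementCounter {
    // Streams through the document and counts start tags. Nothing from the
    // document is retained between callbacks, so memory use does not grow
    // with the size of the file.
    static int countElements(InputStream xml) {
        int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(xml, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    count[0]++; // replace with the real per-element processing
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("document could not be parsed", e);
        }
        return count[0];
    }
}
```

Raising the JVM heap limit with the -Xmx option is the other usual route when a full tree really is needed.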
 
Mark Savory
Ranch Hand
Posts: 122
In our case, we have a Java String object that is also UTF-8 encoded. If you haven't already, would you please try using the prolog that I mentioned in my previous reply?
 
Jon Krogell
Greenhorn
Posts: 16
"If you haven't already, would you please try using the prolog that I mentioned in my previous reply?"

I have done that, and as I said, an "ä" in the file was displayed as a "?" when printing the root element with System.out. (I didn't do this on the 37 MB file.)
 
Mark Savory
Ranch Hand
Posts: 122
Jon,
It's possible that if you print with System.out to the DOS prompt, the proper font is not available to display your character correctly. If you have Internet Explorer 5 or later, you can open your XML file (with the prolog set properly) and IE will parse and display your document. This is a good tool for checking the validity and well-formedness of any XML document.
 