
character conversion in java

 
nuthan kumar
Ranch Hand
Posts: 47
Hi,
I have a requirement to read data from Oracle tables and store it in SQL Server. For this purpose I have written some Java code, but with it I am not able to move Japanese characters across. The code is below.

FileInputStream fis = new FileInputStream("testload.dat");
InputStreamReader ir = new InputStreamReader(fis, "UNICODE");
BufferedReader inme = new BufferedReader(ir);

String bogus = inme.readLine();

//charset for encoding
Charset charset = Charset.forName("SJIS");
CharsetEncoder encoder = charset.newEncoder();

//charset for decoding
Charset output_charset = Charset.forName("ISO-8859-15");
CharsetDecoder decoder = output_charset.newDecoder();


The file testload.dat is saved in UNICODE format and contains one line: "テクノロジが海外での設計・開発力を拡充-kedar"

The output I am looking for is �e�N�m���W���C�O�������v�E�J�������g�[, but the above code does not give output in the required format.

Please help me
 
Rahul Bhattacharjee
Ranch Hand
Posts: 2308
Making a buffer (specifically a character buffer) out of the bytes of a UTF stream is not a good idea. I would recommend you use the UTF bytes directly, encode them as Japanese-encoded bytes, and store those in the result file.

Put the bytes into a byte buffer, then use the charset decoder and the Japanese encoder to re-encode the decoded bytes into JIS format.
Write those bytes to the output. That should work.
Hope this helps ;-)
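What this seems to describe might be sketched like this; the round trip through ByteBuffer/CharBuffer, the sample text, and the charset names are assumptions, not code from the thread:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

public class ByteRecode {
    // Decode raw bytes with the input charset, then re-encode
    // the resulting characters with the output charset.
    static byte[] recode(byte[] input, String from, String to)
            throws CharacterCodingException {
        CharBuffer chars = Charset.forName(from).newDecoder()
                .decode(ByteBuffer.wrap(input));
        ByteBuffer out = Charset.forName(to).newEncoder().encode(chars);
        byte[] result = new byte[out.remaining()];
        out.get(result);
        return result;
    }

    public static void main(String[] args) throws Exception {
        // UTF-16 bytes in (BOM included by getBytes), SJIS bytes out.
        byte[] utf16 = "テスト".getBytes("UTF-16");
        byte[] sjis = recode(utf16, "UTF-16", "SJIS");
        System.out.println(new String(sjis, "SJIS")); // prints テスト
    }
}
```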
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Nuthan, are you reading from a file, or from an Oracle database? If the data is from a DB, it would probably be best to use JDBC to read it and let the Oracle driver be responsible for decoding the bytes to characters. You would then still need to encode them in SJIS or ISO-8859-15 or whatever you're trying to do for the output, but decoding from whatever charset the DB uses should really be the DB's responsibility, I think. Unless the data is stored in a BLOB or RAW of some sort.

Similarly, when storing the data to SQL Server, you should probably just use JDBC and let the DB driver convert chars to bytes in whatever format the DB uses. Unless you are using some sort of BINARY datatype.

You seem to be using three different encodings here, UNICODE (which is effectively the same as UTF-16), SJIS, and ISO-8859-15. Why are there three? I would think that you would need at most two, one for decoding the input, and one for encoding the output. (And that's only if the two databases don't do this for you.) What role do SJIS and ISO-8859-15 have here? And why is there a file in UNICODE - is it data from the Oracle DB maybe? Is it correct? It's hard for me to say more without a better understanding of what you're trying to do.

[Rahul]: making a buffer (specifically character buffer) out of the bytes from UTF stream of bytes is not a good idea.

Um, why not? Most of the conversion techniques I know of would either use a CharBuffer or a char[] (which I would call a character buffer). What's wrong with that? Alternately you could use new String(byte[] buffer, String charset) to decode, and getBytes(String charset) to encode, but that seems no more efficient to me - probably less efficient. So what are you advocating here, and why?
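For reference, the String-based round trip mentioned above can be sketched like this; the sample text is an assumption:

```java
public class StringRoundTrip {
    public static void main(String[] args) throws Exception {
        String s = "テクノロジ";
        byte[] sjis = s.getBytes("SJIS");        // encode chars -> SJIS bytes
        String back = new String(sjis, "SJIS");  // decode SJIS bytes -> chars
        System.out.println(s.equals(back));      // prints true: the text survives the round trip
    }
}
```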
 
Rahul Bhattacharjee
Ranch Hand
Posts: 2308
FileInputStream fis = new FileInputStream("testload.dat");
InputStreamReader ir = new InputStreamReader(fis, "UNICODE");
BufferedReader inme = new BufferedReader(ir);

String bogus = inme.readLine();


Byte calculations/manipulations are always preferred over character manipulation. If you have access to a byte stream and you know the encoding that was used, then you can make good use of the bytes in any language.
Character manipulation ties you to Java (though it might be fine for transferring data within an application).
So when designing an application (say a server application where one of the design goals is supporting I18N), the bytes are always sent out as UTF-8. In that case any application can make use of the byte stream, since the clients know the encoding of the byte stream (UTF-8).
I have only recently started using Java IO, so please correct me if I am wrong.
 
nuthan kumar
Ranch Hand
Posts: 47
Rahul, Jim,
Thanks a lot for your responses.

I have tried with a byte buffer as Rahul suggested, and still no use.

Yes, you are correct, Jim: I am trying to read data from Oracle and store it in SQL Server, and I am displaying the SQL Server data in a web page. This is the requirement.

As I don't have Oracle database access, I am using a file as the intermediate medium. The team that provided the file tells me they stored it in Unicode. I have tried different approaches, and with the code I pasted in my previous post I am able to see at least some Japanese characters in the web browser.
Please suggest some workaround to overcome this problem.
 
Alan Moore
Ranch Hand
Posts: 262
There is no encoding called "UNICODE". Your database team should have given you an actual encoding name, like "UTF-8" or "UTF-16LE", not "UNICODE".
 
Rahul Bhattacharjee
Ranch Hand
Posts: 2308
Are you getting an UnsupportedEncodingException while creating the InputStreamReader?
Check the javadoc for the Charset class to see the set of supported charsets. Certainly "UNICODE" is not a valid character set name.
I wonder how you are not getting an exception. Are you catching the exception and swallowing it, as below?

try{
..
..
}catch(Exception e){}
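Rather than guessing about names like "UNICODE", you can ask the running JVM directly; the particular names queried here are just examples, and alias support varies by JDK:

```java
import java.nio.charset.Charset;

public class CharsetCheck {
    public static void main(String[] args) {
        // isSupported() answers for both canonical names and aliases.
        System.out.println(Charset.isSupported("UNICODE"));
        System.out.println(Charset.isSupported("UTF-16"));
        // availableCharsets() maps every canonical name the JVM knows
        // to its Charset object.
        System.out.println(Charset.availableCharsets().containsKey("UTF-16"));
    }
}
```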
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
The Sun JDKs apparently consider "UNICODE" to be a synonym for UTF-16. I would consider it a warning sign that the people who made the file may not really know what encoding they're using. Ask if they mean UTF-16 - if they say yes, then great. If they say "what?", then assume they don't really know what they're doing - find someone else who does, or study their code yourself. You might try some other common Unicode-based alternatives instead: UTF-8, UTF-16BE, UTF-16LE.

Another very suspicious thing I notice is the mention of ISO-8859-15 above. What's that for? If you're representing Japanese characters, they can't possibly be encoded (or decoded) correctly using ISO-8859-15. That's an 8-bit encoding scheme; it can't represent more than 256 characters. (Japanese of course has many more than that.) So, what's the purpose of ISO-8859-15 here?
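This is easy to verify with CharsetEncoder.canEncode; the sample katakana string here is an assumption:

```java
import java.nio.charset.Charset;

public class CanEncodeCheck {
    public static void main(String[] args) {
        String japanese = "テクノロジ";
        // An 8-bit Latin charset has no room for katakana...
        System.out.println(Charset.forName("ISO-8859-15")
                .newEncoder().canEncode(japanese)); // prints false
        // ...while a Japanese charset handles it fine.
        System.out.println(Charset.forName("SJIS")
                .newEncoder().canEncode(japanese)); // prints true
    }
}
```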

It sounds like there are four or five different encodings being used here for various purposes. I think you need to simplify things a bit by only worrying about one or two at a time. Forget about what encoding is used in Oracle (which you don't have direct access to anyway, at the moment) or what encoding is used in SQL Server (which maybe you can access, but there will be several more steps involved here). You've got a single file, allegedly written in "UNICODE", and you want to read it and render its contents in HTML somehow. Here's a simple way to try that:
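The code block itself appears to have been lost from the archived page; a minimal sketch of the kind of thing described, where the file names, the UTF-8 choice, and the HTML skeleton are all assumptions:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class MakeTestHtml {
    public static void main(String[] args) throws IOException {
        // Read the data file, assuming UTF-8; swap in other
        // encodings here to experiment.
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("testload.dat"), "UTF-8"));
        String line = in.readLine();
        in.close();
        // Write a standalone HTML file that declares its own encoding,
        // so the browser knows how to interpret the bytes.
        Writer out = new OutputStreamWriter(
                new FileOutputStream("test.html"), "UTF-8");
        out.write("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head><body>\n");
        out.write(line + "\n</body></html>\n");
        out.close();
    }
}
```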

This allows you to create a simple HTML file on your local machine, with no web server or anything else to worry about. You can open it with your browser and see how it looks. If it's no good, the problem is most likely that testload.dat is not really encoded in UTF-8. So try other encodings until something works. Then go back and tell the people who gave you the file what their encoding really is.

[Rahul]: Byte calculations / manupulations are always prefered over character manupulation.

Um, so, I really can't agree with that. It's possible that byte manipulation may be faster, but in a case like this you really need to understand the details of the encoding to work with the bytes reliably. Unless you're only using US-ASCII characters (which are admittedly very common), the process is very error-prone. For things like UTF-8, UTF-16, and SJIS, I would strongly recommend that you not work with the raw bytes yourself - unless you really need to, and know what you are doing. It's much easier to rely on the Java libraries to convert these bytes to characters, using things like Charset, InputStreamReader, OutputStreamWriter, etc.
 