
Unable to read Arabic data

 
Nikhil Bansal
Ranch Hand
Posts: 60
Hi All,

I have a file that contains data in EBCDIC format. The data is in English as well as Arabic. My task is to convert this data to UTF-8.

When I read the data from the input file, I am able to get the corresponding EBCDIC hex values, and by mapping them to the ASCII hex values the conversion works for English.

But the problem is with Arabic. For example, there are hex values like 064E and 064F for Arabic characters. When I write them to the output, I get junk characters such as ?

Please, if anyone has some sample code, post it here. It would be a great help.

Thanks in advance

Nikhil
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
My understanding is that there are many EBCDIC encodings possible, and you would need to know exactly which EBCDIC encoding is used here. For Arabic, a common choice is apparently Cp420. This is supported in Java - on older JDK versions you may need to include the file charsets.jar in your classpath. Try something like this:
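
A minimal sketch of such a conversion, assuming the input really is Cp420 and that the running JDK ships that charset. The class name `EbcdicToUtf8`, the `transcode` helper, and the command-line arguments are hypothetical names chosen for illustration; only `InputStreamReader`, `OutputStreamWriter`, and `Charset.forName` are standard API.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EbcdicToUtf8 {

    // Decodes bytes from 'in' using sourceCharset and re-encodes them as UTF-8 into 'out'.
    static void transcode(InputStream in, Charset sourceCharset, OutputStream out)
            throws IOException {
        Reader reader = new InputStreamReader(in, sourceCharset);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buffer = new char[8192];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush(); // push any buffered characters through the encoder
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java EbcdicToUtf8 <input-file> <output-file>");
            return;
        }
        // Assumption: the data is Cp420. Charset.forName throws
        // UnsupportedCharsetException if this JDK does not provide it.
        Charset ebcdicArabic = Charset.forName("Cp420");
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = new FileOutputStream(args[1])) {
            transcode(in, ebcdicArabic, out);
        }
    }
}
```

Passing the charset as a parameter makes it easy to try a different code page without touching the copy loop.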

If the encoding is something other than Cp420, you may or may not have to find additional encoding support somewhere. You may find this documentation useful.
 
Nikhil Bansal
Ranch Hand
Posts: 60
Hi,

Following is my code. Here I am reading the bytes of data, specifying that they are in cp420 (EBCDIC Arabic) format, and then writing to the output file in UTF-8 format.

However, there seems to be some problem. Some junk characters are written to the file, especially where the hex value contains letters, for example 8D, 8C, etc. If the hex value contains only digits, the output is correct.

What am I doing wrong in the code?

Also I need to insert a carriage return after every bytes of data read.

Please help me, guys.

Nikhil

import java.io.*;

public class ReadBinaryData {

    public static void main(String args[]) {
        try {
            File file = new File("D:\\MYDATA.DATASETS");
            InputStream is = new FileInputStream(file);

            File outfile = new File("D:\\testHexFile.txt");
            FileOutputStream fout = new FileOutputStream(outfile);

            long length = file.length();

            if (length > Integer.MAX_VALUE) {
                System.out.println("File is too large");
                System.exit(0);
            }

            byte[] bytes = new byte[(int) length];

            // Read in the bytes
            int offset = 0;
            int numRead = 0;
            while (offset < bytes.length
                    && (numRead = is.read(bytes, offset, bytes.length - offset)) >= 0) {
                offset += numRead;
            }

            // Ensure all the bytes have been read in
            if (offset < bytes.length) {
                throw new IOException("Could not completely read file " + file.getName());
            }

            // Close the input stream
            is.close();

            // Decode the bytes as cp420, then encode the resulting string as UTF-8
            String s = new String(bytes, "cp420");
            byte[] output = s.getBytes("UTF-8");

            fout.write(output);
            fout.close();
        } catch (Exception e) {
            System.out.println("Exception: " + e.toString());
        }
    } // End of main

} // End of class
 
Jesper de Jong
Java Cowboy
Saloon Keeper
Posts: 15369
Originally posted by Nikhil Bansal:
But the problem is with Arabic. For example, there are hex values like 064E and 064F for Arabic characters. When I write them to the output, I get junk characters such as ?


If you are sure that those codes are the correct Unicode codes for the characters, then the problem is not in the EBCDIC to Unicode conversion.

Of course, you need a font that contains those Unicode characters, otherwise you can't display them. Where is your output going: to a Unicode text file? What software are you using to view the output? Are you using a font that contains the Arabic characters?
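
One way to separate a display problem from a conversion problem is to look at the bytes in the output file rather than at how a viewer renders them. The sketch below prints the UTF-8 byte sequence for the two code points mentioned in the question; `Utf8HexDump` and `utf8Hex` are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8HexDump {

    // Returns the UTF-8 encoding of 'text' as space-separated uppercase hex bytes.
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+064E (ARABIC FATHA) and U+064F (ARABIC DAMMA)
        System.out.println(utf8Hex("\u064E\u064F")); // prints "D9 8E D9 8F"
    }
}
```

If a hex viewer shows these byte pairs in the output file, the conversion produced valid UTF-8 and the remaining suspects are the viewer and the font. If it shows 3F (an ASCII question mark) instead, the substitution happened during conversion.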
 
Nikhil Bansal
Ranch Hand
Posts: 60
Hi Jesper,

I am writing the output to a text file with the encoding specified as UTF-8. I am viewing the file in Notepad and also in Microsoft Word. My OS is Windows XP.

I am viewing the Arabic output with the font Arabic Transparent.

Can you please go through the code and let me know if I am missing something or doing something wrong?

Regards

Nikhil
 
Paul Clapham
Sheriff
Posts: 21142
You have two steps in your little piece of code there. The first reads the bytes and attempts to convert them to chars using the CP420 charset, and the second converts those chars to bytes using the UTF-8 charset and writes them out.

Personally I would have used an InputStreamReader that specified CP420 and an OutputStreamWriter that specified UTF-8 rather than the low-level byte-fiddling that you have there. But that shouldn't matter, because it should end up with the same result.

The problem is that you have "?" appearing in the final result where it should not appear. And this always means an encoding or decoding failure. So, which of the two steps is producing these ? characters?
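
A sketch of one way to find out which step fails: `new String(bytes, charset)` and `String.getBytes(charset)` silently substitute a replacement for anything they cannot convert, whereas a `CharsetDecoder` or `CharsetEncoder` configured with `CodingErrorAction.REPORT` throws instead. The class and method names here (`StrictTranscode`, `decodeStrict`, `encodeStrict`) are hypothetical, and the use of "Cp420" in `main` assumes that charset is available in the running JDK.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictTranscode {

    // Step 1: bytes -> chars. Throws instead of substituting a replacement character.
    static String decodeStrict(byte[] input, Charset charset)
            throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(input))
                .toString();
    }

    // Step 2: chars -> bytes. Throws instead of writing "?" for unmappable characters.
    static byte[] encodeStrict(String text, Charset charset)
            throws CharacterCodingException {
        ByteBuffer encoded = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] out = new byte[encoded.remaining()];
        encoded.get(out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java StrictTranscode <input-file>");
            return;
        }
        byte[] raw = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]));
        // Assumption: the source data is Cp420; swap in another name to test a different guess.
        String text = decodeStrict(raw, Charset.forName("Cp420"));
        byte[] utf8 = encodeStrict(text, StandardCharsets.UTF_8);
        System.out.println("Decoded " + text.length() + " chars, encoded " + utf8.length + " bytes");
    }
}
```

An exception from `decodeStrict` points at the first step, which usually means the bytes are not in the charset that was named. UTF-8 can represent every Unicode character apart from unpaired surrogates, so a failure in the second step is much less likely.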
 
Hesham Gneady
Ranch Hand
Posts: 66
Hello Nikhil,

I've tried your piece of code. You had just one problem: you chose the wrong encoding for Arabic.
Just replace the "cp420" encoding with "Cp1256" and, ISA, it will work fine.

Best regards,
 
Nitesh Kant
Bartender
Posts: 1638
Please DontWakeTheZombies.
 
Hesham Gneady
Ranch Hand
Posts: 66
I've already checked the date of the post, and I know it's old. But I was facing the same problem, so I took some time to find a solution.
I know others will google their way to this post, so I didn't want them to spend as long as I did fixing it.

I just wanted to make the Ranch post better.
 