string length different in UTF-16 vs UTF-8

 
Puspender Tanwar
Ranch Hand
Why does the length of a string vary with the encoding? Here is an example where the same string is used with different encodings.

output:
UTF-8 : 3
UTF-8_array : 3
UTF-16 : 2
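The code itself wasn't preserved when this thread was archived. A minimal reconstruction that produces exactly this output (assuming the string was "abc", i.e. \u0061 \u0062 \u0063, as later posts in the thread indicate) would be:

```java
import java.nio.charset.StandardCharsets;

public class EncodingLength {
    public static void main(String[] args) {
        String s = "abc";                                   // \u0061 \u0062 \u0063
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);  // 3 bytes: 61 62 63

        // Decoding the bytes with the same charset round-trips cleanly:
        System.out.println("UTF-8 : " + new String(bytes, StandardCharsets.UTF_8).length());
        System.out.println("UTF-8_array : " + bytes.length);

        // Decoding UTF-8 bytes *as* UTF-16 pairs them up differently:
        System.out.println("UTF-16 : " + new String(bytes, StandardCharsets.UTF_16).length());
    }
}
```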



 
Campbell Ritchie
Marshal
Start by trying this:-
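(The snippet Campbell attached wasn't preserved in this archive; judging from the reply that follows, it printed the length together with the mis-decoded string, roughly:)

```java
import java.nio.charset.StandardCharsets;

public class PrintDecoded {
    public static void main(String[] args) {
        byte[] bytes = "abc".getBytes(StandardCharsets.UTF_8);
        // Decode UTF-8 bytes with the wrong charset and look at the result:
        String wrong = new String(bytes, StandardCharsets.UTF_16);
        System.out.println(wrong.length() + wrong); // the thread reports this printed 2慢�
    }
}
```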
 
Puspender Tanwar
Ranch Hand

Campbell Ritchie wrote:Start by trying this:-

Yes, that printed 2慢�
But why did it change the value?
I think I need to focus on some encoding blogs.

typo: that printed 慢�, meaning two characters of some other language.
 
Campbell Ritchie
Marshal
And how did you think you would get those characters? Try printing their Unicode values in hex:- Also try enlarging the array so it contains an even number of elements.
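(Campbell's code wasn't preserved either; printing the Unicode values in hex might look like this:)

```java
import java.nio.charset.StandardCharsets;

public class HexChars {
    public static void main(String[] args) {
        byte[] bytes = "abc".getBytes(StandardCharsets.UTF_8);
        String wrong = new String(bytes, StandardCharsets.UTF_16);
        for (char c : wrong.toCharArray()) {
            // %04x prints each char's code point as four hex digits
            System.out.printf("%04x%n", (int) c); // 6162, then fffd
        }
    }
}
```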
 
Saloon Keeper
This should be of interest if you try counting Unicode characters in strings: https://www.ibm.com/developerworks/java/library/j-unicode/
 
Campbell Ritchie
Marshal

Puspender Tanwar wrote:. . . But why it changed the value? . . .

Because you supplied the wrong encoding.
 
Campbell Ritchie
Marshal
Shall I put you out of your misery?
You supplied the wrong encoding, so the constructor takes the first two elements of the array (\u0061 and \u0062) and puts them together to form \u6162, which looks like this:- 慢
Then you had \u0063 as the left half of the next char, which the runtime couldn't decode because UTF-16 needs its bytes in pairs, so it substituted the replacement character \ufffd instead, which looks like this:- �
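The pairing described here can be reproduced by hand; this sketch combines two bytes the same way a big-endian UTF-16 decoder does:

```java
public class PairBytes {
    public static void main(String[] args) {
        byte[] bytes = {0x61, 0x62, 0x63};
        // A big-endian UTF-16 decoder treats the first byte of each pair as
        // the high 8 bits of the char and the second byte as the low 8 bits:
        char combined = (char) (((bytes[0] & 0xFF) << 8) | (bytes[1] & 0xFF));
        System.out.println(Integer.toHexString(combined) + " -> " + combined); // 6162 -> 慢
        // bytes[2] has no partner, so the decoder emits U+FFFD in its place.
    }
}
```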
 
Paul Clapham
Marshal
Let me just put that in a different light. The constructor call new String(bytes, "UTF-16") says

Here's an array of bytes which represent a String encoded in UTF-16; please decode it using UTF-16 and give me the original String back.



So naturally if you pass it an array of bytes which don't represent a String encoded in UTF-16, the results may be surprising.
 
Tim Holloway
Saloon Keeper
Bytes are not characters. It's sloppy to think of them as characters. Traditionally, a "byte" was the smallest distinctly addressable cell in memory, and could vary from as little as 4 or 6 bits (Decimal memory, BCDIC) to about 46 bits (CDC mainframes). Over time it became synonymous on most platforms with "octet", which is to say 8 bits.

Characters vary as well. Old IBM equipment often used 6-bit characters and often didn't include lowercase at all. ASCII is actually 7 bits plus 1 bit allowed for parity (a case of metadata that wasn't always invisible). The original Teletypes operated on a 5-bit Baudot code. And Java's char is, strictly speaking, a 16-bit UTF-16 code unit (Unicode itself outgrew 16 bits long ago).

In addition to the plethora of different "byte" and character sizes, many architectures support string compaction, which can be done in many ways, including the use of signal bits to indicate that the following character uses an extended number of bits or to shift to an alternate set of characters either for the next character or until a counter-shift is encountered.

In theory, a string of characters will always be the same number of characters, regardless of how many bytes they are encoded in, but I18N makes even that a tricky concept. I used to work with a system that imported data from an IBM mainframe and thus required EBCDIC-to-ASCII (Unicode) translation. But along the way, someone on the mainframe side had replaced the old IBM terminals with PCs running mainframe terminal emulators, and the emulators had abilities that were not part of the original IBM specs. So when people keyed in names like "Nuñez", the translation process was giving us "Nun~ez", which gave our database indigestion, since names were ending up taking more space than we'd allowed for, given that mainframes have a fondness for fixed-length strings.
 
Ivan Jozsef Balazs
Rancher
I once worked on a system with 36-bit machine words consisting of 4 9-bit bytes.
 
Tim Holloway
Saloon Keeper

Ivan Jozsef Balazs wrote:I once worked on a system with 36-bit machine words consisting of 4 9-bit bytes.



That wasn't a CDC, was it? It sounds like the PLATO system they had when I was at school. I never worked with it, since it was a time-share from Florida State University and not part of the in-house data center, but they had terminals in the school library.
 
Ivan Jozsef Balazs
Rancher

Tim Holloway wrote:
That wasn't a CDC, was it?



It was a Bull DPSx where x = 8 or so... It was int, long, maybe even long long ago :-)
I programmed it in Fortran, and I only came across the bytes' bit width by accident: it did not really matter at that level.
 
Tim Holloway
Saloon Keeper

Ivan Jozsef Balazs wrote:It was a Bull DPSx where x = 8 or so... I programmed it in Fortran, and I only came across the bytes' bit width by accident: it did not really matter at that level.

Ah. I worked with GCOS on Honeywell's short-lived minicomputer systems. Only system I ever worked with where you could do OS-level programming in COBOL. Also noteworthy because I discovered a comedic chain of events where a bug in the Fortran compiler caused it to send garbage to the linker, which would then crash and send trash to the print spooler, which then locked up the system.

I also worked with Prime minicomputers, and Prime was founded by a group of ex-Honeywell people. That was a 16-bit word machine, so character manipulations meant either wasting 1 byte out of each word (on a 128KB RAM!) or extra logic to pack and retrieve 2 characters per word.

But returning to the original topic of this thread: there are some interesting things in the Wikipedia article on the Bull machine at https://en.wikipedia.org/wiki/General_Comprehensive_Operating_System and while I don't think there's a link straight to the part about storage, look for the section named "GCOS8 Storage Units". As they said, they're "colourful".
 
Puspender Tanwar
Ranch Hand
Thank you so much for your inputs. Your comments explained why I am receiving unexpected characters.

Campbell wrote: so the constructor takes the first two elements of the array (\u0061 and \u0062) and puts them together to form \u6162

How come Java picks only 61 and 62? Is there a concept behind this?

This seems to be a whole world of its own; could you please suggest a starting point for where and what I should learn in order to understand all this?

I ran into a similar issue, where I have some Japanese words in an XML file, from which I need to generate a CSV file and then load it into an Oracle DB using Oracle SQL Loader. Below is the code I am using; with it, some characters are still not coming through to the DB. Although the data is valid up to the CSV generation, the DB is mangling it. I checked the CHARACTERSET of the DB as well; here is the configuration:

Input Sample(in XML): 横ツロ貿並ス提名あ今6燃だじでぼ話境じこか議真報ヤヒキ肉富ハヤリネ成生国気ツテイ載校だ組災ねだは今計ヲヘア行
in CSV :              横ツロ貿並ス提名あ今6燃だじでぼ話境じこか議真報ヤヒキ肉富ハヤリネ成生国気ツテイ載校だ組災ねだは今計ヲヘア行
output (in DB):          横ツロ貿並ス���今6燃����話境���議真報ヤヒキ肉富�ヤリ��生国気ツテイ載校�組����今計ヲヘア行

If the data is valid up to the CSV, this should definitely be a database issue. Please give your views on this.
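(The posted code also wasn't preserved in this archive. Paul notes later in the thread that it used a FileWriter without an explicit encoding; a hedged sketch of the usual fix, with a placeholder filename and a sample row from this post, is to name the charset when writing the CSV:)

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteCsv {
    public static void main(String[] args) throws IOException {
        // new FileWriter("data.csv") would use the platform default encoding,
        // which is often not UTF-8 and silently mangles Japanese text.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("data.csv"), StandardCharsets.UTF_8)) {
            out.write("横ツロ貿並ス提名あ今\n"); // sample text from the post
        }
    }
}
```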

 
Puspender Tanwar
Ranch Hand
This issue could be because of the CSV as well. I noticed a weird thing here. The Japanese data that comes into the CSV (from the XML) doesn't seem to be valid if I look at it directly: some weird characters get printed. But when I copy that weird Japanese data and paste it back into the same file, the complete valid data comes up. I cannot paste the invalid data here, because copying and pasting converts it into valid data. Attaching a snapshot.

The invalid record in the DB is because of this behaviour.
csv.PNG
[Thumbnail for csv.PNG]
At line 1, the data was generated by the Java program. At line 2 is the data I copied from line 1, which turned into valid data.
 
Tim Holloway
Saloon Keeper

Puspender Tanwar wrote:Thank you so much for your inputs. Your comments explained why I am receiving unexpected characters.

Campbell wrote: so the constructor takes the first two elements of the array (\u0061 and \u0062) and puts them together to form \u6162

How come Java picks only 61 and 62? Is there a concept behind this?



Because you're feeding a byte array as raw data into a decoder for 2-byte characters. It's going to combine pairs of bytes, and it doesn't care that each byte is in reality a single ASCII character.

Remember what I said. A byte is not the same thing as a character!

Also, you cannot look at the raw characters themselves - at least not unless you have comic-book superpowers. So what you'll actually see depends on what program is making them visible, whether it's a web browser, an IDE, a desktop command shell window - or even Windows Notepad. And what displays will depend on two things:

1) Which character-set encoding the display program thinks that your text is encoded under.
2) What font (character glyphs) the display program is going to employ to render the text.

Very often the encoding for a command line will not be the same as what your web browser is using. More often than not, both of them will be using encodings and fonts that include the US-ASCII character set as a subset, but the remainder of the rendering process might very well be as different as Kannada and Cyrillic.
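As a quick diagnostic along these lines (not from the original post), you can ask the JVM which encoding it will assume when none is specified:

```java
import java.nio.charset.Charset;

public class ShowDefaults {
    public static void main(String[] args) {
        // The charset used by FileWriter, System.out, etc. when none is given:
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding  : " + System.getProperty("file.encoding"));
    }
}
```

If this prints something other than UTF-8, any writer or stream created without an explicit charset will use that encoding.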
 
Paul Clapham
Marshal

Puspender Tanwar wrote:If the data is valid till CSV, this should definitely be the Database issue. Please give your views of this.



My view is that when you've got data bouncing around like that, you should test each of the data transformations separately. Don't say "I did this and then that and then some more things and my encoding is screwed up". Test each step separately.

One big red flag in your posted code: you've got a FileWriter and you don't specify its encoding.
 
Paul Clapham
Marshal
Oh, and when you test a data transformation, make sure that what you're using for testing isn't the culprit for screwing up your encoding. Text editors, for example, usually can't detect the encoding of a text file and often can't be told what it is. Sometimes you need to fall back on a hex editor to see what's actually in a file.
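If no hex editor is handy, a few lines of Java will dump a file's raw bytes (a generic sketch; the filename is a placeholder):

```java
import java.nio.file.Files;
import java.nio.file.Paths;

public class HexDump {
    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get("data.csv"));
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            // mask to 0..255 so negative bytes print as two hex digits
            sb.append(String.format("%02x ", b & 0xFF));
        }
        System.out.println(sb);
    }
}
```

A UTF-8 file containing Japanese text should show three-byte sequences starting with e3..e9; if you see different bytes, the file was written with another encoding.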

Tutorials: start here. And if you look at the left-hand side of that page you'll see a list of more tutorials about related topics. Starting at "Unicode" might be an idea since it looks like you aren't clear on that yet.
 