What does new String("中文".getBytes(), "UTF-8") do exactly?

 
Ranch Hand
Posts: 257
My JVM's file.encoding system property is Cp1252.
I see the following code (with some modification):


What is the above code doing exactly?

Is this the same if I change the system property file.encoding=UTF-8 and then

 
Java Cowboy
Posts: 16084
To understand what this does, you have to know a little bit about how computers deal with text and what character encoding is.

Computers ultimately store everything in their memory as ones and zeros, or, at a slightly higher level of abstraction, as numbers. In the memory of a computer, text is also represented as numbers. Of course, to do this, you have to have an agreement about what number means what character. That agreement is a character encoding. For example, take ASCII, which is one of the oldest character encodings. In ASCII, the number 65 means the letter 'A', 66 means 'B', 67 means 'C' etc.
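To see those numbers for yourself, here is a minimal sketch. The class name AsciiDemo and the helper toAscii are made up for this illustration; getBytes(Charset) and StandardCharsets.US_ASCII are standard JDK API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiDemo {
    // Hypothetical helper: encodes text as ASCII, one byte per character,
    // where each byte holds the number agreed on for that character.
    static byte[] toAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // prints [65, 66, 67]
        System.out.println(Arrays.toString(toAscii("ABC")));
    }
}
```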

ASCII is very limited - it uses 7 bits per character, so it can represent only 128 different characters. Over the years, people have invented many other character encodings besides ASCII, to be able to represent more than just the standard Latin alphabet plus a few extra characters.

One of the most used character encodings nowadays is UTF-8, which is a specific encoding for characters from the Unicode character set. Note that UTF-8 is a variable-length encoding; each character takes up between 1 and 4 bytes when encoded with UTF-8.
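A small sketch of the variable length, assuming nothing beyond the standard JDK. The class name Utf8LengthDemo and the helper utf8Length are invented for this example, and the sample characters are written as Unicode escapes so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Hypothetical helper: number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // A (U+0041), e with acute accent (U+00E9), a Chinese character (U+4E2D),
        // and an emoji outside the Basic Multilingual Plane (U+1F600).
        String[] samples = { "A", "\u00e9", "\u4e2d", "\ud83d\ude00" };
        for (String s : samples) {
            System.out.printf("U+%04X takes %d byte(s) in UTF-8%n",
                    s.codePointAt(0), utf8Length(s));
        }
    }
}
```

The four samples come out at 1, 2, 3 and 4 bytes respectively.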

Now, let's look at your line of code. To start, I'll tell you that this line of code is most likely wrong, and you'll see why.

It is doing two things:

1. "中文字".getBytes() - This takes the string "中文字" and returns an array of bytes that represent the string encoded with the default character encoding of the system. You said that the default encoding of your system is Cp1252 (a Microsoft Windows-specific encoding).

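2. new String(bytes, "UTF-8") - This takes the bytes and creates a new String out of them, decoding the bytes using the UTF-8 character encoding. This is wrong, because the bytes were encoded using Cp1252 and not UTF-8, as we saw in the first step. You will get a string that is likely to contain wrong characters. You will not get an exception for bytes that do not form a valid UTF-8 sequence, though: in practice this constructor quietly substitutes the replacement character U+FFFD for them, and the only checked exception it declares, UnsupportedEncodingException, is for an unknown encoding name. In this particular case the damage is already done in step 1, because Cp1252 has no Chinese characters and getBytes() in practice turns each of them into a '?'.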

So, summarizing what this does:

1. Take a string and convert it to bytes using the default character encoding (Cp1252)
2. Convert those bytes back to a string, telling the computer that the bytes were encoded with UTF-8 - which is wrong, because in step 1 you encoded them with Cp1252 instead of UTF-8
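
The two steps can be tried out with a sketch like the following. It is not the original code from the question: it passes the encodings explicitly (windows-1252 is the JDK's name for Cp1252) instead of relying on the default, so that it behaves the same on every machine. The class name MismatchedCharsetDemo and the helper roundTrip are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MismatchedCharsetDemo {
    // Hypothetical helper: step 1 encodes the text to bytes with one charset,
    // step 2 decodes those bytes back to a String with another charset.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String chinese = "\u4e2d\u6587\u5b57"; // the three characters from the question

        // Cp1252 cannot represent these characters, so each becomes the byte for '?'.
        String broken = roundTrip(chinese, cp1252, StandardCharsets.UTF_8);
        System.out.println(broken.equals("???"));  // true

        // Encoding and decoding with the same charset gives the original back.
        String intact = roundTrip(chinese, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(intact.equals(chinese)); // true

        // A character that Cp1252 does have (U+00E9) is encoded as a single byte
        // that is not valid UTF-8 on its own, so decoding replaces it with U+FFFD.
        String cafe = roundTrip("caf\u00e9", cp1252, StandardCharsets.UTF_8);
        System.out.println(cafe.indexOf('\uFFFD') >= 0); // true
    }
}
```

Note that nothing here throws an exception; the text is just silently damaged.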

Just saying that the bytes are UTF-8 doesn't make them UTF-8. It's as if I write down a sentence in Dutch and then tell my English friend, "Can you read this? It's written in English".

I don't know what the intention was of the person who wrote this code, but (s)he probably did not understand character encoding very well and didn't really know what (s)he was doing. It looks like a case of cargo cult programming, where the programmer just copied and pasted a "magic formula" without understanding.

Probably (s)he wanted to store the text as UTF-8 in the database. The best way to do that is by configuring the database to store strings using UTF-8 (this has nothing to do with the Java code) and then just do pstmt.setString(1, "中文字"); without the unnecessary manual (and wrong) conversion. You do not need to set the default character encoding to UTF-8 for this.
 
Bartender
Posts: 4633
Excellent reply, Jesper, thanks
 
peter tong
Ranch Hand
Posts: 257
if each character takes up between 1 and 4 bytes in UTF-8, then if the field in database is defined as varchar2(100 bytes), then


x should be less than or equal to 25 characters? (25 * 4 bytes = 100 bytes)
 
Jesper de Jong
Java Cowboy
Posts: 16084
It depends. If the string x contains only characters that take up one byte each, then it can be up to 100 characters in length. But if the string contains characters that are 4 bytes when encoded with UTF-8, then you can fit at most 25 of those characters into that database field.

UTF-8 is a variable length encoding. Some characters (such as the letters of the Latin alphabet) take up only 1 byte per character. But other characters may take up 2, 3 or 4 bytes each.
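
If you want to check up front whether a particular string fits, you can count its UTF-8 bytes. This is only a sketch, assuming the database really stores the column as UTF-8 and measures the limit in bytes; the class name ColumnFitDemo and the helpers fitsInBytes and repeat are invented for this example.

```java
import java.nio.charset.StandardCharsets;

class ColumnFitDemo {
    // Hypothetical helper: true if the string needs at most maxBytes bytes in UTF-8.
    static boolean fitsInBytes(String s, int maxBytes) {
        return s.getBytes(StandardCharsets.UTF_8).length <= maxBytes;
    }

    // Hypothetical helper: builds a string consisting of 'times' copies of 'unit'.
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < times; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String oneByte = "a";             // 1 byte in UTF-8
        String threeBytes = "\u4e2d";     // 3 bytes in UTF-8
        String fourBytes = "\ud83d\ude00"; // 4 bytes in UTF-8 (U+1F600)

        System.out.println(fitsInBytes(repeat(oneByte, 100), 100));   // true:  100 bytes
        System.out.println(fitsInBytes(repeat(threeBytes, 33), 100)); // true:   99 bytes
        System.out.println(fitsInBytes(repeat(threeBytes, 34), 100)); // false: 102 bytes
        System.out.println(fitsInBytes(repeat(fourBytes, 25), 100));  // true:  100 bytes
        System.out.println(fitsInBytes(repeat(fourBytes, 26), 100));  // false: 104 bytes
    }
}
```

So 25 is the safe limit only in the worst case; for Chinese text, which is typically 3 bytes per character, 33 characters fit.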
 
Saloon Keeper
Posts: 13280
You should also keep in mind that 'character' isn't a well defined concept.

The characters that Jesper is referring to are actually called 'code points'. Most code points correspond to a visual character on your screen, but some code points actually modify other code points, to make a combined character. That means that some visual characters are made up of multiple code points, which in turn can be made up of a variable number of bytes, depending on the encoding used.
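
As a rough illustration, here is the same visual character written in two ways: once as a single code point and once as a base letter followed by a combining accent. The class name CodePointDemo and its helpers are invented for this sketch; codePointCount and java.text.Normalizer are standard JDK API.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class CodePointDemo {
    // Hypothetical helper: number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical helper: number of bytes the string occupies in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String composed = "\u00e9";  // e with acute accent as one code point
        String combined = "e\u0301"; // plain e followed by a combining acute accent

        // Both are displayed as the same single character, but they differ in size.
        System.out.println(codePoints(composed) + " code point(s), "
                + utf8Bytes(composed) + " byte(s)"); // 1 code point(s), 2 byte(s)
        System.out.println(codePoints(combined) + " code point(s), "
                + utf8Bytes(combined) + " byte(s)"); // 2 code point(s), 3 byte(s)

        // The strings are not equal until they are normalized to the same form.
        System.out.println(composed.equals(combined)); // false
        System.out.println(composed.equals(
                Normalizer.normalize(combined, Normalizer.Form.NFC))); // true
    }
}
```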

For a detailed but easily readable article on the subject, check out this site: http://utf8everywhere.org/
 