
Problem with Chinese (PRC) words sorting in Java

 
Serge Adzinets
Ranch Hand
Posts: 166
Hi,
For example, the right order of Chinese words is:
巴西
代码
黑屏
能力
问题
The application sorts words with:
Collections.sort(words, Collator.getInstance(new Locale("zh", "CN")));
The result is:
代码
巴西
能力
问题
黑屏

Does anybody know how to resolve this issue?
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
I think the problem is that you're representing
巴西
代码
黑屏
能力
问题
as
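&amp;#24052;&amp;#35199;
&amp;#20195;&amp;#30721;
&amp;#40657;&amp;#23631;
&amp;#33021;&amp;#21147;
&amp;#38382;&amp;#39064;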
That's fine for displaying these chars as HTML. But the Collator looks at these and sees five strings of 16 characters each, none of them Chinese, and sorts them accordingly:
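&amp;#20195;&amp;#30721;
&amp;#24052;&amp;#35199;
&amp;#33021;&amp;#21147;
&amp;#38382;&amp;#39064;
&amp;#40657;&amp;#23631;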
which renders as
代码
巴西
能力
问题
黑屏
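To see the difference concretely, here is a minimal sketch. The class name EntitySortDemo and its helper are made up for illustration and are not from the original application. It assumes the words reach the Collator as decimal character references, and shows that each such string is 16 chars long and that sorting them compares ASCII digits rather than Chinese characters.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class EntitySortDemo {
    // The five words written as decimal character references (assumed input form).
    static final List<String> ENTITY_WORDS = Arrays.asList(
        "&#24052;&#35199;",   // ba xi
        "&#20195;&#30721;",   // dai ma
        "&#40657;&#23631;",   // hei ping
        "&#33021;&#21147;",   // neng li
        "&#38382;&#39064;");  // wen ti

    /** Returns a copy of the list sorted with the zh_CN collator, as in the original post. */
    static List<String> sortedCopy(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy, Collator.getInstance(new Locale("zh", "CN")));
        return copy;
    }

    public static void main(String[] args) {
        System.out.println("&#24052;&#35199;".length()); // 16: plain ASCII characters
        System.out.println("\u5DF4\u897F".length());     // 2: the real word ba xi

        // The collator only sees digits and punctuation, so the order follows
        // the code point numbers: 20195, 24052, 33021, 38382, 40657.
        for (String s : sortedCopy(ENTITY_WORDS)) {
            System.out.println(s);
        }
    }
}
```

The order printed by the loop matches the "wrong" result from the first post, which is what you'd expect if the strings being sorted are entities.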
To get the sorting you want, you need to sort Strings that contain the actual Chinese characters as Unicode, not as HTML character entities. (The entities also refer to Unicode code points, but with an extra level of indirection.) Your example should consist of five strings of only two characters each. How do you get this? Well, it depends on your data. If it's all in HTML entities when you first see it, you're going to have to convert it back; an HTML or XML parser can do this for you. If your data was in Unicode to begin with, that will be easier: defer converting it to HTML entities until after you've sorted it.
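As a rough sketch of the "convert it back, then sort" route: the code below uses a simple regular expression in place of a real HTML or XML parser, so it only handles decimal references like &amp;#24052; (no named or hexadecimal entities). The class DecodeThenSort and its methods are hypothetical names, not part of any library.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DecodeThenSort {
    // Matches decimal numeric character references such as &#24052;
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d+);");

    /** Hypothetical helper: replaces each decimal reference with the character it names. */
    static String decode(String html) {
        Matcher m = DECIMAL_REF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Throws if the number is not a valid Unicode code point; a parser would be more forgiving.
            String ch = new String(Character.toChars(Integer.parseInt(m.group(1))));
            m.appendReplacement(out, Matcher.quoteReplacement(ch));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Decodes every word first, then sorts the real strings with the zh_CN collator. */
    static List<String> decodeAndSort(List<String> encoded) {
        List<String> words = new ArrayList<>();
        for (String s : encoded) {
            words.add(decode(s));
        }
        Collections.sort(words, Collator.getInstance(Locale.CHINA));
        return words;
    }

    public static void main(String[] args) {
        List<String> encoded = Arrays.asList(
            "&#20195;&#30721;", "&#24052;&#35199;", "&#33021;&#21147;",
            "&#38382;&#39064;", "&#40657;&#23631;");
        // The exact order depends on the collation rules shipped with the JDK in use.
        for (String word : decodeAndSort(encoded)) {
            System.out.println(word + " (" + word.length() + " chars)");
        }
    }
}
```

The point is only the ordering of the steps: the Collator gets two-character Chinese strings, and any HTML escaping happens afterwards when the page is written out.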
Hope that helps...
[ February 03, 2004: Message edited by: Jim Yingst ]
 
Serge Adzinets
Ranch Hand
Posts: 166
Hi, Jim,
I found the reason recently: the problem was the JDK version. It appears in JDK 1.4.1 but disappears in 1.4.2.
Now, a few words about the data flow in our system. First, the user enters data in an HTML form. Then the form is submitted, and a servlet extracts the data and calls an EJB, which stores it in the database. To present the data to the user, it is read from the database and sorted in Java.
I don't think we use any extra conversion for Unicode characters. What we've done to support i18n is set the request and response charset to UTF-8 and set up the database to support Unicode characters.
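For anyone wondering why the charset setting matters for sorting, here is a small standalone sketch. It does not use the servlet API and is not our actual code; CharsetDemo and its methods are made-up names. It assumes the browser submits the form as UTF-8 bytes and compares decoding those bytes as UTF-8 with decoding them as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    /** Simulates the bytes submitted for the given text by a form posted as UTF-8 (assumption). */
    static byte[] submittedBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes the raw bytes with the given charset, as a server would when reading a parameter. */
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String word = "\u5DF4\u897F"; // ba xi, two characters
        byte[] bytes = submittedBytes(word);

        String right = decodeAs(bytes, StandardCharsets.UTF_8);
        String wrong = decodeAs(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(bytes.length);       // 6: three bytes per character in UTF-8
        System.out.println(right.equals(word)); // true
        System.out.println(wrong.length());     // 6: one Latin-1 char per byte, no Chinese left
    }
}
```

If the bytes are decoded with the wrong charset, the Collator never sees Chinese characters at all, so the request charset has to be right before sorting can be.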
Thanks for your help,
 