
Java Internationalization

 
Kodo Tan
Ranch Hand
Posts: 105
Hi all,
I was using some of the Java internationalization packages and found that the word count returned by the method I wrote depends on whether there is white space in my Unicode string.
Basically, my program is as follows:
import java.text.BreakIterator;
import java.util.Locale;

public class ChineseWordLength {
    // Counts the segments between word boundaries that start with a letter or digit.
    public static int countWords(String source, BreakIterator bi) {
        int count = 0;
        bi.setText(source);
        int start = bi.first();
        int end = bi.next();
        while (end != BreakIterator.DONE) {
            String word = source.substring(start, end);
            // Skip segments that are only white space or punctuation.
            if (Character.isLetterOrDigit(word.charAt(0))) {
                ++count;
                System.out.println(word);
            }
            start = end;
            end = bi.next();
        }
        return count;
    }

    public static void main(String[] args) {
        String str = "\u9700 \u8981 \u5132 \u5b58 \u7d00 \u9304";
        BreakIterator wi = BreakIterator.getWordInstance(Locale.CHINESE);
        System.out.println("No of words: " + countWords(str, wi));
    }
}

When the string is "\u9700 \u8981 \u5132 \u5b58 \u7d00 \u9304", the program counts 6 words. But when the string is "\u9700\u8981\u5132\u5b58\u7d00\u9304" (without white space), it counts only 1 word.
I thought the Java internationalization package would find the word boundaries automatically, without relying on white space?
 
Thomas Paul
mister krabs
Ranch Hand
Posts: 13974
I believe the reason has to do with the way BreakIterator was meant to be used. It is intended to help people writing word-processing logic skip to the next character, word, sentence, etc. For the character instance and the word instance to have distinct meanings for Chinese characters, the word instance looks for white space. The word instance was designed for double-click selection, which requires everything between white spaces to be selected. This behavior was reported as a bug for the Katakana character set, and Sun rejected the report on the grounds that it is the correct behavior of BreakIterator.
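
To see the difference between the two instances, something like the following rough sketch might help. It runs the unspaced string from the question through both BreakIterator.getCharacterInstance and BreakIterator.getWordInstance. The class name BreakIteratorDemo and the helper countSegments are made up for illustration, and the word-instance result may vary between JDK versions, so treat it as an experiment rather than a definitive answer.

```java
import java.text.BreakIterator;
import java.util.Locale;

class BreakIteratorDemo {
    // Hypothetical helper: counts the segments reported by the iterator
    // that begin with a letter or digit (white space and punctuation are skipped).
    static int countSegments(String text, BreakIterator bi) {
        bi.setText(text);
        int count = 0;
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Same six ideographs as in the question, with no white space.
        String text = "\u9700\u8981\u5132\u5b58\u7d00\u9304";

        // The character instance stops after every user-visible character,
        // so it does not depend on white space.
        int chars = countSegments(text, BreakIterator.getCharacterInstance(Locale.CHINESE));

        // The word instance decides on its own where a "word" ends;
        // the result here depends on the JDK's break rules.
        int words = countSegments(text, BreakIterator.getWordInstance(Locale.CHINESE));

        System.out.println("Character segments: " + chars);
        System.out.println("Word segments: " + words);
    }
}
```

If what you really need is one count per ideograph, the character instance is probably the safer choice. Proper Chinese word segmentation generally needs a dictionary, which is beyond what this sketch attempts.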
 