First off, you need to be sure you know what format you're using here. You say it's \uXXXX with four digits, but your example consistently uses three digits. So I know it's not the same format as a Java Unicode escape, which always uses four, but I'm not sure what format it really is. Are there always three chars in the sequence? Are they treated as hexadecimal? Is there an escape sequence for a plain \, e.g. \\?
Here's some code which assumes that a valid \u escape is followed by exactly 3 hexadecimal characters, and that \\ is an escape for \:
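A minimal sketch along those lines, under the two assumptions just stated (exactly three hex digits after \u, and \\ standing for a single \). The class name EscapeDecoder and the method decode are hypothetical names picked for illustration, and if your format turns out to differ, the pattern will need adjusting. It finds each escape with java.util.regex and rebuilds the string by hand, copying everything else through unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeDecoder {
    // Regex: one backslash followed by either the letter u plus exactly
    // three hex digits (group 1), or a second backslash (group 2).
    // Assumption: this is the whole escape grammar of the input format.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\(?:u([0-9a-fA-F]{3})|(\\\\))");

    public static String decode(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = ESCAPE.matcher(input);
        int last = 0;
        while (m.find()) {
            // Copy the plain text between the previous escape and this one.
            out.append(input, last, m.start());
            if (m.group(1) != null) {
                // Three hex digits: convert to the character with that code.
                out.append((char) Integer.parseInt(m.group(1), 16));
            } else {
                // Doubled backslash: emit a single backslash.
                out.append('\\');
            }
            last = m.end();
        }
        // Copy whatever follows the final escape.
        out.append(input.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        // The input holds two escapes for hex 048 and hex 069.
        System.out.println(decode("\\u048\\u069")); // prints Hi
    }
}
```

Anything that doesn't match the pattern, such as a \u followed by only two hex digits, is left in the output untouched, which seemed the safest default while the format is still uncertain.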
If you're not familiar with regular expressions, now's a good time to learn. The standard reference is Mastering Regular Expressions by Jeffrey Friedl; you may also want to check out Real World Regular Expressions with Java 1.4 by our own Max Habibi when it's released (soon I imagine). Or just study the java.util.regex API very carefully; that worked pretty well for me until I finally got around to reading Friedl.
Note that all the multiple \\ sequences can be confusing: javac treats \ as an escape character, and so does the regex package, and now so does the format you're parsing. So to match an escape sequence of \\ in the HTML, the regex engine needs to see \\\\, which means javac needs to see a String literal with \\\\\\\\. Confusing at first, but it works.
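If you want to see those layers for yourself, here's a small sketch that counts characters at each level. The class name BackslashLayers and the helper containsEscapedBackslash are hypothetical names for illustration only.

```java
import java.util.regex.Pattern;

public class BackslashLayers {
    // Eight backslashes in the source. javac reduces them to four
    // characters, and the regex engine reads those four characters
    // as "match two literal backslashes in a row".
    static final String REGEX = "\\\\\\\\";

    static boolean containsEscapedBackslash(String text) {
        return Pattern.compile(REGEX).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(REGEX.length());                      // prints 4
        // Text is: a, backslash, backslash, b
        System.out.println(containsEscapedBackslash("a\\\\b"));  // prints true
        // Text is: a, backslash, b
        System.out.println(containsEscapedBackslash("a\\b"));    // prints false
    }
}
```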
[ December 28, 2003: Message edited by: Jim Yingst ]