
Wanted: Help converting UTF-8 to XML entities

 
Siegfried Heintze
Ranch Hand
Posts: 408
I want to write a Groovy script (preferably a one-liner!) that reads an input file on stdin, writes to stdout, and converts UTF-8 to XML character references in the &#dddddd; format, optionally performing the reverse operation too.

Support for UTF-16 and UTF-32 in addition to UTF-8 would be nice too.

Is this really a groovy scripting question or a "which JVM library do I use" question?

Thanks,
Siegfried
[ August 25, 2008: Message edited by: Siegfried Heintze ]
 
Gregg Bolinger
Ranch Hand
Posts: 15304
I want to write a groovy script (preferably a one-liner!)

I can write 10,000 lines of Java code on one line. That doesn't mean it's better. So let's worry about line counts when it matters, which is rarely.

Is this really a groovy scripting question or a "which JVM library do I use" question?

Might not be either. Are you asking for help or for someone to do this for you? If it's the former, what have you tried so far? What specifically is giving you problems?
 
Paul Clapham
Sheriff
Posts: 21416
Do you want everything converted to character entities, or only those characters outside of US-ASCII? The former seems improbable; the latter can be achieved by doing an identity XSL transformation and forcing the output encoding to be US-ASCII.

Not sure how you would do this in Groovy though.
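
For what it's worth, here is a minimal sketch of the identity-transform approach in Groovy, using only the JDK's javax.xml.transform API. The helper name toAsciiXml is made up for illustration, the input must be well-formed XML, and the exact form of the references (decimal or hexadecimal) is up to the JDK's serializer:

```groovy
import javax.xml.transform.OutputKeys
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource

// Hypothetical helper: copies an XML document unchanged but serializes it
// as US-ASCII, so characters outside that range come out as references.
String toAsciiXml(String xml) {
    // newTransformer() with no stylesheet gives an identity transformation
    def transformer = TransformerFactory.newInstance().newTransformer()
    transformer.setOutputProperty(OutputKeys.ENCODING, 'US-ASCII')
    def bytes = new ByteArrayOutputStream()
    transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(bytes))
    new String(bytes.toByteArray(), 'US-ASCII')
}

println toAsciiXml('<name>M\u00fcller</name>')
```

For a stdin-to-stdout filter, new StreamSource(System.in) and new StreamResult(System.out) could replace the reader and the byte buffer.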
 
Matthew Taylor
Rancher
Posts: 110
I may be able to help you if you give examples. What do you mean by UTF-8 to XML?
 
Siegfried Heintze
Ranch Hand
Posts: 408
OK, here are some more specific questions:

(1) Is there a library function that will read UTF-8 (or UTF-16) from a disk file and convert each character, including the multi-byte sequences, into a single value (perhaps an integer, a Java char, or a Java String element, although the latter two would not work so well for UTF-32)? Ideally this function would read an entire record or file into a string, I think.

(2) How would I implement a regular expression search and replace that finds all the characters between 0x7f and 0xffffff, casts each one to an integer, uses toString() to get its representation in ASCII digits '0'-'9', and (finally) prepends "&#" and appends ";"?

(3) How would I write a script to perform the inverse operation: read an ASCII XML document, search for all the patterns "&#([0-9]+);", replace each match with the single wide character that the first group names, and write it to a UTF-8 or UTF-16 stream? Is there a library function for writing UTF-8 or UTF-16 representations of my wide character strings? Is there a library function to write UTF-32? I wonder what it would accept (an integer array, perhaps?).

Thanks!
Siegfried
 
Matthew Taylor
Rancher
Posts: 110
Here is an example of how to read a file directly into text:
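
The snippet posted here did not survive. A minimal sketch of the idea follows, assuming a made-up temporary file (written first so the example runs on its own) and passing the charset explicitly rather than relying on the platform default, so that multi-byte UTF-8 sequences decode to the right characters:

```groovy
// Hypothetical input file, created here so the example is self-contained
def file = File.createTempFile('input', '.txt')
file.deleteOnExit()
file.setText('caf\u00e9 123', 'UTF-8')

// Read the whole file into a String, decoding the bytes as UTF-8
String text = file.getText('UTF-8')
println text.length()
```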



And here is how to replace a regex in a string (see here):
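
This snippet was lost as well. Going by the description later in the thread (a closure that wraps every run of digits in "pre-" and "-post"), it was probably something along these lines; the sample string is invented:

```groovy
String input = 'abc 123 def 45'

// The closure receives each match and returns the text that replaces it
String result = input.replaceAll(/\d+/) { val -> "pre-${val}-post" }
println result
```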



So you can combine these (in one line) like this:
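
The combined one-liner is missing too. Here is a sketch of how the two pieces might chain together, again with a made-up temporary file so it runs standalone:

```groovy
// Hypothetical input file, created here so the example is self-contained
def file = File.createTempFile('input', '.txt')
file.deleteOnExit()
file.setText('id 42', 'UTF-8')

// Read and replace in a single expression
String combined = file.getText('UTF-8').replaceAll(/\d+/) { "pre-${it}-post" }
println combined
```

For the stdin-to-stdout filter from the first post, System.in.getText('UTF-8') could take the place of the file read.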



Of course, I have left the real regex work for you. This will only find simple numbers, not hexadecimal numbers.

[ August 30, 2008: Message edited by: Matthew Taylor ]
 
Siegfried Heintze
Ranch Hand
Posts: 408
Thanks! That is close.

How do I search for all the characters whose ordinal values are greater than 128 and replace them with the digit sequence from toString?

The above code searches for characters whose ordinal values are between 48 and 57, which is not quite what I want.

Thanks!
Siegfried
 
Matthew Taylor
Rancher
Posts: 110
Originally posted by Siegfried Heintze:
How do I search for all the characters whose ordinal values are greater than 128 ...


Like I said, I'll leave the regex to you. You need a regular expression (the first parameter to the Groovy replaceAll() method on String) that matches the hexadecimal values you want. This isn't a Groovy question, but a regex question, and I'm no regexpert (sorry about the bad pun).
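
A minimal sketch of one such regular expression, assuming the goal is to replace every character outside US-ASCII with a decimal reference. Rather than matching hexadecimal text, it matches the characters themselves with a negated class. Java regexes work on code points, so a character beyond the Basic Multilingual Plane reaches the closure as a single two-char match. The function name is invented for illustration:

```groovy
// Hypothetical helper: non-ASCII characters become &#NNN; references
String toEntities(String text) {
    text.replaceAll(/[^\x00-\x7F]/) { String match ->
        // codePointAt(0) also handles a surrogate pair as one value
        "&#${match.codePointAt(0)};"
    }
}

println toEntities('na\u00efve \u20ac5')   // prints na&#239;ve &#8364;5
```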

Originally posted by Siegfried Heintze:
... and replace them with the digit sequence from toString?

The above code searches for characters whose ordinal values are between 48 and 57, which is not quite what I want.


In the 'replaceAll()' method of the Groovy String, you specify within the closure what to do with the matched value. So when I have val = "pre-${val}-post" in the closure for replaceAll, that means I'll be taking every match, and adding "pre-" to the front and "-post" to the back of it before putting it back into the original String. There is no messing with toString() anywhere.

Maybe I should clarify that the above code assumes the input string has hexadecimal values representing UTF characters, not the characters themselves. So all this regex replacing assumes that you're looking for the actual '0xFFFFFF' type value in the input string and replacing it with something else like '&#xFFFFFF;'.

Also, the regex in my code above matches any string of digits. I don't know where you are getting '48 and 57'.
 
Marc Peabody
pie sneak
Sheriff
Posts: 4727
Code with test data:
String test = (('!'..140) as Character[]).join()
test = test.collect{ (it>128)?"&#${it as Integer};":it }.join()

Result: "!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€&#129;&#130;&#131;&#132;&#133;&#134;&#135;&#136;&#137;&#138;&#139;&#140;"

*blows smoke from gun*
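
The reverse operation asked about earlier in the thread can be sketched in a similar way. This assumes only decimal references of the form &#NNN; need decoding (hexadecimal references and named entities such as &amp; are left alone) and that the numbers are valid code points; the function name and sample text are made up:

```groovy
// Hypothetical helper: turns &#NNN; references back into characters
String fromEntities(String text) {
    text.replaceAll(/&#(\d+);/) { String all, String digits ->
        // toChars yields a surrogate pair for code points above 0xFFFF
        new String(Character.toChars(Integer.parseInt(digits)))
    }
}

String decoded = fromEntities('na&#239;ve &#8364;5')

// The JDK can encode the resulting String in any of the three encodings
byte[] utf8  = decoded.getBytes('UTF-8')
byte[] utf16 = decoded.getBytes('UTF-16')   // starts with a byte-order mark
byte[] utf32 = decoded.getBytes('UTF-32')
println "${utf8.length} ${utf16.length} ${utf32.length}"
```

So no integer array is needed for UTF-32: the String holds the characters and the charset name picks the byte layout on output.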
[ September 03, 2008: Message edited by: Marc Peabody ]
 