Translate between Character Sets?

 
Mike London
I have a client who wants to translate between two character encodings.

(https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html)

She has given me a before and after of what the text should look like, and said "Latin" and "Hebrew", but I haven't been able to get any of the encodings above to work.

Here's the "Before" text: Abraham Moses Krikêr

Here's what I should see after translation: Abraham Moses קריקיר

---

Here's my code so far...(not giving correct output):



Also, sometimes, depending on which character set I'm trying, I just get garbage characters that aren't recognizable. Not sure how to work around that.

Suggestions would be welcome...these character encodings are totally foreign to me, literally.

TIA

- mike
 
Paul Clapham
In line 3 you convert a String to bytes using your system's default charset -- whatever that might be.

Then in line 5 you convert those bytes to a String, telling the system that the bytes were encoded in CP862 -- which they weren't, because your system's default charset isn't CP862, I'm pretty sure.

So already you're on the wrong track.

It doesn't make sense to convert between two charsets unless you're given a sequence of bytes (an array, a file, whatever) which have been encoded in charset A, and you want to convert those bytes to a second sequence of bytes which have been encoded in charset B. So, starting from a String is not what you should be doing, and you shouldn't expect to end with a String either.

What I'm saying here is that charsets aren't the solution for what I think your problem is. If your client used that terminology, then I think they have the same misunderstanding as the code you posted does. You also used the word "translation" which isn't right either -- I think "transliteration" is what you want. You'll need some code which converts "K" into the corresponding Hebrew letter and so on. Or it might be more complicated than that if the rules of Hebrew letters don't match the rules of English letters -- I know that Hebrew is usually written from right to left but that's more of a formatting issue so it shouldn't apply to what you need to do.
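
To make the bytes-versus-String point concrete, here is a minimal sketch of what a charset conversion actually does. It is not the code from the original question; the class and method names (`CharsetConversionSketch`, `transcode`) are made up for illustration, and it only uses charsets that every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetConversionSketch {

    /** Re-encodes bytes that are known to be in charset 'from' into charset 'to'. */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        // Decode: bytes -> String (text held in memory, independent of any byte encoding)
        String text = new String(input, from);
        // Encode: String -> bytes in the target charset
        return text.getBytes(to);
    }

    public static void main(String[] args) {
        // e-circumflex (U+00EA) is the single byte 0xEA in ISO-8859-1
        byte[] latin1 = {(byte) 0xEA};

        // The same character is two bytes (0xC3 0xAA) in UTF-8
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(utf8)); // [-61, -86]

        // Decoding with the WRONG charset does not transliterate anything;
        // it just yields two unrelated characters (U+00C3, U+00AA), i.e. garbage.
        String garbage = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbage.length()); // 2
    }
}
```

The letter stays the same throughout; only its byte representation changes. That is why no choice of charset will ever turn a Latin letter into a Hebrew one.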
 
Mike London
Well, at least I'm on the wrong track right away. :)

The code does do a "getBytes". I found that code online, but there wasn't much I could find in any of my books or online examples on how to approach this problem.

Perhaps I'll just declare defeat now.

- mike
 
Paul Clapham
It looks to me that you want to convert the Latin letter K to the Hebrew letter Kaf, and so on. Is that right? To do that, it goes like this:



You'd have to do that for every letter, even assuming that a straight letter-to-letter replacement is sufficient. (For example I notice that the Hebrew letter Nun has a different form to use at the end of a word, and it's a different character, so maybe it's more complicated than that.)
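
A minimal sketch of such a letter-to-letter replacement, under heavy assumptions: the mapping table is hypothetical and is inferred only from the single before/after example in this thread ("Krikêr" becoming "קריקיר"), and the names `TransliterationSketch` and `transliterate` are made up for illustration. It does not handle final letter forms or any other real transliteration rule.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

public class TransliterationSketch {

    // Hypothetical table, inferred from one example only; a real table
    // would need an entry (and probably rules) for every letter.
    private static final Map<Character, Character> LATIN_TO_HEBREW = new HashMap<>();
    static {
        LATIN_TO_HEBREW.put('K', '\u05E7');      // qof
        LATIN_TO_HEBREW.put('k', '\u05E7');      // qof
        LATIN_TO_HEBREW.put('r', '\u05E8');      // resh
        LATIN_TO_HEBREW.put('i', '\u05D9');      // yod
        LATIN_TO_HEBREW.put('\u00EA', '\u05D9'); // e-circumflex -> yod
    }

    static String transliterate(String word) {
        // NFC composes "e" + combining circumflex into the single character U+00EA
        String normalized = Normalizer.normalize(word, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(normalized.length());
        for (char c : normalized.toCharArray()) {
            // Characters without a mapping are passed through unchanged
            char mapped = LATIN_TO_HEBREW.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the six Hebrew letters qof, resh, yod, qof, yod, resh
        System.out.println(transliterate("Krik\u00EAr"));
    }
}
```

Note that the expected output in the question leaves "Abraham Moses" in Latin letters, so a method like this would have to be applied to the surname only. A position-aware rule would also be needed for letters with a separate final form, such as Nun (U+05E0) versus final Nun (U+05DF).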

 
Matt Wong
Well, using System.out means you're writing to a terminal, which on Windows defaults to a legacy code page (windows-1251 or some other local code page, depending on the Windows version and locale settings). I'm assuming Windows here because most terminals on Linux desktops handle Unicode correctly.
The "garbage" appears because your console lacks the code page and font needed to render Java's Unicode output correctly.
So you should try a GUI approach like Swing or JavaFX.
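
If staying on the console is preferred, one option is to wrap System.out in a PrintStream with an explicit encoding. This is only a sketch: the class and helper names are made up, and it only helps if the terminal itself is set to UTF-8 and has a font with Hebrew glyphs, which the code cannot guarantee.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8ConsoleSketch {

    /** Wraps a stream so that text written to it is encoded as UTF-8. */
    static PrintStream utf8Stream(OutputStream target) {
        try {
            // Second argument enables autoflush on println
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Should not happen: every Java platform must support UTF-8
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // The Hebrew spelling from the question, written with Unicode escapes
        out.println("\u05E7\u05E8\u05D9\u05E7\u05D9\u05E8");
    }
}
```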
 
Stephan van Hulst
Well, even if the console used a font containing every character out there, the output would still appear as garbage, because Mike is interpreting characters serialized with one character set as if they belonged to a completely unrelated character set.
 
Mike London
Appreciate all the responses!

I think it's probably better to try and manage this problem from the output side (in this case, a PDF) rather than the raw data side since it may not even be a character-to-character mapping, assuming I could figure that out.

Thanks again.

- mike
 
Matt Wong
Thinking about this problem from another angle, I got another idea: do characters from different languages that look the same have the same meaning?
A quick example: some letters look the same in the Latin and Cyrillic alphabets but have different meanings. The best known is C: that's 'c' in Latin, but read as Cyrillic it's 's'. Another one is P, which is 'p' in Latin but 'r' in Cyrillic. So the famous CCCP reads as 'c', 'c', 'c', 'p' in Latin, but read as Cyrillic it comes out as 's', 's', 's', 'r'. But even though these all look like Latin letters, the "original" Cyrillic ones have different meanings, ordering and code points. So it's not as simple as printing Latin letters with a Cyrillic font; you have to language-translate them.
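
The look-alike point can be checked directly in Java. This is a small sketch with made-up class and method names; `Character.UnicodeScript` is part of the standard library since Java 7.

```java
public class LookalikeSketch {

    /** Returns the Unicode script that a code point belongs to. */
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        char latinC = 'C';          // U+0043 LATIN CAPITAL LETTER C
        char cyrillicEs = '\u0421'; // U+0421 CYRILLIC CAPITAL LETTER ES

        // They may render identically, but they are different characters
        System.out.println(latinC == cyrillicEs); // false
        System.out.println(scriptOf(latinC));     // LATIN
        System.out.println(scriptOf(cyrillicEs)); // CYRILLIC
    }
}
```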

And my guess here is that even if you have a way to render each character correctly, the question remains whether the characters carry the correct information. So you don't only have to re-map them to different code points by a charset conversion; you also somehow have to translate the information that a group of characters, known as a 'word', represents.

So you could very quickly run into some limitations of Java here, because Java's datatype for characters is an unsigned 16-bit value that ranges from 0 to 65535 - not much, given all the letters, languages and technical symbols that exist in the world.
 
Mike London
Thanks for your reply. I'll give that some thought.

- mike
 
Stephan van Hulst
Matt Wong wrote: So you could very quickly run into some limitations of Java here, because Java's datatype for characters is an unsigned 16-bit value that ranges from 0 to 65535 - not much, given all the letters, languages and technical symbols that exist in the world.

Java uses UTF-16, which supports the entire range of Unicode characters by using what are called "surrogate pairs". The problem is that a lot of programmers think one char represents one character, while reality is much, much more complex.
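
A short sketch of the surrogate-pair behavior (class and method names are made up; the code point U+1F600 is just an arbitrary example from outside the 16-bit range):

```java
public class SurrogatePairSketch {

    /** Counts Unicode code points rather than 16-bit char units. */
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 does not fit in 16 bits, so UTF-16 stores it as two chars
        String s = new String(Character.toChars(0x1F600));

        System.out.println(s.length());                             // 2
        System.out.println(countCodePoints(s));                     // 1
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}
```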

Anyway, this is not a problem of charsets or fonts. You need a complex rule engine to transliterate accurately. Note that in some languages, a "simple" transliteration isn't enough. Depending on the function of the name in a sentence, you may also have to decline it in some way. For instance, the Russian name Ю́рий (Yuri) may have to be written as Юрию when used in the dative case ("I gave something to Yuri").
 
Mike London
Yep, I fell into the trap thinking it was just a character-to-character translation.

The customer found a better solution to the problem after all -- using a macro in the actual output WP document to do the processing.

Thanks again!!

- mike
 