
charset conversion CP1252 to UTF-16

 
swapnel surade
Ranch Hand
Posts: 129
Hello,

I'm using i-net PDF Content Comparer v1.10. After a comparison, I try to read the difference string.
That string is in the CP-1252 character format, but Java recognizes only UTF formats, so I'm losing characters in the process.
What is the correct way to convert from CP-1252 to UTF-16 or UTF-8 without losing the characters?

Thanks
 
Paul Clapham
Sheriff
Posts: 22374
Well, no, a String doesn't have an encoding or a charset. An array of bytes (or something similar, such as a file) has a charset if it represents text, and when you convert that array to a String, you interpret the bytes according to some charset. If you don't specify one, your system default will be used. Likewise, when you convert a String to bytes, you will again be using a charset.
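
To make the point above concrete, here is a minimal sketch of decoding the same byte under two different charsets. The class name `DecodeDemo` and the sample byte are illustrative assumptions, not code from this thread; `windows-1252` is the Java charset name assumed here for CP1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows that bytes, not Strings, carry an encoding.
final class DecodeDemo {
    // Assumption: the JRE provides windows-1252 (Java's name for CP1252).
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Interpret raw bytes as text under an explicitly named charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // Format the first char of a String as a Unicode code unit label.
    static String firstCharCode(String s) {
        return String.format("U+%04X", (int) s.charAt(0));
    }

    public static void main(String[] args) {
        // 0x96 is the en dash in CP1252; alone it is not valid UTF-8.
        byte[] raw = { (byte) 0x96 };
        System.out.println(firstCharCode(decode(raw, CP1252)));                 // U+2013
        System.out.println(firstCharCode(decode(raw, StandardCharsets.UTF_8))); // U+FFFD
    }
}
```

Naming the charset explicitly on every bytes-to-String or String-to-bytes conversion avoids depending on the system default.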

So your question is not on the right track. Perhaps you could post some code, if you can't figure out where the incorrect encoding or decoding is taking place?
 
swapnel surade
Ranch Hand
Posts: 129
Hi,

When I get the string, it looks like this:
1st string : Text "‐000001875‐0/000" was changed to "‐000001893‐0/000"
but when I print this string or use it for comparison, it looks like this:
2nd string : Text "?000001875?0/000" was changed to "?000001893?0/000"

I checked the charset of the 1st string and it shows CP1252. The character I'm getting is not the hyphen '-'; it is a different character.

When I convert this string to UTF-8 or UTF-16, the special character is converted to '?'.

I should get a hyphen in the second string.

Following is the code snippet


In the above code, the value I get from the getDescription() method contains the special character, but when I call getBytes("CP1252"), the resulting byte array has that special character converted to '?'.

Am I using the wrong charset?


 
Ireneusz Kordal
Ranch Hand
Posts: 423
swapnel surade wrote:

I checked the charset format for 1st string it is showing CP1252 and i'm not getting hyphen '-' its a different char than hyphen.


Please post the character code of this 'hyphen'.
Isn't it a 'hyphen' copied and pasted from an MS Word document?
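
One way to find the character code being asked for is to print every code point in the string. This is a small sketch; the class name `CodePointDump` and the sample input are hypothetical, and it assumes Java 8 or later for `codePoints()`.

```java
// Hypothetical helper: lists the Unicode code points of a String.
final class CodePointDump {
    // Returns the code points as space-separated labels such as "U+002D".
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // ASCII hyphen-minus followed by U+2010 HYPHEN (assumed sample input).
        System.out.println(describe("-" + "\u2010")); // U+002D U+2010
    }
}
```

Running `describe` on the value returned by getDescription() would show whether the 'hyphen' is U+002D or some other character.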
 
Paul Clapham
Sheriff
Posts: 22374
I would just throw both of those lines of code away.

The first line says: Convert this string to bytes using the CP-1252 charset.

The second line says: Convert these bytes to a string assuming that the UTF-8 charset was used to encode the bytes.

So clearly the second line is going to cause trouble, because it's using an assumption which is false. The way to fix that is to just leave the string alone and not do either of those lines of code.
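
The two lines described above can be sketched as follows to show where the '?' comes from. This assumes the special character is U+2010 HYPHEN, which the thread does not confirm; the class and method names are hypothetical. U+2010 has no mapping in CP1252, so the encoder substitutes '?' before the second line ever runs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of the lossy round trip discussed in the thread.
final class RoundTripDemo {
    // Assumption: the JRE provides windows-1252 (Java's name for CP1252).
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Encode as CP1252, then decode the same bytes as UTF-8.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(CP1252); // unmappable chars become '?'
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // True if every character of s has a CP1252 representation.
    static boolean fitsInCp1252(String s) {
        return CP1252.newEncoder().canEncode(s);
    }

    // Optional: swap U+2010 for the ASCII hyphen-minus if that is wanted.
    static String toAsciiHyphen(String s) {
        return s.replace('\u2010', '-');
    }

    public static void main(String[] args) {
        // Sample text built with the assumed U+2010 character.
        String text = "\u2010" + "000001875" + "\u2010" + "0/000";
        System.out.println(fitsInCp1252(text));  // false
        System.out.println(roundTrip(text));     // ?000001875?0/000
        System.out.println(toAsciiHyphen(text)); // -000001875-0/000
    }
}
```

Leaving the String untouched preserves the character. The replacement in `toAsciiHyphen` is only needed if the comparison really requires an ASCII hyphen.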
 