
Any hints for creating and/or using an existing Unicode converter/processor?

 
Ranch Hand
Posts: 898
For example, I would like to convert Unicode character codes to ASCII and other encodings.
 
Ranch Hand
Posts: 73
My Friend,
you can easily convert a String to a different charset using:
String p = "Any string";
p = new String(p.getBytes("UTF8"));
I would like to know how to check what is the current CharSet used by the JVM.
 
Greenhorn
Posts: 12
just check out if this works

String charset = response.getCharacterEncoding();
 
Guennadiy VANIN
Ranch Hand
Posts: 898
Alex and Sheril,
I know how to program. What I was really asking about is a ready-made processor with such functionality.
My name is not Friend, it, the name, usually appears at the left sidebar and/or under the post.


String p = "Any string";
p = new String(p.getBytes("UTF8"));


This is not correct for an arbitrary string. Try any Cyrillic character to see. The JVM uses Unicode internally, i.e. 2 bytes per char. You may also write your program directly in Unicode escapes, and it is all the same to javac!
[ October 29, 2002: Message edited by: G Vanin ]
 
Wanderer
Posts: 18671
Come on, Guennadii - "my friend" is a simple friendly greeting. It doesn't imply your name is "friend" any more than it implies your name is "my". Will you object to "you" next? Please, lighten up. No offense was intended.
Java does have built-in functionality for this, and getBytes() is part of it. Unfortunately the example shown is incorrect - a better one would be:

The problem in the original code is that while getBytes() converted Unicode to UTF-8, the new String(byte[]) constructor (probably) did not use UTF-8. Instead it used the default encoding on your system, whatever that may be. On Windows systems in the Americas and Western Europe it's usually Cp1252 (Windows Latin-1).
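A minimal sketch of that mismatch, for illustration only. The class and method names are made up, ISO-8859-1 stands in for "some non-UTF-8 platform default", and StandardCharsets needs Java 7 or later (on older JDKs the encoding name can be passed as a String instead):

```java
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {
    // Encode to UTF-8 and decode with the SAME charset: the text survives.
    static String matched(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Encode to UTF-8 but decode as ISO-8859-1 (a stand-in for a
    // non-UTF-8 default): every UTF-8 byte turns into its own char.
    static String mismatched(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String cyrillic = "\u041F\u0440\u0438\u0432\u0435\u0442"; // six Cyrillic letters
        System.out.println(matched(cyrillic).equals(cyrillic)); // true
        System.out.println(mismatched(cyrillic).length());      // 12, not 6
    }
}
```

With plain ASCII input both methods return the original text, which is probably why the one-argument version looks fine until a Cyrillic character shows up.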
You can also use an OutputStreamWriter to convert Unicode to other encodings, and an InputStreamReader to convert other encodings to Unicode. See the constructors which accept a String encoding argument, or a Charset (in 1.4).
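Roughly, the two directions could look like the sketch below. The helper names are hypothetical, in-memory streams are used only to keep it self-contained, and UncheckedIOException assumes Java 8 or later:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

class StreamBridges {
    // chars -> bytes: the writer encodes on the way out.
    static byte[] encode(String text, String encoding) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            Writer out = new OutputStreamWriter(bytes, encoding);
            out.write(text);
            out.close(); // flushes the encoder into the byte stream
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // bytes -> chars: the reader decodes on the way in.
    static String decode(byte[] data, String encoding) {
        try {
            Reader in = new InputStreamReader(new ByteArrayInputStream(data), encoding);
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
            in.close();
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // 'e' with acute accent at the end
        System.out.println(encode(text, "ISO-8859-1").length); // 4
        System.out.println(encode(text, "UTF-8").length);      // 5
        System.out.println(decode(encode(text, "UTF-8"), "UTF-8").equals(text)); // true
    }
}
```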
I would like to know how to check what is the current CharSet used by the JVM.
Annoyingly Java doesn't seem to directly provide this info. Sheril's response tells you how a server is configured to respond to HTTP requests, which isn't necessarily the same thing. (And what if you're not even running a server?) The best workaround I have to find the system default encoding is:
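A minimal sketch of one workaround of this kind (not necessarily the one meant above): build a reader without naming an encoding and ask it what it picked. The class and method names are hypothetical. From Java 5 on, Charset.defaultCharset() gives the same information directly, and the file.encoding property is printed only for comparison.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class DefaultEncoding {
    // An InputStreamReader created without an explicit encoding falls back
    // to the platform default; getEncoding() reports the name it is using.
    static String viaReader() {
        InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(new byte[0]));
        return reader.getEncoding();
    }

    // Java 5 and later expose the default charset directly.
    static Charset viaCharset() {
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println("reader:        " + viaReader());
        System.out.println("Charset:       " + viaCharset().name());
        System.out.println("file.encoding: " + System.getProperty("file.encoding"));
    }
}
```

The reader may report a historical name (for example UTF8) where Charset reports the canonical one (UTF-8), so the two strings are not guaranteed to be identical.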

[ October 29, 2002: Message edited by: Jim Yingst ]
 
"The Hood"
Posts: 8521

Originally posted by G Vanin:
the name, usually appears at the left sidebar.


Except yours of course. :roll: He may not have want to call you G and most people do NOT put their name on the bottom of a post.
You COULD change your display name . . . . .
 
Guennadiy VANIN
Ranch Hand
Posts: 898
Jim,
I believe that characters in Java are treated as 16-bit Unicode characters (Reader, Writer). They do not depend on a particular encoding format because they are in Unicode.
I have a notion that the Latin-1 encoding (the one used in the US and Europe) is “8859_1” (ISO 8859-1). The encoding in use may be obtained/verified by calling

System.getProperty("file.encoding")

Anyway, it is not UTF.
It is always possible to enforce another encoding conversion through byte streams (InputStream, OutputStream). There are bridges between byte streams and character streams: InputStreamReader and OutputStreamWriter. They are character stream objects (Reader or Writer) that take a byte stream (InputStream or OutputStream) and, optionally, an encoding. OK.
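As a rough sketch of that bridging, bytes in one encoding can be re-encoded into another by chaining the two bridge classes. The class name and the in-memory convenience wrapper are assumptions for illustration only:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

class Transcoder {
    // Decodes `in` as fromEncoding and re-encodes the characters to `out`
    // as toEncoding. Inside the loop the data is plain Java chars.
    static void transcode(InputStream in, String fromEncoding,
                          OutputStream out, String toEncoding) throws IOException {
        Reader reader = new InputStreamReader(in, fromEncoding);
        Writer writer = new OutputStreamWriter(out, toEncoding);
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush();
    }

    // In-memory convenience wrapper around the stream version.
    static byte[] transcode(byte[] data, String fromEncoding, String toEncoding) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transcode(new ByteArrayInputStream(data), fromEncoding, out, toEncoding);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // U+00E9 in ISO-8859-1
        byte[] utf8 = transcode(latin1, "ISO-8859-1", "UTF-8");
        for (byte b : utf8) {
            System.out.printf("%02x ", b & 0xFF); // c3 a9
        }
        System.out.println();
    }
}
```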

response.getCharacterEncoding();

Characters produced outside of Java may certainly be in any encoding. This depends on the OS, the application and/or its configuration, and even on the processor. Who knows the origin of our streams (is it a file created in Taiwan, sorry in China?) O-o-o-h, I did not intend to discuss any of this (please refer to my original question. Anyway I repeated it in https://coderanch.com/t/113264/HTML-JavaScript/If-anybody-knows-any-text)
What I could not get from all your deviations, sorry arguments:
1) What are those UTF8 examples all about? Why UTF8?
2) Can you explain Cp-1252 to me? What is its relation to Latin-1?
[ November 06, 2002: Message edited by: G Vanin - just changed <blockquote> to

]
[ November 06, 2002: Message edited by: G Vanin ]

 
Guennadiy VANIN
Ranch Hand
Posts: 898
Cindy and Jim,

most people do NOT put their name on the bottom of a post.

then somebody ALWAYS put it on the left.
Sounds like my name was pages away from 1-line post. See more above

Except yours of course


That hurts. Where from comes such a terrible and unfair suspicion abt forgering my names?.

You COULD change your display name . . . .


NEVER! I already tried once, and after that Javaranch adds a “greenhorn” to my names. I’d rather prefer My Friend.
My names are already automatically augmented by a “member” in each post.. Ask Russians what translation does mean literally.
Then, how would my fans, from bartenders, find me (to call me ”jerks” without capitalization and “idiots as always” also without any proper capitalization), if I start changing my names?

He may not have want to call you G


Very nice and clever of him but just at the level of 4-5 line there were 2 complete names at choice. And any intelligent one would have understood that “G” is just a letter/abbreviation for first name Guennadii but not, in any possible way, “G” is the name. The last name is also found easily after some investigation, it is just the last amongst more than one (Vanin).
The reason that I abbreviated “G” is my experience that latins (peoples to the South of Europe, and South of North America) has enormous difficulties pronouncing and remembering something that should be pronounced as [g] as in goose before “e”, since it never happens in their languages, e.g. (in Portuguese). Then I started adding “Guennadii” underneath since some bartenders here do not understand the difference between an abbreviation, i.e. a letter, and a name, i.e. something with more than one letter. .

"my friend" is a simple friendly greeting


Even if to forget about capitalization of the words, never before did it happen to me in a friendly context or with friendly intentions.
In any language there are more appropriate terms even in situations when the name is unknown.
Usually it is used in sarcastic approaches
[ October 31, 2002: Message edited by: G Vanin ]
 
Ranch Hand
Posts: 213
To answer some questions:
1) The platform default encoding can be obtained by System.getProperty("file.encoding")
2) Cp1252 is the Windows character set (code page 1252). It's very similar to ISO-8859-1 (Latin 1), but is not identical. For example, in Latin 1, characters in the range 128-159 are control characters (non-printable). However, 1252 assigns some printable characters (such as the TM symbol) to codes within this range. So Cp1252 is kind of like a superset of the printable part of Latin 1.
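The difference can be seen by decoding the same byte both ways. This is only a sketch: the class name is made up, and it assumes the runtime supports the windows-1252 charset (common, but not one of the charsets every Java platform is required to provide).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252VsLatin1 {
    // Decodes a single byte value with the given charset.
    static char decodeOne(int byteValue, Charset charset) {
        return new String(new byte[] {(byte) byteValue}, charset).charAt(0);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed to be available
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // 0x99 lies in the 128-159 range where the two differ.
        System.out.printf("0x99 in Cp1252:  U+%04X%n", (int) decodeOne(0x99, cp1252)); // U+2122, the TM sign
        System.out.printf("0x99 in Latin-1: U+%04X%n", (int) decodeOne(0x99, latin1)); // U+0099, a control char

        // 0xE9 lies in the 160-255 range where they agree.
        System.out.printf("0xE9 in Cp1252:  U+%04X%n", (int) decodeOne(0xE9, cp1252)); // U+00E9
        System.out.printf("0xE9 in Latin-1: U+%04X%n", (int) decodeOne(0xE9, latin1)); // U+00E9
    }
}
```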
3) I think there is some confusion here regarding "character sets" and encodings. A character set is a mapping between characters and character codes (integers). UTF-8, on the other hand, is simply an encoding scheme (i.e. it represents character codes as a sequence of octets), and it is used to encode the Unicode character set. A character set may additionally specify one or more encoding schemes; for example, Unicode has UTF-8 and UTF-16.
Also it's important to remember that beyond US-ASCII (i.e. character codes > 127), all other encoding schemes (like Latin 1) are incompatible with UTF-8. This is because UTF-8 is a multi-byte encoding scheme which encodes characters like:
1 byte: 0xxxxxxx - the leading 0 indicates that a single byte is used to encode the character
2 bytes: 110xxxxx 10xxxxxx - the 110 prefix in the first byte indicates that 2 bytes are used
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx - the 1110 prefix in the first byte indicates that 3 bytes are used
On the other hand, Latin 1 encoding is pretty straight-forward....simply write out the character code value (1 byte). Thus for character codes 128 - 255, Latin-1 uses 1 byte for encoding while UTF-8 uses 2 bytes.
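To make the bit patterns concrete, the sketch below prints the encoded bytes of a few characters in binary. The class and method names are hypothetical, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8Bits {
    // Renders each encoded byte as 8 binary digits, separated by spaces.
    static String bits(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String binary = Integer.toBinaryString(b & 0xFF);
            sb.append("00000000".substring(binary.length())).append(binary);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0041 (A): one byte, leading 0
        System.out.println(bits("A", StandardCharsets.UTF_8));           // 01000001
        // U+00E9 (e with acute): two bytes, 110xxxxx 10xxxxxx
        System.out.println(bits("\u00E9", StandardCharsets.UTF_8));      // 11000011 10101001
        // U+20AC (euro sign): three bytes, 1110xxxx 10xxxxxx 10xxxxxx
        System.out.println(bits("\u20AC", StandardCharsets.UTF_8));      // 11100010 10000010 10101100
        // The same U+00E9 in Latin 1: just the code value in one byte
        System.out.println(bits("\u00E9", StandardCharsets.ISO_8859_1)); // 11101001
    }
}
```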
To add to the confusion, the "charset" header that is used in HTTP is really the encoding that the web-page employs & not the character set. :roll:
[ October 31, 2002: Message edited by: Junaid Bhatra ]
 
Jim Yingst
Wanderer
Posts: 18671
Even if to forget about capitalization of the words, never before did it happen to me in a friendly context or with friendly intentions.
Well, it has happened to me. Many times. Perhaps that's because you insist on interpreting friendly gestures as unfriendly, until no one bothers trying to be friendly to you? Sure, some people may be being sarcastic when they say "my friend" - but many people are not. Why assume the worst?
That hurts. Where from comes such a terrible and unfair suspicion abt forgering my names?.
No one is suspicious that you forged your name. Cindy simply observed that the name you prefer to be called, is not actually what's listed to the left of your post. It's a mystery to us why you continue to confuse people this way and then complain about it later.
I already tried once, and after that Javaranch adds a “greenhorn” to my names. I’d rather prefer My Friend.
You don't need to register a new account. Just go to "my profile" -> "View/Edit Profile" and change the "Publicly Displayed Name".
Now note that even after you change your name (if you do) it's still possible that someone might call you "my friend" without sarcasm, and it would probably be a good idea if you tried not to be offended for no good reason. You might even realize that people here have in fact been trying to help you, and saying "thank you" occasionally would be nice. Even if your question is not yet answered to your satisfaction, people have been trying. For some reason. :roll:
what are those UTF8 examples all about? Why UTF8?
Sheril chose UTF-8 for his/her example, and I simply continued it. It's a very common encoding, but you can use most any other encoding you want in much the same way. (Assuming it's one that Java understands.)
 
whippersnapper
Posts: 1843
Not to turn this into yet another Russian vs. English thread...
My names are already automatically augmented by a “member” in each post.. Ask Russians what translation does mean literally.
"Member" has the exact same dual meanings in English as it has in Russian. Somehow we all survive.
 
Guennadiy VANIN
Ranch Hand
Posts: 898
Junaid and Jim,
thanks for your effort. It is useful for CP-1252. Though I certainly need some link to more exhaustive text to close the subject.
I always imagined any data as bits “0” and “1”, that are eventually integers, in my head (your “octets”?). Everything is integer to me, there, inside PC. There are also such terms as “format”, “template”, “representation”. Both digits and characters are, after all just glyphs, graphical representations according to encodings/formats.
UTF is just a way to save bandwidth, since most symbols fall inside ANSI/ASCII and therefore avoid the waste of a second byte. Java uses Unicode, I believe. As for UTF... Honestly speaking, it is in this discussion that I first encountered the need to know about it.
0) I would like to know how it comes to practical use, to be chosen.
1)I had not been aware abt existence of UTF-8 and UTF-16. Any further comments or links?
2) As a matter of fact, I also did not know that the second/third bytes in 2/3-byte sequences start with “10”. What is the function of “10” (why may they not be arbitrary, if they are to be ignored)? I.e., why is “110”/“1110” not sufficient?
3)

To add to the confusion, the "charset" header that is used in HTTP is really the encoding that the web-page employs & not the character set.


This is a great pain to me. I use access through the library and save-as some pages. And you know, it is strange but at home PC (English versions of WindowsXP) I visualize OK pages in Russian but not in German. But access to Internet, in library, is through MS IE5 in Portuguese. The Windows 2000 are sometimes in English, sometimes in Portuguese. They use CP-1252
For ex., I cannot visualize (german) pages from
http://www.bild.t-online.de
Pages have hundred of KBs but do not show anything. I tried changing encodings in View - no effect.
[ November 06, 2002: Message edited by: G Vanin ]
 
Guennadiy VANIN
Ranch Hand
Posts: 898
Michael,
honestly speaking, I thought I am in Colorado.... communicating with Mexicans. Strange that it is happened to be in England. Can you give me the links to the mentioned threads?
This is the end of discussion, not the beginning. You are late
 
Jim Yingst
Wanderer
Posts: 18671
Something I overlooked earlier...
1) The platform default encoding can be obtained by System.getProperty("file.encoding")
I would note that while this does work on many systems, it's not actually guaranteed by the API. See list under "getProperties()" here.
And more recently...
0) I would like to know how it comes to practical use, to be chosen.
1)I had not been aware abt existence of UTF-8 and UTF-16. Any further comments or links?

Well, "UTF" by itself is ambiguous - it may refer to either UTF-8 or UTF-16. (Or other more obscure variants.) Basically UTF-8 is designed as a reasonably simple encoding which is efficient for Western European languages - typically requiring only 1 byte for most characters. The downsides are that since it's variable-length it's more complex to parse, and most Asian languages end up requiring 3 bytes per char. UTF-16 on the other hand is simpler, and all chars require two bytes. (Unless you want to use Unicode values above 0xFFFF, which are almost never needed by most of us and require more complex handling in Java, which I don't understand very well.) So generally UTF-8 is preferred in the West, and UTF-16 in Asian countries. (Or some other encoding more specific to a given language, like Shift-JIS for Japanese.)
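A small sketch of the size trade-off, counting encoded bytes for a few sample characters. The class and method names are made up; UTF-16BE is used so that no byte-order mark is added to the count, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizes {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF-16BE rather than UTF-16, so that no byte-order mark is counted.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String latin = "A";                // U+0041
        String cyrillic = "\u042F";        // U+042F, Cyrillic capital Ya
        String cjk = "\u65E5";             // U+65E5, a CJK ideograph
        String beyondBmp = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair

        System.out.println(utf8Length(latin) + " / " + utf16Length(latin));         // 1 / 2
        System.out.println(utf8Length(cyrillic) + " / " + utf16Length(cyrillic));   // 2 / 2
        System.out.println(utf8Length(cjk) + " / " + utf16Length(cjk));             // 3 / 2
        System.out.println(utf8Length(beyondBmp) + " / " + utf16Length(beyondBmp)); // 4 / 4
    }
}
```

The last line is the "above 0xFFFF" case: one character, but two Java chars and four bytes in either encoding.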
On reflection, maybe I should have just provided these two links:
http://www.wikipedia.org/wiki/UTF-8
http://www.wikipedia.org/wiki/UTF-16
I'm not sure if that's what you were asking, but you can always Google for more links.
2) As a matter of fact, I also did not know that the second/third bytes in 2/3-byte sequences start with “10”. What is the function of “10” (why may they not be arbitrary, if they are to be ignored)? I.e., why is “110”/“1110” not sufficient?
It's an easy way to tell if a particular byte is the start of a multibyte sequence or not. If you're writing an encoder or decoder and you commit some sort of off-by-one-byte error, it's easier to detect the error this way.
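For instance, a decoder that lands in the middle of a character can skip forward to the next character boundary just by looking at the top two bits. A minimal sketch, with hypothetical names:

```java
import java.nio.charset.StandardCharsets;

class Utf8Resync {
    // A continuation byte always has the form 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // From an arbitrary offset, steps forward to the next byte that can
    // start a character (returns utf8.length if there is none).
    static int nextCharStart(byte[] utf8, int offset) {
        int i = offset;
        while (i < utf8.length && isContinuation(utf8[i])) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // Euro sign followed by "uro": bytes E2 82 AC 75 72 6F
        byte[] data = "\u20ACuro".getBytes(StandardCharsets.UTF_8);
        System.out.println(nextCharStart(data, 0)); // 0: already at a start byte
        System.out.println(nextCharStart(data, 1)); // 3: skipped 82 and AC
        System.out.println(nextCharStart(data, 2)); // 3: skipped AC
    }
}
```

If the trailing bytes were arbitrary, a byte such as 0x75 in the middle of a sequence could not be told apart from a real 'u'.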
To add to the confusion, the "charset" header that is used in HTTP is really the encoding that the web-page employs & not the character set.
Similarly java.nio.charset.Charset in JDK 1.4 really represents an encoding, not a character set.
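A short sketch of Charset used as an encoder, which also touches the original Unicode-to-ASCII question. The class and method names are hypothetical. Charset.encode(String) replaces characters the target encoding cannot represent, which for US-ASCII means a question mark.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class CharsetDemo {
    // True if every character of text can be represented in the charset.
    static boolean fits(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Encodes text, letting the charset substitute its replacement bytes
    // for anything it cannot represent.
    static byte[] encodeLossy(String text, Charset charset) {
        ByteBuffer buffer = charset.encode(text);
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return out;
    }

    public static void main(String[] args) {
        Charset ascii = Charset.forName("US-ASCII");
        System.out.println(fits("Java", ascii));   // true
        System.out.println(fits("\u042F", ascii)); // false: Cyrillic Ya is not in ASCII
        System.out.println(new String(encodeLossy("\u042Fva", ascii), ascii)); // ?va
    }
}
```

Checking canEncode first is one way to find out whether a conversion would lose information before committing to it.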
http://www.bild.t-online.de
Pages have hundred of KBs but do not show anything. I tried changing encodings in View - no effect.

The site comes up fine for me - encoding is ISO-8859-1. I suspect this has nothing to do with encoding issues, but rather with the fact that you're accessing from a library in the US. The website is a little racy by US library standards - it has some nudity (even if that's not the primary focus), and so the library probably has it blocked. You may be interested in David O'Meara's suggestion in this thread.
[ November 06, 2002: Message edited by: Jim Yingst ]
 
Guennadiy VANIN
Ranch Hand
Posts: 898
Jim,
thanks for all that stuff (I saved it and shall study it later).
Meanwhile, that problem of viewing pages after save-as has been reproduced by others; see in "HTML and Javascript"
https://coderanch.com/t/113280/HTML-JavaScript/Cannot-open-pages-After-saving
so, it is weird
 
Trailboss
Posts: 24082
Some people seem to find the best possible interpretation of a message and are happy to get any response to a question.
Some people seem to find insult and injury in almost any message.
Some people seem to have fun and have a good time wherever they go.
Some people seem to be cranky all day long, every day.
There are over six billion people in the world. There's no reason to spend any time talking to cranky people.
 
Guennadiy VANIN
Ranch Hand
Posts: 898
Paul,
that's a challenge: to talk to 6 billion
 