
Unicode: cmd parameters (main args); exec parameters; filenames

 
Robert Grampp
Greenhorn
Posts: 5
Hi there,

Java is said to be "Unicode-ready". That's relatively easy as long as I stay within the JVM.
It gets difficult at the interfaces to the shell (main/exec) and the file system.
Unfortunately, even after a lot of googling, I didn't find convincing or reliable answers.
I would very much appreciate your help with these questions:
A Java application must be capable of handling international characters (not only ISO-8859-1):
1) in command line parameters for the Java application
Is Java's main() generally capable of 16-bit (multi-byte) characters (cf. C++'s wmain)?
By "shell" I mean cmd.exe (Windows XP Professional and later).
I set the code page of the Windows cmd console to Unicode (UTF-8):
chcp 65001
When calling

java -Dfile.encoding=UTF-8 UnicodeTests äöüÄÖÜß

I print args[0] and can see those characters correctly in the cmd shell (a stripped-down sketch of the test class is at the end of this post).
What is unclear to me: here I only have ISO-8859-1 resp. Windows-1252 characters. What about a Chinese Windows, where not all characters of the language fit into one byte?
Can the cmd shell generally pass more than one byte per character through to Java's main()?
Can anyone even type two-byte characters in cmd?
I assume that by switching the code page (chcp) I can only move to another one-byte window within the Unicode character set(?)
Do I need any workaround?

2) international characters in parameters for external programs to be called (exec / ProcessBuilder):
What do I have to pay attention to?

3) Finally: how do I handle files with Unicode characters (or other encodings) in the filename?
Which encoding do File(String pathname) and File.list() work with?
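
For question 1, a stripped-down sketch of the kind of test class I'm using (it just prints each argument and the Unicode code point of each of its chars):

public class UnicodeTests {
    public static void main(String[] args) {
        for (int i = 0; i < args.length; i++) {
            System.out.println("args[" + i + "] = " + args[i]);
            // Print every char as a Unicode code point so I can see what really arrived.
            for (int j = 0; j < args[i].length(); j++) {
                System.out.printf("  char %d: U+%04X%n", j, (int) args[i].charAt(j));
            }
        }
    }
}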

Thanks a lot in advance!

Regards, Robert
 
Ulf Dittmer
Rancher
Posts: 42970
Hello "fux"-

Welcome to JavaRanch.

On your way in you may have missed that we have a policy on screen names here at JavaRanch. Basically, it must consist of a first name, a space, and a last name, and not be obviously fictitious. Since yours does not conform with it, please take a moment to change it, which you can do right here.

As to your question, Java is more than "Unicode-ready" - it uses Unicode internally for all strings (UTF-16, to be precise). All external character data transferred into the JVM is converted to Unicode using the encoding specified by the "file.encoding" property. That applies to command line parameters as well as to files. It's possible to use a different encoding for particular operations by specifying it explicitly, e.g. for reading files one might use one of the two-parameter InputStreamReader constructors.
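
For example, reading a UTF-8 file could look roughly like this (just a sketch):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class ReadUtf8File {
    public static void main(String[] args) throws Exception {
        // The explicit "UTF-8" overrides the file.encoding default for this reader only.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}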

Does this answer your questions?
 
Darryl Burke
Bartender
Posts: 5167
This poster had posted the same topic on the Sun forums and was blocked there for using an offensive nick.

db

edit: I'm happy to see that the offensive nick is no longer displayed here.
[ October 13, 2008: Message edited by: Darryl Burke ]
 
Robert Grampp
Greenhorn
Posts: 5
Hi Ulf,
yes, I missed that policy. I have corrected it to my real name. I thought it was usual to use nicknames in forums.

The other thing - yes, I also posted this question on the Sun forums. Is that really a problem? It's another website.

Maybe that was a bad beginning. Anyway I appreciate your help.

I really researched a lot before asking. I see it's hard to find the requested information via a search engine. Lots of posts are about basic misunderstandings of encoding topics (or about not even knowing what "Unicode" is).
If experienced users read "Unicode" in the subject line, they might assume it's again about such basic things.

Well, sure, it's not only about Java; it also involves other programs, and the interplay with the cmd shell and its capabilities.

I read this on http://forums.java.net/jive/thread.jspa?messageID=21360 and wonder whether this is only due to wrong usage by the users:
"... some of the Unicode problems in J2SE 6.0 (downloaded around February 2005).
* We can't pass Unicode (Wide Character) command-line arguments to the java launcher.
* We can't pass Unicode (Wide Character) command-line arguments to the processes launched by java (e.g. through java.lang.ProcessBuilder class)
"

It's also about how to test all this (I have neither a Chinese Windows nor a Chinese colleague who could operate one to support my tests) and about finding out on which application's side I have to tune things. I just realized that I can't enter characters outside the Windows-1252 set in the cmd shell.

Background:
We have an enterprise software application that has command line clients and that has to call external programs, passing string parameters that may contain arbitrary international text.

So again the question: are you sure that wide characters can be passed from the Windows cmd into Java's main()?

Thanks for your help!

Regards, Robert
 
Ulf Dittmer
Rancher
Posts: 42970
I have corrected it to my real name. I thought it was usual to use nicknames in forums.

It's fine to use fake names, it just needs to look like it could be a real name. (Not sure why you're afraid of using your real name, though, but let's not get into that here.)

The other thing - yes, I also posted this question on the Sun forums. Is that really a problem? It's another website.

It's the polite thing to do: BeForthrightWhenCrossPostingToOtherSites
There's no sense in spending time answering on one site when there's already an answer on another site.

So again the question: are you sure that wide characters can be passed from the Windows cmd into Java's main()?

I'm not exactly sure what wide characters are (some form of 16-bit characters used on Windows, I guess), but yes, I'm sure that Java can handle anything that's thrown its way in a number of encodings. It's crucial to tell the Java code what that encoding is, of course, and you're only guaranteed US-ASCII, ISO-8859-1, UTF-8 and UTF-16. Everything else is optional, so I wouldn't rely on any particular JRE being able to handle it. But Unicode should cover all the Chinese bases. Of course, UTF-16 is subject to byte-ordering problems, so it's best to steer clear of that as well. Which leaves you with UTF-8. How you'd get a Windows (or Unix) shell to pass arguments in UTF-8 -if they even support that- I have no idea. If that was my problem I'd probably move that data to a file, and pass the file name as part of the command line.
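
To sketch that last idea (the program name "someprogram.exe" is made up, and the text is just an example):

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class PassViaFile {
    public static void main(String[] args) throws Exception {
        // Write the problematic text to a temp file in a known encoding (UTF-8)...
        File f = File.createTempFile("args", ".txt");
        Writer w = new OutputStreamWriter(new FileOutputStream(f), "UTF-8");
        w.write("\u4F60\u597D");  // some Chinese text, for example
        w.close();

        // ...and pass only the (ASCII-safe) file name on the command line.
        Process p = new ProcessBuilder("someprogram.exe", f.getAbsolutePath()).start();
        p.waitFor();
    }
}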
 
Robert Grampp
Greenhorn
Posts: 5
Originally posted by Ulf Dittmer:

(Not sure why you're afraid of using your real name, though, but let's not get into that here.)


Hi Ulf,
I had already written down a reason for that, but deleted it so as not to start that discussion.


There's no sense in spending time answering on one site when there's already an answer on another site.

Oh, good point. I hadn't thought of that aspect. Well, it's a topic where I didn't expect to get clear, "final" information or a solution, so I thought I'd gather some of the information from this forum's users and some from another forum's users.


But Unicode should cover all the Chinese bases.

Unicode does, but Java doesn't cover all Chinese chars _directly_, since 2 bytes are not sufficient for all chars; the Unicode spec now uses 3 bytes to represent some special Chinese chars (CJK Unified Ideographs (Unihan)).
Just for info for the readers:
Therefore, in Java 1.5 they introduced "supplementary characters" with new and adapted methods (e.g. indexOf() taking such longer chars within a Java String into account) to handle them.
I only have a link about this topic for the German readers:
http://itblog.eckenfels.net/archives/17-Java-und-Unicode.html

How you'd get a Windows (or Unix) shell to pass arguments in UTF-8 -if they even support that- I have no idea. If that was my problem I'd probably move that data to a file, and pass the file name as part of the command line.

Ulf, I also thought of that. But just as we US/European users are used to putting ASCII/Windows-1252/ISO-8859-1... characters as parameters on the command line, maybe Asian users are used to doing that with their usual character set too, and so would also expect this to be supported by our command line client.
But I still don't know whether Asian users actually do / can make such inputs.

So far,
Regards, Robert
 
Ulf Dittmer
Rancher
Posts: 42970
Java doesn't cover all Chinese chars _directly_, since 2 bytes are not sufficient for all chars

Yes, but the important point is that UTF-8 supports characters beyond the BMP, and that importing UTF-8 containing 24-bit and 32-bit characters into Java will work just fine. The Java code may then have to account for those in some operations (like counting characters). A couple of further links about that can be found in the http://faq.javaranch.com/java/JavaIoFaq, but since you're aware of the problem, that's probably nothing new.
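
A quick illustration of the counting issue (U+20000 is just an arbitrary character outside the BMP):

public class BeyondBmp {
    public static void main(String[] args) {
        // U+20000 lies outside the BMP; in a Java String it occupies two char values (a surrogate pair).
        String s = new String(Character.toChars(0x20000));
        System.out.println(s.length());                       // 2 - UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // 1 - actual characters
    }
}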
 
James Sabre
Ranch Hand
Posts: 781
I thought I gave a reasonably full answer in the Sun forum but the forum police objected to your handle so my response is lost.

Unicode does, but Java doesn't cover all Chinese chars _directly_, since 2 bytes are not sufficient for all chars; the Unicode spec now uses 3 bytes to represent some special Chinese chars (CJK Unified Ideographs (Unihan)).
Just for info for the readers:

All Java Strings are Unicode, encoded as UTF-16. UTF-16 does cover ALL Unicode characters, by using two char code units (a surrogate pair) for characters with a Unicode value of 0x10000 (65536) or above.

I can't be bothered to re-write the rest of my response from the Javasoft forum. I will just say that when command line parameters are passed to the main() method they are converted to Java Strings using the default platform character encoding. When Java strings are passed as parameters to Runtime.exec(), they are converted using the default character encoding as well.
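
(If you want to see which encoding that will be on a given machine, a trivial check is something like:)

import java.nio.charset.Charset;

public class ShowDefaultEncoding {
    public static void main(String[] args) {
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());
    }
}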
[ October 13, 2008: Message edited by: James Sabre ]
 
Robert Grampp
Greenhorn
Posts: 5
Originally posted by James Sabre:
All Java Strings are Unicode, encoded as UTF-16. UTF-16 does cover ALL Unicode characters, by using two char code units (a surrogate pair) for characters with a Unicode value of 0x10000 (65536) or above.


Hi James,

you're right. When quickly trying to summarize a long article I got it wrong. For some CJK chars the first 16-bit unit (from the High Surrogate Area) selects one of the "planes" of the Unicode character set above 0xFFFF (that's what I had in mind with the 3 bytes). ...
I don't want to repeat it all here since that's not the topic.

I will just say that when command line parameters are passed to the main() method they are converted to Java Strings using the default platform character encoding. When Java strings are passed as parameters to Runtime.exec(), they are converted using the default character encoding as well.


Cmd has a different encoding (cp850) than the platform
(file.encoding; default: Windows: Cp1252, Linux: ISO-8859-1).
So how do I pass the arguments to the command, i.e. how do I supply the correct encoding for the command parameters? I think I need to do some kind of re-encoding of the string?
It's not a good idea to start the JVM with -Dfile.encoding=cp850; then all output would be in the wrong encoding.
I just know how to read the command output:
InputStreamReader isr = new InputStreamReader(is, "cp850");
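
A stripped-down sketch of how I currently read the output (using "cmd /c dir" here only as a stand-in for the real external program):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ReadCmdOutput {
    public static void main(String[] args) throws Exception {
        Process p = new ProcessBuilder("cmd", "/c", "dir").start();
        InputStream is = p.getInputStream();
        // Console programs write in the OEM code page (cp850 here), so the reader is told that explicitly.
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "cp850"));
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
        p.waitFor();
    }
}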




(Some commented-out tests are attached.)

Regards, Robert
 
James Sabre
Ranch Hand
Posts: 781

Cmd has a different encoding (cp850) than the platform
(file.encoding; default: Windows: Cp1252, Linux: ISO-8859-1).


Most modern Linux distributions have UTF-8 as the default encoding, not ISO-8859-1. Certainly the four I have access to have UTF-8 as the default. The default character encoding for Windows depends on how it is configured.

So how do I pass the arguments to the command, i.e. how do I supply the correct encoding for the command parameters? I think I need to do some kind of re-encoding of the string?


There are two aspects to this. First, I would expect the command line arguments passed to a ProcessBuilder Process to need the default character encoding, since this is the way they would be used from the command line. Since Java converts using the default character encoding unless told otherwise, there is nothing to do.

Second, when sending input to the Process's stdin you will need to know what character encoding the process expects and set up the java.io.OutputStreamWriter accordingly. Similarly, when reading from stdout and stderr, you will need to set up the java.io.InputStreamReader according to the character encoding that the Process produces. These encodings may or may not be the default character encoding.

I rarely provide character input to processes executed through Runtime.exec() or ProcessBuilder. I tend to write binary data, so no character encoding is required.
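
But if you do need character streams, a sketch could look like the following (the command name and the encodings are made-up examples; use whatever the actual program expects and produces):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class TalkToProcess {
    public static void main(String[] args) throws Exception {
        Process p = new ProcessBuilder("someprogram").start();

        // Writer for the process's stdin, using the encoding the process expects.
        Writer toProcess = new OutputStreamWriter(p.getOutputStream(), "UTF-8");
        toProcess.write("some input text\n");
        toProcess.close();

        // Reader for the process's stdout, using the encoding the process produces.
        // (Per the article linked below, stderr should be drained as well - omitted here for brevity.)
        BufferedReader fromProcess =
                new BufferedReader(new InputStreamReader(p.getInputStream(), "UTF-8"));
        String line;
        while ((line = fromProcess.readLine()) != null) {
            System.out.println(line);
        }
        p.waitFor();
    }
}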


It's not a good idea to start the JVM with -Dfile.encoding=cp850; then all output would be in the wrong encoding.


Quite!


I just know how to read the command output:
InputStreamReader isr = new InputStreamReader(is, "cp850");


Only if the process produces its output encoded as "cp850"!

Your handling of stdin, stdout and stderr looks suspect. You really should read, read again and implement the recommendations in http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-traps.html .
 