Paul Clapham wrote:If your document is really an XML document, then there are two things you should do:
1. Declare an encoding in the XML prolog, preferably UTF-8.
2. Write the XML document using that encoding. That either involves telling standard XML classes that you want that encoding, or if you're using your own BufferedWriter, wrap it around an OutputStreamWriter which uses that encoding.
If you find yourself implementing a solution where your code has to look at each character you're writing out, then you went the wrong way. All of this sort of thing is built into Java somewhere; you just have to find out where.
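A minimal sketch of the two steps above, assuming Java 7 or later. The class name Utf8XmlWriterSketch, the method writeXml and the <customer> element are made up for illustration; they are not from the original poster's code. The point is only that the OutputStreamWriter is given the same encoding that the prolog declares.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8XmlWriterSketch {

    // Writes a tiny XML document whose prolog and actual byte encoding agree.
    // Hypothetical helper: a real document would also need '&' and '<' escaped.
    public static void writeXml(OutputStream out, String customerName) throws IOException {
        // The OutputStreamWriter decides the encoding; the BufferedWriter only buffers.
        try (Writer writer = new BufferedWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            writer.write("<customer>" + customerName + "</customer>\n");
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // \u00e9 is the letter e with an acute accent, written as an escape
        // so that this example does not depend on the source file encoding.
        writeXml(buffer, "Ren\u00e9");
        // Decode the bytes as UTF-8 again to confirm the round trip.
        String roundTrip = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        System.out.println(roundTrip.contains("Ren\u00e9"));
    }
}
```

No per-character handling is needed: the writer turns the accented character into the bytes C3 A9 on its own.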
Peter Cong wrote:I am using word.document XML type, basically, I create a word document and it is xml type. Not a real pure XML document.
Paul Clapham wrote:
Peter Cong wrote:I am using word.document XML type, basically, I create a word document and it is xml type. Not a real pure XML document.
I'm sorry, I don't understand that. It doesn't seem like a beginner topic to me. But I repeat, if you find yourself having to fiddle with each character then you're doing it wrong. If this "word" thing is Microsoft Word then it should be able to handle standard encodings, so I encourage you to find out how to do that.
Peter Cong wrote:I do not know why you are confused. Basically, my Word file is a Word XML type: when you create a Word document, you can choose XML type, so my Word document is an XML type.
Paul Clapham wrote: . . . It doesn't seem like a beginner topic to me. . . .
No, it isn't. Let's see if I can't move it … to the wrong place.
Paul Clapham wrote:
Peter Cong wrote:I do not know why you are confused. Basically, my Word file is a Word XML type: when you create a Word document, you can choose XML type, so my Word document is an XML type.
I'm confused because I have never heard of a "word xml type". I just created a Word document (this is Microsoft Word we're talking about, right?) a few minutes ago and I didn't get to choose "xml type". You seem to think it's obvious what you're talking about but I've been using Java and Word for many years and I don't know what you're talking about. But then aren't we talking about Java programming here? Are you using some code package which you haven't mentioned? Since you posted in Beginning Java I'm assuming you're writing some simple code without reference to third-party APIs or whatever. So I think it would help if you explained your problem. Perhaps posting some code would help so we aren't completely in the dark.
g tsuji wrote:@Peter Cong
As a stop-gap, have you tried re-encoding the data (text) from the system's encoding or iso-8859-1 to utf-8, which I guess is the encoding expected in the docx?
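A hedged sketch of what such a re-encoding could look like, assuming the text really was UTF-8 bytes that got decoded as iso-8859-1 somewhere upstream. Nothing in the thread confirms that this is the actual failure; the class and method names are invented for this example.

```java
import java.nio.charset.StandardCharsets;

public class ReencodeSketch {

    // Undoes one specific mistake: UTF-8 bytes that were decoded as ISO-8859-1.
    // ISO-8859-1 maps every byte to exactly one char, so the original bytes
    // can be recovered and decoded again with the right charset.
    public static String repairLatin1Mojibake(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String correct = "caf\u00e9"; // \u00e9 is the accented e

        // Simulate the assumed bug: encode as UTF-8, decode as ISO-8859-1.
        byte[] utf8 = correct.getBytes(StandardCharsets.UTF_8);
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);

        // The accented letter has turned into two characters.
        System.out.println(garbled.length() - correct.length());

        // Re-encoding restores the original text.
        System.out.println(repairLatin1Mojibake(garbled).equals(correct));
    }
}
```

This is only a stop-gap. If the text was garbled in some other way, applying it will damage correct text, so it is better to find where the wrong charset is used.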
g tsuji wrote:In this concrete case, I don't even know what is actually being represented, and the result is said to be wrongly encoded...
Paul Clapham wrote:
g tsuji wrote:In this concrete case, I don't even know what is actually being represented, and the result is said to be wrongly encoded...
Yes, this (for us) is the basic problem. We haven't seen any code so we have no idea what we are dealing with. The guess that it's the input which is being mangled by using the wrong encoding -- yes, that sounds likely to me. But until the OP starts contributing to the dialog we aren't going to get any farther than that.
Peter Cong wrote:The word.xml template has many field holders like ##UserName##, so UserName will be replaced with the data from the reader. It works, except that some French characters such as "é" are not converted properly, so I created another method, XmlEncode, with this code...
Paul Clapham wrote:It's also worth pointing out that we have no idea what "not converted properly" means. If nobody has already linked to our FAQ entry TellTheDetails (<-- link) it would be worthwhile for you to read that and then give us a better description of the problem.
Paul Clapham wrote:Your first post doesn't say that. And telling us that you see "something like this" is unhelpful. What would be helpful would be for you to provide a precise and detailed description of the problem. If you didn't read that FAQ I linked to, please do that before posting.
Paul Clapham wrote:Is there a reason you don't want to provide information about this problem? Remember that you are not in a position to judge what is sufficient information for somebody else to diagnose a problem.
Here's what we know so far:
You write some data to a file. It's processed by some unknown template-handling system. (Or perhaps it isn't -- you have told us nothing about that.) When you load the results into Notepad++ you get data which is not what you expected to see, but neither is it invalid data which would prevent it from being processed successfully as a Word document.
Peter Cong wrote:I did provide detailed source in my previous post, if you read it carefully, I think it should be enough information to resolve the problem.
g tsuji wrote:I would check the general settings, such as:
[1] What encoding are the source file(s) saved in?
[2] What is the value of the system property "file.encoding"?
[3] Is the -encoding switch applied when compiling with javac?
etc... and then
[4] You can also set up your ht directly in the code instead of drawing it from the db, as you said you do. See how it behaves.
If these all agree, I don't think you have to replace the character with its numeric entity as you do in XmlEncode().
Also, to mention in passing,
[5] Properly speaking, docx is not a text file. It is a zip file, is it not?! There are frameworks that do the job properly, like docx4j. Have you looked into that?
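For checks [2] and [4], a small diagnostic along these lines might help. It is only a sketch; the class EncodingSettingsSketch and its methods are invented here. It reports the JVM defaults and dumps the UTF-16 code units of a string, so you can see what actually arrived instead of trusting how an editor or console renders it.

```java
import java.nio.charset.Charset;

public class EncodingSettingsSketch {

    // Reports the JVM-wide defaults that apply when no charset is given explicitly.
    public static String describeDefaults() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    // Returns the UTF-16 code units of the text as four-digit hex values,
    // separated by spaces. Intended for diagnosis only.
    public static String codeUnitsHex(String text) {
        StringBuilder hex = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%04X", (int) c));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(describeDefaults());
        // A correctly decoded accented e is the single code unit 00E9.
        // If the dump shows 00C3 00A9 instead, UTF-8 bytes were decoded
        // with a single-byte charset somewhere before this point.
        System.out.println(codeUnitsHex("\u00e9"));
    }
}
```

Running codeUnitsHex on a value hard-coded in the source and on the same value read from the db would show which of the two is already wrong before it reaches the writer.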
g tsuji wrote:Here is what you can test yourself.
[a] With an editor that supports utf-8, such as the notepad++ you use, save your source code msWordUtils.java in utf-8.
[b] Save word_template.xml in utf-8, in agreement with the prolog. Since it is all ascii, you will not notice any difference and need not do anything special.
[c] Compile with the switch -encoding utf8.
[c.1] As there are unchecked operations, the compiler also suggests adding -Xlint:unchecked to see the warnings if needed.
[d] In the source code, you have a couple of changes to make.
[d.1] You want to test CUSTOMERNAME, so give it a test value containing "é" directly in the source. Note that the source is in utf-8; viewed in a hex editor, the "é" shows up as C3 A9 to confirm.
[d.2] Do nothing on the reader lines (nothing is needed, as word_template.xml is all ascii in this case). If word_template.xml is not ascii but contains genuine utf-8 text, set up the reader in the same fashion as the writer below.
[d.3] Set up the writer in a slightly more elaborate fashion. Elaborate further if you think appropriate.
[d.4] Then, since the writer no longer supports newLine(), write the line break in an alternative way. (Make it more economical yourself by not calling System.getProperty() each time.)
[e] Since I claim you do not need XmlEncode(), it is retired.
That's about it.
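Steps [d.2] to [d.4] could be sketched roughly as follows. This is not the original msWordUtils.java, which is not shown in the thread; the class TemplateFillSketch, the method fill and the use of a Map in place of the ht are assumptions for illustration, and in-memory streams stand in for the real files.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class TemplateFillSketch {

    // Copies the template line by line, replacing ##KEY## markers with values.
    // Both reader and writer name UTF-8 explicitly, so the JVM default is never used.
    // Values are inserted as-is: '&' and '<' would still need XML escaping,
    // but accented letters need no numeric entities.
    public static void fill(InputStream template, OutputStream result,
                            Map<String, String> values) throws IOException {
        String lineSeparator = System.lineSeparator(); // looked up once, not per line
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(template, StandardCharsets.UTF_8));
             Writer writer = new OutputStreamWriter(result, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (Map.Entry<String, String> entry : values.entrySet()) {
                    line = line.replace("##" + entry.getKey() + "##", entry.getValue());
                }
                writer.write(line);
                writer.write(lineSeparator); // stands in for BufferedWriter.newLine()
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> values = new HashMap<>();
        values.put("CUSTOMERNAME", "Andr\u00e9"); // \u00e9 is the accented e

        byte[] template = "<name>##CUSTOMERNAME##</name>"
                .getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        fill(new ByteArrayInputStream(template), result, values);

        System.out.print(new String(result.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

With real files, FileInputStream and FileOutputStream would take the place of the in-memory streams; the charset arguments stay the same.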
g tsuji wrote:Just a final note.
[f] If you treat xml as a text file, it is useful to know precisely what freedom an xml document grants its author while still being considered semantically equivalent. That freedom may not be what a text file normally has. Hence it is always preferable for it to be handled by an xml parser of some kind. In this case we can perhaps get away with it, since it would be quite unimaginable for a place-holder to be broken across two or more lines. But that needs to be built into the rules for authoring a template of this kind.
Peter Cong wrote:... my Java web project, which uses this code, is on the Eclipse platform... So since I cannot resave my Java source code in utf-8 format...
Paul Clapham wrote:
Peter Cong wrote:... my Java web project, which uses this code, is on the Eclipse platform... So since I cannot resave my Java source code in utf-8 format...
Sure, you can tell Eclipse to use UTF-8 for your Java source code: Window -> Preferences -> General -> Workspace -> Text file encoding.
I strongly recommend you do that.