
Reading from URL, problems with encoding

 
Alex Gli
Greenhorn
Posts: 4
I am trying to write an application that reads a file from the internet (www.example.com/file.html), does some editing and then writes it to a file on my disk. The problem is that Central European characters are not shown correctly in the file on my disk. I know that the web page uses ISO-8859-2. I tried a few things but was not successful. How should I modify my code to get the proper result?

[ October 11, 2003: Message edited by: Alex Gli ]
 
Peter den Haan
author
Ranch Hand
Posts: 3252
The trick is to get encodings right.

This is mistake #1. The InputStreamReader needs to know the specific encoding of the bytes it is reading: pull it from the HTTP response headers or, slightly uglier, hardcode ISO-8859-2. Check the javadoc API for the appropriate constructor.

And this is certainly a cardinal sin in internationalised Java. You are using the write(int) method of OutputStream, which will just chop off the top 8 bits of your char and write out a byte. This basically ignores any encoding that's being used and will only ever work properly for 7-bit ASCII stuff.

What you need to do is use FileWriter instead of FileOutputStream; this will write Strings directly using your default encoding. Alternatively, if the default encoding won't do, simply wrap your FileOutputStream inside an OutputStreamWriter; you can use the latter's constructor to ask for any encoding that takes your fancy. As long as it is supported by your JRE, of course.
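A minimal sketch of how the advice above might look in code, assuming the page really is ISO-8859-2 and that UTF-8 is wanted on disk. The class name RecodeCopy, the helper copy and the output file name are made up for illustration, and try-with-resources needs Java 7 or later:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.net.URL;

public class RecodeCopy {

    /**
     * Decodes the bytes from 'in' using inCharset and re-encodes the
     * resulting characters to 'out' using outCharset.
     * (Hypothetical helper, not part of any library.)
     */
    static void copy(InputStream in, String inCharset,
                     OutputStream out, String outCharset) throws IOException {
        // Tell the reader which encoding the incoming bytes use.
        Reader reader = new BufferedReader(new InputStreamReader(in, inCharset));
        // Wrap the byte stream in a writer so chars are encoded, not truncated.
        Writer writer = new BufferedWriter(new OutputStreamWriter(out, outCharset));
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            // Any editing of the text would happen here, on chars.
            writer.write(buffer, 0, n);
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // Assumed URL and file name, taken from the question for illustration.
        URL url = new URL("http://www.example.com/file.html");
        try (InputStream in = url.openStream();
             OutputStream out = new FileOutputStream("file.html")) {
            copy(in, "ISO-8859-2", out, "UTF-8");
        }
    }
}
```

Passing the same charset name for both arguments keeps the file in its original encoding; the point is only that both ends name an encoding explicitly instead of relying on write(int).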
- Peter
[ October 11, 2003: Message edited by: Peter den Haan ]
 
Alex Gli
Greenhorn
Posts: 4
Thanks for your suggestions Peter, I kind of got it working. Now can you help with some code that would get the encoding of a particular file on the internet? Is there a method for that, or do I have to check for a <meta> tag to get the proper encoding?
Thanks in advance
 
Peter den Haan
author
Ranch Hand
Posts: 3252
The character set used is returned as part of the HTTP headers, not necessarily as part of the actual response body. For instance, this JavaRanch page arrived at my browser with the following headers:

(courtesy of Mozilla Firebird with the Live HTTP Headers plugin).

As you see, it's the Content-Type header that (optionally) supplies you with the encoding being used on the web page. To get at the HTTP headers, don't open the input stream from the URL object but open the connection explicitly:

Hope this helps,
- Peter
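A rough sketch of reading the charset from the Content-Type header, using URLConnection.getContentType() from the standard library. The class name CharsetSniffer, the helper charsetOf and the ISO-8859-2 fallback are assumptions for illustration, and the parsing is deliberately simple rather than a full header parser:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class CharsetSniffer {

    /**
     * Returns the charset parameter of a Content-Type value such as
     * "text/html; charset=ISO-8859-2", or 'fallback' when none is present.
     * (Hypothetical helper, not part of any library.)
     */
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters follow it.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = param.substring(8).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (value.length() > 0) {
                    return value;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) throws IOException {
        // Open the connection explicitly so the response headers are available.
        URLConnection conn = new URL("http://www.example.com/file.html").openConnection();
        // The header is optional, so fall back to an assumed default.
        String charset = charsetOf(conn.getContentType(), "ISO-8859-2");
        System.out.println("Charset to use: " + charset);
        // conn.getInputStream() can then be wrapped in an InputStreamReader with this charset.
    }
}
```

If the server sends no charset in the header, checking the page's <meta> tag remains a possible second step; the sketch above only covers the header.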
 