
ISO 8859-1 and nothing else...

 
L. Stinson
Greenhorn
Posts: 1
I have a requirement to allow only ISO 8859-1 characters. Is there a way to generate an error message when a character that is not in this set is read in? I tried
BufferedReader nb5File = new BufferedReader(new InputStreamReader(new FileInputStream(linearSpatial), "8859_1"));

It just lets in any characters.
 
Elihu Smails
Ranch Hand
Posts: 37
Looks to me like the perfect situation for writing your own InputStream or Reader. I would start by overriding int read() and checking that the returned value is within the proper range.
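A minimal sketch of that idea, assuming a wrapper built on java.io.FilterReader. The class name RangeCheckingReader and the maxAllowed parameter are hypothetical, not something from the original post. Note that when the underlying reader decodes ISO 8859-1, every char it produces is already in the range 0-255, so the limit only has an effect if it is stricter than 0xFF (for example 0x7F to accept ASCII only).

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical wrapper: fails as soon as a char above maxAllowed is read. */
class RangeCheckingReader extends FilterReader {
    private final int maxAllowed; // assumption: highest char value to accept

    RangeCheckingReader(Reader in, int maxAllowed) {
        super(in);
        this.maxAllowed = maxAllowed;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c != -1) {
            check(c);
        }
        return c;
    }

    // BufferedReader reads through this overload, so it must be checked too.
    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = off; i < off + n; i++) {
            check(buf[i]);
        }
        return n;
    }

    private void check(int c) throws IOException {
        if (c > maxAllowed) {
            throw new IOException(
                String.format("Character U+%04X is outside the allowed range", c));
        }
    }
}
```

To use it with the code from the first post, the wrapper would sit between the two existing readers: new BufferedReader(new RangeCheckingReader(new InputStreamReader(...), 0x7F)).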
 
M Beck
Ranch Hand
Posts: 323
i thought ISO 8859-1 was an 8-bit proper superset of ASCII. wouldn't that make any eight-bit sequence potentially valid ISO 8859-1? are there any codes outright disallowed by this standard?
 
Elihu Smails
Ranch Hand
Posts: 37
unless you check each 8-bit sequence against the ASCII character set and only allow a certain range.
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
[M Beck]: i thought ISO 8859-1 was an 8-bit proper superset of ASCII. wouldn't that make any eight-bit sequence potentially valid ISO 8859-1? are there any codes outright disallowed by this standard?

My understanding is that ISO 8859-1 does not include values 128-159. These are invalid. However, many people who claim to be using ISO 8859-1 are really using Cp-1252, which is Microsoft Windows Latin 1. Cp-1252 does use these values, for things like "smart quotes" (“ ” ‘ ’). Unicode also defines uses for values 128-159, but they're all unprintable control characters. Other encodings may have other uses for these values. If you encounter input with bytes in the range 128-159, it is important that you find out what encoding is really being used. It's not ISO 8859-1.
[ January 31, 2005: Message edited by: Jim Yingst ]
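One way to act on this is to look at the raw bytes before any decoding happens, since Java's ISO-8859-1 decoder maps every byte value 0-255 straight to the char with the same value and never reports an error (which is why the InputStreamReader in the first post accepts everything). The following is only a sketch; the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: scans raw bytes for the 128-159 (0x80-0x9F) range. */
class Latin1ByteCheck {

    /** Returns the offset of the first byte in 0x80..0x9F, or -1 if there is none. */
    static long firstSuspectOffset(InputStream in) throws IOException {
        long offset = 0;
        int b;
        // InputStream.read() returns the byte as an int in 0..255, or -1 at end of stream.
        while ((b = in.read()) != -1) {
            if (b >= 0x80 && b <= 0x9F) {
                return offset;
            }
            offset++;
        }
        return -1;
    }
}
```

A result other than -1 suggests the file was probably written in some other encoding, such as Cp-1252, and the offset gives a place to start looking. For large files the stream would normally be wrapped in a BufferedInputStream first.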
 
Francis Shillitoe
Greenhorn
Posts: 22
Originally posted by M Beck:
i thought ISO 8859-1 was an 8-bit proper superset of ASCII. wouldn't that make any eight-bit sequence potentially valid ISO 8859-1? are there any codes outright disallowed by this standard?


I agree. Any 8-bit sequence is potentially ISO 8859-1. This is why web sites set an encoding value in the HTTP response, so the browser knows which character set to use when displaying it. Unless your input stream contains some pre-determined marker to indicate it is ISO 8859-1, or you trust the source, you are not going to know the encoding.

Francis
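To illustrate why the bytes alone cannot tell you the encoding, here is a small sketch that decodes the same two bytes under two charsets. Both decodes succeed; only the resulting text differs. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: identical bytes, two equally "valid" interpretations. */
class SameBytesTwoEncodings {

    static String decodeLatin1(byte[] bytes) {
        // Each byte becomes the char with the same numeric value.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    static String decodeUtf8(byte[] bytes) {
        // Multi-byte sequences are combined into single code points.
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

For the bytes 0xC3 0xA9, decodeUtf8 gives the single character é (U+00E9), while decodeLatin1 gives the two characters Ã© (U+00C3, U+00A9). Neither call fails, so the reader has to know the encoding from somewhere else.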
 