How to find encoding of byte[]

 
Jitesh Sinha
Ranch Hand
Hi Gurus,

I have a web service that receives a file (HTML, XML or TXT) as a byte[]. I need to know which encoding this byte array is in. We have to distinguish between 3 encodings: UTF-16, UTF-8 and ANSI.
Is there a surefire way to do this? Please help.

Thanks
-Jitesh
 
James Sabre
Ranch Hand

Jitesh Sinha wrote:
I have a web service that receives a file (HTML, XML or TXT) as a byte[]. I need to know which encoding this byte array is in. We have to distinguish between 3 encodings: UTF-16, UTF-8 and ANSI.
Is there a surefire way to do this? Please help.



Short answer - No. Slightly longer answer - the UTF-16 encoding of text in western languages will likely have a zero in every alternate byte, whereas neither UTF-8 nor ANSI (by which you probably mean CP-1252) will normally contain alternating zero bytes. Telling the difference between UTF-8 and CP-1252 is very difficult.
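The alternating-zero observation can be turned into a rough check. What follows is only a sketch of that heuristic, not a reliable detector: the class name `Utf16Heuristic`, the method `guessUtf16` and the 70% / 10% thresholds are all made up for illustration, and text that is not mostly Latin characters will defeat it.

```java
final class Utf16Heuristic {

    /**
     * Guesses whether the bytes look like UTF-16 text made mostly of
     * Latin characters, based on where the zero bytes fall.
     * Returns "UTF-16BE", "UTF-16LE", or null if neither pattern dominates.
     * The thresholds below are arbitrary assumptions, not established values.
     */
    static String guessUtf16(byte[] data) {
        int pairs = data.length / 2;
        if (pairs == 0) {
            return null;
        }
        int zeroAtEven = 0; // zero in the first byte of a pair: big-endian pattern
        int zeroAtOdd = 0;  // zero in the second byte of a pair: little-endian pattern
        for (int i = 0; i + 1 < data.length; i += 2) {
            if (data[i] == 0) {
                zeroAtEven++;
            }
            if (data[i + 1] == 0) {
                zeroAtOdd++;
            }
        }
        if (zeroAtEven >= pairs * 0.7 && zeroAtOdd <= pairs * 0.1) {
            return "UTF-16BE";
        }
        if (zeroAtOdd >= pairs * 0.7 && zeroAtEven <= pairs * 0.1) {
            return "UTF-16LE";
        }
        return null;
    }
}
```

A null result only means the zero-byte pattern is absent; it says nothing about whether the data is UTF-8 or CP-1252.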
 
Jitesh Sinha
Ranch Hand
So what do you suggest is the best way of dealing with this? Should I ask clients of the web service to provide the encoding as a separate parameter? That way I will know it beforehand and can do my manipulations based on this additional parameter.
 
James Sabre
Ranch Hand

Jitesh Sinha wrote:So what do you suggest is the best way of dealing with this? Should I ask clients of the web service to provide the encoding as a separate parameter? That way I will know it beforehand and can do my manipulations based on this additional parameter.



Unless there is only one possible encoding, one should always get the character encoding delivered with the bytes.
 
Paul Clapham
Marshal
I would go one step further and specify the encoding which your clients must use. UTF-8 would be a good choice if you want something non-specific. And by the way, "ANSI" isn't an encoding.
 
Jitesh Sinha
Ranch Hand
Thanks James and Paul. That was helpful.

OK, so I am going to go with UTF-8 encoding. I have done some proof-of-concept coding on my local setup. What I see is that if I get the byte[] as UTF-8, the first two characters (BOM characters) are additional ones that the user did not intend to put there. I have to convert this byte[] to a String and send it as input to another application. That application does not like these BOM characters and behaves weirdly, so I need to find a way to remove them.
What I can do is just ignore the first 2 characters. Is that a good way of dealing with this problem?
 
author and iconoclast
When you convert the byte[] to a String, you must specify "UTF-8" as the encoding; the "extra" characters will then disappear.
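To make the effect of the charset argument concrete, here is a small sketch that decodes the same bytes twice. The class and method names (`DecodeDemo`, `decodeUtf8`, `decodeLatin1`) are invented for the example, and ISO-8859-1 merely stands in for "some single-byte encoding picked by mistake". One caveat worth knowing: in Java the three UTF-8 BOM bytes do not vanish entirely when decoded as UTF-8; they collapse into the single character U+FEFF at the start of the String.

```java
import java.nio.charset.StandardCharsets;

final class DecodeDemo {

    /** Decodes the bytes as UTF-8, naming the charset explicitly. */
    static String decodeUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    /** Decodes the same bytes as ISO-8859-1, where every byte becomes one char. */
    static String decodeLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // UTF-8 BOM (EF BB BF) followed by the two ASCII letters "Hi".
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i'};

        // Wrong charset: the BOM shows up as three junk characters.
        System.out.println(decodeLatin1(data).length()); // prints 5

        // Right charset: the BOM becomes one U+FEFF character.
        System.out.println(decodeUtf8(data).length());   // prints 3
    }
}
```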
 
Jitesh Sinha
Ranch Hand
Thanks Ernest.

How do I identify whether a String or byte[] is UTF-8 encoded? It seems I cannot get away from having to differentiate between UTF-8 and ASCII.

 
James Sabre
Ranch Hand

Jitesh Sinha wrote:
How do I identify whether a String or byte[] is UTF-8 encoded? It seems I cannot get away from having to differentiate between UTF-8 and ASCII.



Wikipedia has a list - http://en.wikipedia.org/wiki/Byte-order_mark

If you are receiving the bytes as an InputStream then you can remove the BOM using my class BOMStripperInputStream, which detects any of the standard BOMs and eliminates it. I originally published BOMStripperInputStream on the Sun Java forums, but since Oracle's takeover of Sun it is hard to find in the remnants of that site. The class was plagiarised on several sites but, as far as I am aware, is now only available at http://code.google.com/p/train-graph/source/browse/trunk/src/org/paradise/etrc/data/BOMStripperInputStream.java?r=85. After I tackled the 'train-graph' project they added my handle as the author. I have a new improved version that I will publish soon.


Edit: Google indicates that BOMStripperInputStream is available on at least 4 (independent?) sites. It looks like the plagiarised site has itself been plagiarised!
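For anyone who only has a byte[] rather than a stream, the BOM check itself is short. The sketch below is not the BOMStripperInputStream class mentioned above; `BomSniffer` and `detectBom` are hypothetical names, and it covers only the UTF-8 and UTF-16 marks discussed in this thread (the UTF-32 marks are deliberately ignored, so FF FE 00 00 would be reported as UTF-16LE).

```java
final class BomSniffer {

    /**
     * Returns the charset name implied by a leading byte-order mark,
     * or null when the data does not start with one of the handled BOMs.
     * A null result does NOT mean "not UTF-8": a BOM is optional in UTF-8.
     */
    static String detectBom(byte[] data) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }
}
```

The masks with 0xFF are needed because Java bytes are signed, so (byte) 0xEF compares as a negative number otherwise.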
 
Jitesh Sinha
Ranch Hand
James, that is awesome.
I just need to identify whether a byte[] is UTF-8. Is there a good way of doing this?
Once I know that a byte[] is UTF-8, I can treat it as UTF-8 as Ernest said, and then I do not need to strip any characters.

Thanks.
 
Paul Clapham
Marshal

Jitesh Sinha wrote:I just need to identify whether a byte[] is UTF-8. Is there a good way of doing this?



There isn't any good way. One of the less good ways is to assume it is UTF-8, and if you get an exception, then start over and read it as ASCII. But if you're going to support more than one encoding then you should require your users to specify what encoding they are using. Or say that the default you will use will be ASCII unless they specify otherwise. It's your web service and you have the right (and the responsibility) to specify how people can use it.
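A sketch of the "assume UTF-8, fall back on failure" approach, for what it is worth. One detail matters here: new String(bytes, "UTF-8") does not throw on malformed input, it silently substitutes U+FFFD, so getting an exception requires a CharsetDecoder set to REPORT errors. The names `Utf8OrFallback` and `decode` are invented for the example, and the fallback charset is a parameter because the thread has not settled on one.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8OrFallback {

    /**
     * Tries a strict UTF-8 decode first. If the bytes are not well-formed
     * UTF-8, decodes them with the supplied fallback charset instead.
     */
    static String decode(byte[] data, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of inserting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so treat it as the fallback encoding.
            return new String(data, fallback);
        }
    }
}
```

This is still one of the "less good ways": text in a single-byte encoding that happens to form valid UTF-8 sequences will be decoded as UTF-8 without complaint.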
 
Jitesh Sinha
Ranch Hand
Thanks Paul... Can't I just check for the presence of BOM characters to see if a byte[] is UTF-8?
 
Marshal
No. You can go a long way with an ISO8859-1 encoding, thinking it is UTF-8, before you find out to the contrary.
 
Jitesh Sinha
Ranch Hand
I see - you mean those characters (i.e. BOM characters) can be present in ISO8859-1 encoding as well?
 
Paul Clapham
Marshal

Jitesh Sinha wrote:I see - you mean those characters (i.e. BOM characters) can be present in ISO8859-1 encoding as well?



No, I don't think that's likely. But it's perfectly possible to write UTF-8 data without a BOM, so you can't rely on the absence of a BOM to tell you anything.
 
Jitesh Sinha
Ranch Hand
Thanks Paul.

I am facing a weird issue.
When I do
String str = new String(myByteArray, "UTF-8");

str has <U+FEFF> at the beginning. Even if I do not specify any encoding in the above constructor, these characters still come up.
myByteArray is a byte[] that contains a file uploaded through a JSP; the file was saved using UTF-8 encoding.

Can someone please help? Thanks!

 
Jitesh Sinha
Ranch Hand
Upon searching for <U+FEFF>, it seems it is the BOM for UTF-16. So that means even if I save using Notepad in UTF-8 encoding on my local Windows OS, it will behave like UTF-16!!
So should I request clients of the web service to make sure they do not have BOM characters in their files?
 
James Sabre
Ranch Hand

Jitesh Sinha wrote:Thanks Paul.

I am facing a weird issue.
When I do
String str = new String(myByteArray, "UTF-8");

str has <U+FEFF> at the beginning. Even if I do not specify any encoding in the above constructor, these characters still come up.
myByteArray is a byte[] that contains a file uploaded through a JSP; the file was saved using UTF-8 encoding.

Can someone please help? Thanks!



I thought we had covered all of this. If you go to the BOM Wikipedia page I cited you will find that FEFF is the BOM for UTF-16 (BE), so your bytes are not UTF-8! When you construct the String, specify UTF-16BE and start from byte 2!
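Before switching charsets it may be worth looking at the raw bytes, because a U+FEFF at the start of a decoded String does not by itself say which encoding the file was in: U+FEFF is the BOM code point, and Java's UTF-8 decoder produces it from the bytes EF BB BF just as a UTF-16BE decode produces it from FE FF. The sketch below prints the leading bytes and drops a leading U+FEFF after decoding; `BomStripper`, `hexPrefix` and `stripLeadingBom` are names invented for the example.

```java
final class BomStripper {

    /** Formats the first n bytes as hex, e.g. "EF BB BF", for inspection. */
    static String hexPrefix(byte[] data, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(n, data.length); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    /** Removes a single leading U+FEFF from an already decoded String. */
    static String stripLeadingBom(String s) {
        if (!s.isEmpty() && s.charAt(0) == '\uFEFF') {
            return s.substring(1);
        }
        return s;
    }
}
```

If hexPrefix(myByteArray, 3) shows EF BB BF, the data is UTF-8 with a BOM and stripping the decoded U+FEFF is enough; if it starts with FE FF, it really is UTF-16BE.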
 
Java Cowboy

Jitesh Sinha wrote:How do I identify whether a String or byte[] is UTF-8 encoded? It seems I cannot get away from having to differentiate between UTF-8 and ASCII.


If you only have to differentiate between UTF-8 and ASCII, then that's easy: ASCII is a subset of UTF-8, so you could always regard it as UTF-8 and it will work.
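A quick way to see the subset relationship: ASCII only uses byte values 0 to 127, and those bytes mean the same thing in UTF-8. The sketch below checks for that; `AsciiCheck` and `isPureAscii` are hypothetical names chosen for the example.

```java
final class AsciiCheck {

    /**
     * Returns true when every byte is in the 7-bit ASCII range (0..127).
     * Java bytes are signed, so any value of 128 or above shows up as negative.
     */
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }
}
```

When isPureAscii returns true, decoding the bytes as US-ASCII or as UTF-8 gives the same String, so nothing needs to be distinguished. It is the bytes above 127 that cause trouble.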
 
Jitesh Sinha
Ranch Hand
I am having issues with extended ASCII character sets. The character ’ (ASCII value 146) becomes garbled (? or a blank square) when converted to UTF-8. This happens when a file is saved in "ANSI" encoding using Notepad and is then read by my Java code.

If the same file is saved in UTF-8 encoding, it is all fine.
Does that mean only ASCII values up to 127 can be converted to UTF-8 without being garbled?
Can someone please explain?
Thanks!!
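The symptom can be reproduced in a few lines. Byte 146 (0x92) is not ASCII; in windows-1252, which is what Notepad's "ANSI" usually means on a western Windows install, it is the right single quote ’ (U+2019). On its own it is not a valid UTF-8 sequence, so a UTF-8 decode replaces it with U+FFFD. This sketch assumes the JVM provides the windows-1252 charset, and `Cp1252Demo` and its methods are names invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1252Demo {

    // Assumption: the runtime supports windows-1252 (most full JDKs do).
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Decodes the bytes as windows-1252 ("ANSI" on western Windows). */
    static String asCp1252(byte[] data) {
        return new String(data, CP1252);
    }

    /** Decodes the same bytes as UTF-8; invalid bytes become U+FFFD. */
    static String asUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] ansi = {'i', 't', (byte) 0x92, 's'}; // "it’s" as saved in "ANSI"

        System.out.println(asCp1252(ansi).charAt(2) == '\u2019'); // prints true
        System.out.println(asUtf8(ansi).charAt(2) == '\uFFFD');   // prints true

        // Decoded correctly, the text re-encodes to UTF-8 without loss:
        // the quote becomes the three bytes E2 80 99.
        System.out.println(asCp1252(ansi).getBytes(StandardCharsets.UTF_8).length); // prints 6
    }
}
```

So characters above 127 can be converted to UTF-8 perfectly well, provided the bytes are first decoded with the charset they were actually written in.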
 
Jitesh Sinha
Ranch Hand
I wrote a standalone program that just reads the file content byte by byte and writes it to another file. This worked absolutely fine and preserved all characters.

In my application, though, I need to call a REST web service. I am using the Jersey client API to call the service. My web service never gets the data correctly (for the kind of content mentioned in the above post) - when it gets the content, it is already garbled.
I am writing both the client and the service side of the code.

Both sides are using the Spring framework.
On the service side, we are using org.springframework.web.filter.CharacterEncodingFilter to set the encoding to UTF-8. We have done this in web.xml.
I guess this much should be OK.

When I do command.getFile(), that is when something goes wrong. There is not much code involved here, so I suspect something fundamental is going wrong. But I have no clue what that is.

Please help.

Thanks!
 