Question strings and character encodings

 
Robin Dee
Greenhorn
Posts: 7
Hello all,

I'm quite new to Java (not to programming though) and I've run into some kind of problem in a tiny app I wrote. I hope I can get some help here :-).

What's going on? Using Apache's PDFBox I extract some text from PDF files. After extracting the text from a PDF, I compute an MD5 hash of it and store that in a database. That works just fine, except in some special cases: if the text contains non-ASCII characters, the outcome of the MD5 hashing differs depending on whether I run my Java app on a Linux or a Windows system. My guess would be that this comes from the difference in default character encodings used by Linux and Windows.

What would be a good way to solve this issue? Can I force my string to be converted into some specific encoding (LATIN-1 for example) before applying the MD5 hash in order to guarantee identical results on Windows and Linux?

Best, Robin
 
Paul Clapham
Sheriff
Posts: 21316
When you say "the outcome of the MD5 hashing", are you comparing the array of bytes which is the outcome? Or are you converting that array of bytes to a String?

If it's the latter, then don't do that. You should only convert bytes to a String if they represent text. And an MD5 hash doesn't represent text.
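
A minimal sketch of why that conversion is risky, assuming the bytes are decoded as UTF-8 (the class name and the three sample bytes are made up for illustration; they are not a real digest):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BytesAreNotText {
    // Decodes arbitrary bytes as UTF-8 text and encodes the result back to bytes.
    static byte[] roundTrip(byte[] raw) {
        String asText = new String(raw, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical binary data that is not valid UTF-8.
        byte[] raw = {(byte) 0xd4, 0x1d, (byte) 0x8c};
        // Invalid sequences are replaced during decoding, so the original bytes are lost.
        System.out.println(Arrays.equals(raw, roundTrip(raw))); // prints "false"
    }
}
```

Once binary data has been pushed through a String like this, the original bytes cannot be recovered reliably.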

And I think you have the string encoding concept backwards. You said:
"Can I force my string to be converted into some specific encoding"

But it's the array of bytes which is in some encoding. A String is never in any encoding, since it just represents a sequence of Unicode code points. You can certainly encode a String into an array of bytes using any encoding you like. (But Latin-1 would be a bad choice, since it can't represent all Unicode characters.)
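
To illustrate, here is a small sketch (the class name, helper methods and sample text are assumptions chosen for the example) that encodes the same String with two different charsets:

```java
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    // Sample text: "cafe" with an accented e, a space, and the euro sign.
    static final String SAMPLE = "caf\u00e9 \u20ac";

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(utf8(SAMPLE).length);   // 9 bytes
        System.out.println(latin1(SAMPLE).length); // 6 bytes
        // Latin-1 has no euro sign, so that character is replaced and cannot be recovered.
        String back = new String(latin1(SAMPLE), StandardCharsets.ISO_8859_1);
        System.out.println(back.equals(SAMPLE));   // prints "false"
    }
}
```

The String is the same in both cases; only the byte arrays differ, and a hash computed over them would differ too. Passing the charset explicitly also avoids relying on the platform default.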
 
Robin Dee
Greenhorn
Posts: 7
Paul Clapham wrote: When you say "the outcome of the MD5 hashing", are you comparing the array of bytes which is the outcome? Or are you converting that array of bytes to a String?

If it's the latter, then don't do that. You should only convert bytes to a String if they represent text. And an MD5 hash doesn't represent text.


Hi Paul,

I'm using only strings: I store the output of the PDF text extraction in a String variable, and both the input and the output of the MD5 hash are strings. The output should be a string (I don't see any harm in that, as MD5 hashes only contain ASCII characters, right?).

Thanks!
 
Jelle Klap
Bartender
Posts: 1952
This is commonly solved by obtaining a byte representation of the (usually password) String in a fixed encoding (e.g. UTF-8), applying the one-way hashing algorithm of your choosing to that byte sequence to obtain the digest, applying BASE64 encoding to that digest and storing it in the database in US-ASCII encoding.
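
A minimal sketch of those steps, assuming Java 8 or later for `java.util.Base64` (the class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

final class TextDigest {
    // Hashes the UTF-8 bytes of the text and returns the digest as Base64 text.
    static String md5Base64(String text) throws NoSuchAlgorithmException {
        // Fixed encoding, so the bytes are the same on every platform.
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // The raw MD5 digest is 16 bytes of binary data.
        byte[] digest = MessageDigest.getInstance("MD5").digest(bytes);
        // Base64 output consists of ASCII characters only.
        return Base64.getEncoder().encodeToString(digest);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(md5Base64("caf\u00e9"));
    }
}
```

The returned value is always 24 ASCII characters for MD5, so it fits a fixed-width text column.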
 
Robin Dee
Greenhorn
Posts: 7
Jelle Klap wrote: This is commonly solved by obtaining a byte representation of the (usually password) String in a fixed encoding (e.g. UTF-8), applying the one-way hashing algorithm of your choosing to that byte sequence to obtain the digest, applying BASE64 encoding to that digest and storing it in the database in US-ASCII encoding.


Hi Jelle,

I figure I only need to do the first step: obtain a byte representation of the String and pass that to my MD5 hasher, since the output of the MD5 hasher is ASCII-encoded? Or would there be any good reason to convert the digest to BASE64 and store that?

Best, Robin

P.S. Dutch, I presume? ;)
 
Paul Clapham
Sheriff
Posts: 21316
I don't know what MD5 hasher you are using. The result of an MD5 hash is a 16-byte array. Not ASCII. However you may be using some option which converts that array to its representation as 32 hexadecimal characters, in which case that is ASCII.
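
For illustration, a sketch of that hexadecimal conversion using the JDK's `MessageDigest` (the class and method names here are hypothetical, and hashing the UTF-8 bytes is an assumption, not necessarily what your hasher does):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HexDigest {
    // Renders each byte as two lowercase hexadecimal characters.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask with 0xff so negative bytes are treated as 0..255.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hashes the UTF-8 bytes of the text and returns 32 hex characters.
    static String md5Hex(String text) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(text.getBytes(StandardCharsets.UTF_8));
        return toHex(digest);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(md5Hex("")); // d41d8cd98f00b204e9800998ecf8427e
    }
}
```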
 
Jelle Klap
Bartender
Posts: 1952
An MD5 digest has a fixed length of 128 bits, which is typically returned by the digester as an array of 16 bytes.
No character encoding is applied to those bytes, which is where the BASE64 encoder comes in: it turns them into ASCII text that can be stored as a string.
And yes, I am Dutch.

Edit: Ugh, too slow.
 
Campbell Ritchie
Sheriff
Posts: 49770
Too difficult for "beginning". Moving thread.
 