"Contains" with UTF-8

 
Greg Werner
Ranch Hand
Posts: 54
I just don't see it right away: how would I find a UTF-8 "substring" in a UTF-8 word? (I am trying to state this without letting Java objects and primitives cloud my thinking.)

Example: Москва. Does this word contain the character sequence скв? Answer: Yes.

Since we are dealing with byte[] when talking about character encoding, the problem does not seem so straightforward.
 
Greg Werner
Ranch Hand
Posts: 54
I have a hack of a solution for Russian Cyrillic, but I am still looking for a more general algorithm or solution in Java. Here is the scenario that worked for me.

I crawled a bunch of Russian documents and wrote them to disk as UTF-8 files. I then tokenized them on spaces into a single UTF-8 file containing a long list of words, most of them presumably Russian. I then created another file with the entire Russian Cyrillic alphabet, upper and lower case.

In my word file, 3 bytes were prepended to each line for some reason, followed very predictably by 2 bytes per Cyrillic character. I know not every alphabet uses 2 bytes per character, but that is the case for Cyrillic. So an 8-character Russian word is 19 bytes: I ignore the first 3, then compare each of the 8 byte pairs with the alphabet file, which I read in as a single String.

With this I was able to pick out all Russian words, or at least all words containing only Cyrillic characters. That is equivalent to checking that a word contains no non-Cyrillic character, i.e. that each character in the word is contained in the String representing the Russian Cyrillic alphabet.

This solution fails for a number of other languages, so again I would like a more general solution that works for any language, or perhaps a library that attempts to do such a thing.
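One way a more general check might look, as a sketch rather than a definitive answer: the JDK's Character.UnicodeScript (Java 7+) reports which script a code point belongs to, so no hand-built alphabet file is needed. The class name ScriptCheck and the helper isAllInScript are illustrative names, not from the thread.

```java
import java.lang.Character.UnicodeScript;

public class ScriptCheck {
    /** True if the word is non-empty and every code point belongs to the given script. */
    public static boolean isAllInScript(String word, UnicodeScript script) {
        return !word.isEmpty()
            && word.codePoints().allMatch(cp -> UnicodeScript.of(cp) == script);
    }

    public static void main(String[] args) {
        // All six letters are Cyrillic.
        System.out.println(isAllInScript("Москва", UnicodeScript.CYRILLIC)); // true
        // Same word, but the first letter is a Latin capital M.
        System.out.println(isAllInScript("Mосква", UnicodeScript.CYRILLIC)); // false
        // The same helper works for other scripts.
        System.out.println(isAllInScript("Moscow", UnicodeScript.LATIN));    // true
    }
}
```

Note that digits, hyphens and punctuation belong to the COMMON script, so a token such as a hyphenated word would be rejected by this strict version; whether that is desirable depends on the tokenizer.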

 
Ulf Dittmer
Rancher
Posts: 42972
Greg Werner wrote:
Since we are dealing with byte[] when talking about character encoding

I don't understand this - why would you want to deal with byte[] when the data is actually text? What prevents you from keeping two strings and then using the methods of the String class?
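A minimal sketch of that idea, assuming the bytes really are UTF-8: decode them once into a String and then use the ordinary String methods. The class name Utf8Contains and the helper decodeUtf8 are illustrative. Stripping a leading U+FEFF rests on the assumption that the 3 unexplained bytes mentioned earlier are a UTF-8 byte order mark (EF BB BF); the thread does not confirm this.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Contains {
    /** Decodes UTF-8 bytes to a String, dropping a leading byte order mark if present. */
    public static String decodeUtf8(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        // The bytes EF BB BF decode to the single char U+FEFF, which Java does not strip.
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // Stand-in for bytes read from a UTF-8 file: a BOM followed by the word.
        byte[] raw = "\uFEFFМосква".getBytes(StandardCharsets.UTF_8);
        String word = decodeUtf8(raw);
        System.out.println(word.contains("скв")); // true
    }
}
```

Once the text is a String, the number of bytes each character occupied on disk no longer matters to the comparison.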
 
Greg Werner
Ranch Hand
Posts: 54
Ulf Dittmer wrote:
Since we are dealing with byte[] when talking about character encoding

I don't understand this - why would you want to deal with byte[] when the data is actually text? What prevents you from keeping two strings and then using the methods of the String class?


Ah yes, as long as I keep the text in an object that is a CharSequence throughout my execution path, I can use contains(). So if I need a single character from a string, I need to do like substring(pos, pos+1) where pos is an arbitrary position.
 
Jesper de Jong
Java Cowboy
Sheriff
Posts: 16060
Greg Werner wrote:So if I need a single character from a string, I need to do like substring(pos, pos+1) where pos is an arbitrary position.

Better is just call charAt(pos) to get the character at the specified position.

But you don't need to do complicated things with separate characters, just use the contains() method of class String:
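A minimal sketch of such a call, using the example word from the first post. The class name ContainsDemo, the helper onlyUses and the RUSSIAN_ALPHABET constant are illustrative names, not from the thread.

```java
public class ContainsDemo {
    /** The 33 letters of the Russian alphabet, upper and lower case. */
    public static final String RUSSIAN_ALPHABET =
        "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
        + "абвгдеёжзийклмнопрстуфхцчшщъыьэюя";

    /** True if every char of word occurs somewhere in alphabet. */
    public static boolean onlyUses(String word, String alphabet) {
        for (int i = 0; i < word.length(); i++) {
            if (alphabet.indexOf(word.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String word = "Москва";
        System.out.println(word.contains("скв"));             // true
        System.out.println(word.charAt(2));                   // с
        System.out.println(onlyUses(word, RUSSIAN_ALPHABET)); // true
    }
}
```

charAt returns one UTF-16 code unit, which is a whole character for Cyrillic; characters outside the Basic Multilingual Plane would need codePointAt instead.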
 
Greg Werner
Ranch Hand
Posts: 54

Jesper de Jong wrote:
Better is just call charAt(pos) to get the character at the specified position.


No, that was the approach that was not working for me. I was trying to get the text character by character and compare each one with a String containing the entire alphabet. The char returned did not match my String.
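One common cause of a mismatch like this, though nothing in the thread confirms it was the cause here, is decoding UTF-8 bytes with a different charset (for example a platform default). The sketch below shows the effect; the class name WrongCharsetDemo and the helper decodeAs are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {
    /** Decodes the given bytes using the given charset. */
    public static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] utf8 = "Москва".getBytes(StandardCharsets.UTF_8);

        String right = decodeAs(utf8, StandardCharsets.UTF_8);
        String wrong = decodeAs(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.length());            // 6
        System.out.println(wrong.length());            // 12: one char per byte
        System.out.println(right.indexOf('с') >= 0);   // true
        System.out.println(wrong.indexOf('с') >= 0);   // false: no Cyrillic chars survive
    }
}
```

Passing the charset explicitly whenever bytes are turned into text (and back) avoids depending on the platform default.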
 
Jesper de Jong
Java Cowboy
Sheriff
Posts: 16060
Can you post your non-working code? Because it really sounds like you're looking for a complicated way to do something simple...
 
Ernest Friedman-Hill
author and iconoclast
Sheriff
Posts: 24217
Greg Werner wrote:
No, that was the approach that was not working for me. I was trying to get the text character by character and compare each one with a String containing the entire alphabet. The char returned did not match my String.


Keep in mind the difference between bytes and chars. In a UTF-8 file, Cyrillic characters are going to be (mostly) 3 bytes each(*) -- but in a Java String, they're still one char each.

(*) UTF-8 encodes Unicode chars using one byte for ASCII, and 3 for most of the rest of the world's alphabets. Works great for the US, lousy for everyone else!
 
Greg Werner
Ranch Hand
Posts: 54
Ernest Friedman-Hill wrote:
In a UTF-8 file, Cyrillic characters are going to be (mostly) 3 bytes each(*) -- but in a Java String, they're still one char each.


No, they are 2 bytes; I saw this by doing .toByteArray() as mentioned. Let us just close this one: we are going in circles, and I am able to do what I need to do.
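For reference, the byte counts under discussion can be checked directly. The sketch below uses String.getBytes(Charset), which is the standard way to encode a String; the class name ByteCounts and the helper utf8Length are illustrative. Letters of the basic Cyrillic block take 2 bytes each in UTF-8, ASCII takes 1, and most CJK characters take 3.

```java
import java.nio.charset.StandardCharsets;

public class ByteCounts {
    /** Number of bytes the string occupies when encoded as UTF-8. */
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String word = "Москва";
        System.out.println(word.length());                                     // 6 chars
        System.out.println(utf8Length(word));                                  // 12 bytes, 2 per letter
        System.out.println(word.getBytes(StandardCharsets.UTF_16BE).length);   // 12 bytes, 2 per char
        System.out.println(utf8Length("Moscow"));                              // 6 bytes, 1 per letter
        System.out.println(utf8Length("東京"));                                 // 6 bytes, 3 per character
    }
}
```

So 2 bytes per Cyrillic letter is what both UTF-8 and UTF-16 produce, while the String itself still holds one char per letter.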
 
Campbell Ritchie
Marshal
Posts: 56546
That is because Java Strings are sequences of UTF-16 code units, so each Cyrillic letter is a single 2-byte char in memory.
 