
read and write multiple three character unicode words from a file

 
Ranch Hand
Posts: 146
The file read.txt contains multiple Unicode words with surrogate pairs. I am using k=3 in the for loop to pick out the three-letter words from read.txt. I am having trouble counting the surrogate pairs, i.e. counting the characters together with their diacritic marks, and I would like some ideas on how to count them. When I run my code I get this error:

java.lang.ArrayIndexOutOfBoundsException: 11
at pp.main(pp.java:35)
BUILD SUCCESSFUL (total time: 1 second)

 
Rancher
Posts: 4801
50

This is incrementing i in the outer loop.
However you are also incrementing i in the inner loop.

Now, arrayLength seems to have been set based on the number of codePoints?
I would stick some debugging in there as the value is likely incorrect.
Also stick some debugging around the value of i in these two loops.
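
The original loops are not shown in this thread, so the following is only a sketch of one common way to avoid advancing the index in two places: walk the string by code point in a single loop and let Character.charCount() decide how far to step. The class and method names (CodePointWalk, codePoints) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalk {
    // Splits a string into its code points; the index is advanced in exactly one place.
    static List<String> codePoints(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            out.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp); // 1 for a BMP code point, 2 for a surrogate pair
        }
        return out;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (stored as a surrogate pair in UTF-16), 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());           // 4 UTF-16 code units
        System.out.println(codePoints(s).size()); // 3 code points
    }
}
```

Because the step size comes from the code point itself, the index can never land in the middle of a surrogate pair or run past the end of the string.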
 
surya preethaaa
Ranch Hand
Posts: 146

I tried this code. It counts the characters correctly for English text, but it causes problems when the file contains Unicode words with surrogate pairs.
 
Marshal
Posts: 80867
505
Don't call close() on your readers; use try with resources instead. Don't call flush() and close() on your writers; use try with resources instead.
Doesn't String have a method for counting code points? Surely that will solve your problem with two chars to one Unicode character.
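
As a rough sketch of both suggestions together (not the original poster's program, which is not shown here): the reader and writer are opened in a try-with-resources statement, and words are selected with String.codePointCount() rather than length(). The names CodePointFilter, copyWords and filter are invented for this example, and it assumes one word per line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class CodePointFilter {
    // Copies every line that has exactly k code points from in to out.
    static void copyWords(Reader in, Writer out, int k) throws IOException {
        try (BufferedReader reader = new BufferedReader(in);
             BufferedWriter writer = new BufferedWriter(out)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (word.codePointCount(0, word.length()) == k) {
                    writer.write(word);
                    writer.write('\n');
                }
            }
        } // reader and writer are flushed and closed here, even if an exception is thrown
    }

    // In-memory convenience wrapper so the sketch runs without any files.
    static String filter(String text, int k) {
        StringWriter sink = new StringWriter();
        try {
            copyWords(new StringReader(text), sink, k);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.print(filter("HELLO\nHOW\nIS\nYOUR\nWORK\n", 3)); // prints HOW
    }
}
```

For real files, the same copyWords method could be given Files.newBufferedReader(path, StandardCharsets.UTF_8) and Files.newBufferedWriter(path, StandardCharsets.UTF_8). Note that codePointCount() only fixes the surrogate-pair case; combining marks are still counted as separate code points.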
 
surya preethaaa
Ranch Hand
Posts: 146
OK, then I will try this code using codePointCount().
 
Ranch Hand
Posts: 491
23
Hi surya, can you please explain a bit more about what you expect from the above program?
 
surya preethaaa
Ranch Hand
Posts: 146


In English:

If the file contains

HELLO
HOW
IS
YOUR
WORK

and I use if (stringArray[i].length() == 3), it writes the word HOW to the write.txt file.

With Tamil language files:

E.g. if a file contains the words

பலன்கள்
சூரியன்
தொழில்
பலன்

and I use if (stringArray[i].length() == 3), it should write the word பலன் to the write.txt file, but it throws an exception instead.

The main problem is that பலன் is counted as four characters, because the dot is counted separately. I want it to be counted as three characters.
 
Campbell Ritchie
Marshal
Posts: 80867
505
You need to remove punctuation from the text if you want to do that.
 
surya preethaaa
Ranch Hand
Posts: 146
பலன் is a single word in Tamil.
If the dot is removed, it becomes meaningless.
 
Bartender
Posts: 15743
368
You're moving into VERY treacherous territory.

There is no agreed upon meaning of the word "character". It was probably a mistake by the Java designers to call char a char, and to call Character a Character. They don't represent characters, they represent UTF-16 code units.

What is a character? Do you mean a code unit? A Unicode code point? A grapheme cluster?

ன் is one grapheme cluster. It consists of two unicode code points. There are NO surrogate pairs involved, so each of these code points is represented by ONE UTF-16 code unit.
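
A small sketch of the difference between code units and code points, using ன் and, for contrast, a code point outside the BMP. The class and helper names (UnitCounts, codeUnits, codePoints) are invented for this example; the emoji is only there to show what a real surrogate pair looks like.

```java
final class UnitCounts {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String nnna = "\u0BA9\u0BCD";  // TAMIL LETTER NNNA + TAMIL SIGN VIRAMA, one grapheme cluster
        String emoji = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair

        System.out.println(codeUnits(nnna) + " " + codePoints(nnna));   // 2 2
        System.out.println(codeUnits(emoji) + " " + codePoints(emoji)); // 2 1
    }
}
```

For the Tamil text both counts agree, so switching from length() to codePointCount() changes nothing there; only the surrogate-pair case differs.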

Sadly, I don't know how to easily count grapheme clusters in Strings using Java, or if there are good libraries to do this.
 
Stephan van Hulst
Bartender
Posts: 15743
368
You could try to use javax.swing.text.PlainDocument for this. Put your text in this document and see how many characters it reports.

I tested this, sadly it doesn't work.
 
Rancher
Posts: 5126
38
 
Campbell Ritchie
Marshal
Posts: 80867
505

Stephan van Hulst wrote:. . . There is no agreed upon meaning of the word "character". It was probably a mistake by the Java designers to call char a char, and to call Character a Character. They don't represent characters, they represent UTF-16 code units. . . .

They copied the name from C/C++; a char in C is different, only occupying 8 bits. When Java® was introduced, Unicode only went up to 16 bits, so things like code points are later additions which are much more awkward than they would have been if >16‑bit Unicode had been around then.
 
Stephan van Hulst
Bartender
Posts: 15743
368
Well, Unicode itself isn't much less awkward either. Character combining is done regardless of which UTF encoding you use.

Any method of String that accepts or returns an index or a length should be replaced by a bunch of methods that do the same, but make clear whether the method operates on code points or grapheme clusters. For instance, there should have been a codePointAt() method that returns a CodePoint, and a clusterAt() method that returns a GraphemeCluster.

It might make the API look more difficult and awkward to use, but that's a good thing, because at least it reflects the fact that String handling IS difficult and awkward.
 
surya preethaaa
Ranch Hand
Posts: 146
Sorry, I mistakenly called the grapheme cluster a surrogate pair. I am trying to count the Unicode characters without grapheme clusters.
 
Stephan van Hulst
Bartender
Posts: 15743
368
In that case, what is wrong with how it currently works?

பலன் consists of 4 code points without surrogate pairs. That means that length() will return 4, and that the string will not appear in your output file.

Why do you want it to appear there? Because ன் appears as one character? That's because it's one grapheme cluster, consisting of two code points.
 
surya preethaaa
Ranch Hand
Posts: 146
Actually, I am trying to read the three-letter words from a file and write those words into another file. In Tamil, பலன் is a single word; if I remove the dot, it becomes meaningless. So I want to count பலன் as 3 and write it into the other file (i.e. I want to count the words without considering their grapheme clusters).
 
surya preethaaa
Ranch Hand
Posts: 146
Some single words with their grapheme clusters:

பலன்கள்-ப,ல,ன்,க,ள்

சூரியன்-சூ,ரி,ய,ன்
 
surya preethaaa
Ranch Hand
Posts: 146
If fileA contains the words

பல
ராசி

and I want the two-letter words from fileA, this code just gets பல.
It does not get ராசி, which is also a two-letter word (i.e. ரா, சி).

 
surya preethaaa
Ranch Hand
Posts: 146
 
Stephan van Hulst
Bartender
Posts: 15743
368
Like I said, then you shouldn't use String.length(), because that counts UTF-16 code units; NOT Unicode code points, and also NOT grapheme clusters. You're interested in the number of grapheme clusters in the String.

There is no easy way to get the number of grapheme clusters in a String. You'll have to look for a library that gets these, or you'll have to write an algorithm to do this.
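
One possible hand-written approximation, offered as a sketch rather than a full grapheme cluster algorithm: count only the code points that are not combining marks (Unicode general categories Mn, Mc and Me), so that vowel signs and the dot (virama) attach to the preceding letter. The names LetterCount and countBaseCodePoints are made up for this example. It gives the counts asked for in this thread for words such as பலன் (3) and ராசி (2), but it does not implement the full segmentation rules of Unicode Standard Annex #29, so it can be wrong for other scripts, emoji sequences and similar cases.

```java
final class LetterCount {
    // Counts code points that are NOT combining marks (categories Mn, Mc, Me).
    // Assumption: each base letter plus its trailing marks is one "letter".
    static int countBaseCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int type = Character.getType(cp);
            boolean isMark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            if (!isMark) {
                count++;
            }
            i += Character.charCount(cp); // step over a surrogate pair as one unit
        }
        return count;
    }

    public static void main(String[] args) {
        String palan = "\u0BAA\u0BB2\u0BA9\u0BCD"; // PA, LA, NNNA + VIRAMA
        String raasi = "\u0BB0\u0BBE\u0B9A\u0BBF"; // RA + VOWEL SIGN AA, CA + VOWEL SIGN I
        System.out.println(palan.length() + " -> " + countBaseCodePoints(palan)); // 4 -> 3
        System.out.println(raasi.length() + " -> " + countBaseCodePoints(raasi)); // 4 -> 2
    }
}
```

Replacing the length() == 3 test with countBaseCodePoints(word) == 3 would then select பலன். The standard library also offers java.text.BreakIterator.getCharacterInstance() and, from Java 9, the \X regex construct; how closely they match the expected Tamil letter boundaries depends on the JDK version, so they are worth testing against real data before relying on them.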
 