
How to check for a non-English string

 
ravindra patil
Ranch Hand
Posts: 234
I have a file whose contents are in English and in other languages as well.
I want to write to a log file whenever I encounter a non-English string.
How can I do this check?
:roll:
 
Ernest Friedman-Hill
author and iconoclast
Sheriff
Posts: 24215
I don't think there's a perfect answer to this question, because many languages besides English are written using the same alphabet as English. I'm no expert on this topic, but I think the first thing I would try would be checking for Java char values greater than 127 (the upper limit of the ASCII set). For strings written entirely in non-Roman alphabets, this would work perfectly. For things like French or Spanish that use the larger Latin-1 set, some words would be entirely in ASCII, while others would include some characters in the range 128-255.
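
A minimal sketch of the char-value check, assuming the text has already been decoded correctly into a Java String. The class and method names (AsciiCheck, containsNonAscii) are made up for illustration, and the non-ASCII test strings are written as Unicode escapes so the source file itself stays pure ASCII.

```java
final class AsciiCheck {
    private AsciiCheck() {}

    /** Returns true if any char in the text lies outside the 7-bit ASCII range (0-127). */
    static boolean containsNonAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 127) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsNonAscii("plain English text"));  // false
        System.out.println(containsNonAscii("caf\u00e9"));           // true: e-acute is char 233
        System.out.println(containsNonAscii("\u65e5\u672c\u8a9e")); // true: Japanese characters
    }
}
```

As noted above, a false result only means "all ASCII", not "English": a French word with no accents would pass this check too.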

Can I ask you a question unrelated to your problem? I've often wondered why people use the " :roll: " smilie the way you've used it here. I think it's a cultural thing. To me, this usage seems really inappropriate; this smilie indicates sarcasm or disbelief, neither of which seems right here. But I think there must be another interpretation. Can you explain what you mean by using it?
 
Purushoth Thambu
Ranch Hand
Posts: 425
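To check whether the string contains any characters above the ASCII range (code 128 and up), you can use the regular expression class \p{ASCII}. It matches ASCII characters only, so its negation \P{ASCII} finds everything else. You should also consider Ernest's comments about handling Latin characters.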
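
A rough sketch of the regular-expression approach using java.util.regex. The class and method names are hypothetical; the only library features relied on are the POSIX class \p{ASCII} and its negated form \P{ASCII}.

```java
import java.util.regex.Pattern;

final class AsciiRegexCheck {
    // \P{ASCII} (capital P) matches any single character outside the ASCII range.
    private static final Pattern NON_ASCII = Pattern.compile("\\P{ASCII}");

    private AsciiRegexCheck() {}

    /** Returns true if at least one non-ASCII character occurs anywhere in the text. */
    static boolean containsNonAscii(String text) {
        return NON_ASCII.matcher(text).find();
    }

    /** Returns true if every character in the text is ASCII (an empty string counts as ASCII). */
    static boolean isAllAscii(String text) {
        return text.matches("\\p{ASCII}*");
    }
}
```

Both methods answer the same question from opposite directions; find() with the negated class can stop at the first offending character.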
 
Edwin Dalorzo
Ranch Hand
Posts: 961
Another option could be that you look up the words in a dictionary (aka word list). You could download different types of dictionaries here.

If a word is not found there, you could let the user decide whether it really is a non-English word. If it turns out to be English after all, the user can add it to the dictionary.
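
Something along these lines might work as a starting point. It is only a sketch: WordListCheck, addWord and unknownWords are hypothetical names, the word list is passed in rather than loaded from any particular dictionary file, and splitting on non-letters is a simplifying assumption (it breaks up words such as "don't").

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class WordListCheck {
    // Known words, stored in lower case so lookups ignore capitalization.
    private final Set<String> knownWords = new HashSet<>();

    WordListCheck(Iterable<String> initialWords) {
        for (String word : initialWords) {
            addWord(word);
        }
    }

    /** Adds a word the user has confirmed to be English. */
    void addWord(String word) {
        knownWords.add(word.toLowerCase(Locale.ROOT));
    }

    /** Returns the words of a line that are missing from the list, in their original order. */
    List<String> unknownWords(String line) {
        List<String> unknown = new ArrayList<>();
        // Split on runs of non-letter characters (spaces, digits, punctuation).
        for (String token : line.split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !knownWords.contains(token.toLowerCase(Locale.ROOT))) {
                unknown.add(token);
            }
        }
        return unknown;
    }
}
```

The words returned by unknownWords are the ones you would show to the user for confirmation before calling addWord.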
[ November 03, 2006: Message edited by: Edwin Dalorzo ]
 
Paul Clapham
Sheriff
Posts: 22185
In other words, we (and you) need a definition of what's meant by an "English string".
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Also, do you know what type of file encoding is used in this file? Checking for non-ASCII character values sort of assumes you've already successfully decoded the bytes in the file, which may require that you know whether it's using UTF-8 or Cp1252 or one of many, many other possibilities. So if you already know the answer to this question, great, but if not, you should probably try to answer it first. This is a good introduction if you are unfamiliar with the ideas of character sets and file encodings.
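
To make the decoding step concrete, here is a small sketch that reads a file with an explicitly named encoding instead of the platform default. It assumes Java 7 or later (java.nio.file) and that you already know the charset; EncodedFileReader and readLines are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class EncodedFileReader {
    private EncodedFileReader() {}

    /** Reads all lines of the file, decoding its bytes with the given charset. */
    static List<String> readLines(Path file, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

A call might look like readLines(path, StandardCharsets.UTF_8). If the bytes are not valid in the charset you name, this reader is expected to fail with an exception rather than silently substitute characters, which is usually what you want while you are still working out the encoding.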
 
ravindra patil
Ranch Hand
Posts: 234
Actually, I already tried using ASCII values. In my program I put one Japanese character and printed its ASCII value, and it printed 63.
When I try to print String s="someJapanesecharactesHere",
the output is ???. What does that mean?

My aim is to skip the lines in the file that are not in English. I am reading one line at a time, taking the first character, and comparing it with '?'. That works properly, but it is restricted to Japanese only. So why do I get the ASCII value 63?

And the :roll: I put here means that I am thinking about this topic, so please help me.
Thanks, and please reply soon.
 
Purushoth Thambu
Ranch Hand
Posts: 425
I believe you didn't read the link Jim posted. First, it's important to understand the encoding of the file you are processing. Without that, you won't succeed in your attempts.
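
One likely source of the 63, though nothing in this thread confirms it is what happened in your program: 63 is the ASCII code of '?', which Java substitutes when it converts a character to a charset that cannot represent it. A small sketch, with a hypothetical class name and the Japanese text written as Unicode escapes:

```java
import java.nio.charset.StandardCharsets;

final class QuestionMarkDemo {
    private QuestionMarkDemo() {}

    /** Encodes the text as US-ASCII; characters ASCII cannot represent are replaced by '?'. */
    static byte[] encodeAsAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String japanese = "\u65e5\u672c\u8a9e"; // three Japanese characters

        // Inside the String the real character value is intact.
        System.out.println((int) japanese.charAt(0)); // 26085

        // After encoding to ASCII, every character has become '?' (63).
        for (byte b : encodeAsAscii(japanese)) {
            System.out.println(b); // 63
        }
    }
}
```

So a '?' usually means the character was already lost in a conversion somewhere (reading, writing, or printing to a console), not that the original file contained a question mark. That is why comparing against '?' is unreliable.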
 
Jeroen T Wenting
Ranch Hand
Posts: 1847
A dictionary won't work. Many words exist in multiple languages, so skipping a line when a certain percentage of its words is missing from a list of known English words (however complete that list is) will still skip some lines that should be kept and keep others that should be skipped.
You'd also need some form of grammar check, and even that wouldn't yield certainty, as lines could contain a single word or a specific sequence of words that doesn't constitute a complete and valid sentence.

The language you want to filter being English, you're presented with even more problems.
The different local variations in English mean that things that are valid in, say, American English might trigger a checker that's defined for UK or Australian English, and a checker that allows for all of them would likely accept things that aren't proper English in any variation of the language.
 