
how to detect the file type

 
Ankit Doshi
Ranch Hand
Posts: 222
I am reading a text file that can contain English text, Thai text, or a combination of both. I have written the following code to read Thai characters from a text file:



This code works fine if I know that the file being read contains only Thai characters. If the file contains plain English characters, the last line of the code above would have to be


How can I detect whether the file being read contains English characters, Thai characters, or a combination of both?
 
Stefan Wagner
Ranch Hand
Posts: 1923
Linux Postgres Database Scala
If it is plain English text, it should not contain characters below 0x20 (32 decimal), except 0x09 (9, tab), 0x0a (10, LF) and 0x0d (13, CR).
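The control-character rule can be sketched in a few lines of Java. This is only an illustration: the class and method names (`ControlCharCheck`, `looksLikePlainText`) are made up for this example, and the check works on raw bytes before any decoding. Note that it only rules out binary data; bytes of 0x80 and above, which Thai text would produce in most encodings, pass this test untouched.

```java
import java.nio.charset.StandardCharsets;

class ControlCharCheck {
    /** Returns true if no byte is a control character other than tab, LF, or CR. */
    static boolean looksLikePlainText(byte[] data) {
        for (byte b : data) {
            int value = b & 0xFF; // treat the byte as unsigned
            boolean allowedWhitespace = value == 0x09 || value == 0x0A || value == 0x0D;
            if (value < 0x20 && !allowedWhitespace) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] text = "You're welcome\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] binary = {0x48, 0x00, 0x69}; // contains a NUL byte
        System.out.println(looksLikePlainText(text));
        System.out.println(looksLikePlainText(binary));
    }
}
```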

An English text should contain some common words such as 'I', 'the', 'he', 'she', ...
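A rough sketch of that common-word test, assuming the file has already been decoded into a `String`. The word list is a small hand-picked sample, not a vetted frequency list, and the names `CommonWordCheck` and `commonWordRatio` are invented for this example. What ratio counts as "English" is left open; it would have to be tuned on sample data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class CommonWordCheck {
    // Small, hand-picked sample of frequent English words (an assumption).
    static final Set<String> COMMON = new HashSet<>(Arrays.asList(
            "i", "the", "he", "she", "a", "and", "of", "to", "in", "is"));

    /** Fraction of word tokens in the text that appear in the COMMON set. */
    static double commonWordRatio(String text) {
        // Split on anything that is not an ASCII letter or apostrophe.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z']+");
        int total = 0;
        int hits = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue;
            }
            total++;
            if (COMMON.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        System.out.println(commonWordRatio("The cat and the dog"));
        System.out.println(commonWordRatio("You're welcome"));
    }
}
```

A very short text can score zero even though it is English, so the ratio is a hint rather than a verdict.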

I don't know which characters are most common in Thai.
The situation might get tricky for very small texts ("You're welcome").

You might find average letter-distribution statistics, or generate them from sample data, along with the frequencies of common letter combinations ('th', 'sh'). These might depend on the kind of text (technical description, science, prose, poems, jargon) and its age ("O rose, thou art sick")...
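Generating such statistics from sample data could start with a simple counter for letter pairs. The sketch below only collects the counts for a decoded `String`; comparing them against reference figures for English is left out, since no such figures are given here. `BigramStats` and `bigramCounts` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class BigramStats {
    /** Counts adjacent pairs of ASCII letters; pairs never span spaces or punctuation. */
    static Map<String, Integer> bigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String lower = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 1 < lower.length(); i++) {
            char first = lower.charAt(i);
            char second = lower.charAt(i + 1);
            if (isAsciiLetter(first) && isAsciiLetter(second)) {
                counts.merge("" + first + second, 1, Integer::sum);
            }
        }
        return counts;
    }

    static boolean isAsciiLetter(char c) {
        return c >= 'a' && c <= 'z';
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = bigramCounts("O rose, thou art sick");
        System.out.println(counts.getOrDefault("th", 0));
        System.out.println(counts.getOrDefault("sh", 0));
    }
}
```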

Could the text be none of the three: English, Thai, or a combination?
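If the file can be decoded correctly into a `String` in the first place (a big assumption, since the right charset is exactly what is unknown here), the Unicode script of each letter gives a direct answer, including the "none of the three" case. The sketch below uses `Character.UnicodeScript` from the standard library (Java 7 and later); `ScriptClassifier` and its `Kind` enum are made up for this example. It reports Latin script rather than English, because the script alone cannot tell English from other Latin-script languages.

```java
class ScriptClassifier {
    enum Kind { LATIN, THAI, MIXED, OTHER }

    /** Classifies text by the Unicode script of its letters; non-letters are ignored. */
    static Kind classify(String text) {
        int latin = 0;
        int thai = 0;
        int other = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, whitespace, combining marks
            }
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.LATIN) {
                latin++;
            } else if (script == Character.UnicodeScript.THAI) {
                thai++;
            } else {
                other++;
            }
        }
        if (other > 0 || (latin == 0 && thai == 0)) {
            return Kind.OTHER;
        }
        if (latin > 0 && thai > 0) {
            return Kind.MIXED;
        }
        return thai > 0 ? Kind.THAI : Kind.LATIN;
    }

    public static void main(String[] args) {
        String thaiHello = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35"; // a Thai greeting, written with escapes
        System.out.println(classify("hello"));
        System.out.println(classify(thaiHello));
        System.out.println(classify("hello " + thaiHello));
    }
}
```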
 