Regex with Unicode Text (Devanagari Script)

 
Greenhorn
I am posting this topic in the "Java in General" forum. I don't know if this is the right place for it.

I am using regular expressions on Devanagari script files (Unicode text).
Here is the program:

import java.awt.*;
import java.awt.event.*;
import java.io.*;
import java.nio.*;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.util.regex.*;
import javax.swing.*;

public class KonRegex extends JFrame implements ActionListener {

    Container cp;
    JTextField itxt;
    String kip = null;

    public KonRegex() {
        cp = getContentPane();
        cp.setLayout(new FlowLayout());

        itxt = new JTextField(15);
        cp.add(itxt);

        JButton b1 = new JButton("View");
        b1.addActionListener(this);
        b1.setActionCommand("View");
        cp.add(b1);

        addWindowListener(new WindowAdapter() {
            public void windowClosing(WindowEvent e) {
                setVisible(false);
                dispose();
                System.exit(0);
            }
        });

        setSize(500, 400);
        setVisible(true);
    }

    public void actionPerformed(ActionEvent e) {
        if ("View".equals(e.getActionCommand())) {
            kip = itxt.getText();
            // String has no getCharacterEncoding(); print the pattern text instead
            System.out.println("Pattern: " + kip);
            RegexMatch();
        }
    }

    // Find every match of the entered pattern in Kon.txt and show it as a label
    public void RegexMatch() {
        try {
            Pattern pat = Pattern.compile(kip, Pattern.CANON_EQ);
            Matcher match = pat.matcher(fileContent("Kon.txt"));
            while (match.find()) {
                cp.add(new JLabel(match.group()));
            }
            validate();
        } catch (IOException ioe) {
            System.out.println("Error in io");
        }
    }

    // Decode the whole file as UTF-16 into a CharSequence
    public CharSequence fileContent(String fname) throws IOException {
        // try-with-resources closes the channel first, then the stream
        try (FileInputStream f = new FileInputStream(fname);
             FileChannel fc = f.getChannel()) {
            ByteBuffer buf = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
            return Charset.forName("UTF-16").newDecoder().decode(buf);
        }
    }

    public static void main(String args[]) {
        KonRegex kt = new KonRegex();
    }
}

1. The program throws a PatternSyntaxException for conjuncts having 3 or more Devanagari chars combined, e.g. ध्वं, ल्ल्य
Am I doing something wrong in the program for this to happen?

2. Combined chars represented by multiple code points do not match certain regexes. For example, the expression स.र does not match प्रसार, as सा is a combination letter. Is there a way to handle these kinds of cases?

3. Any input on the above code, or any suggestions for related reading material, would be appreciated.
 
Ranch Hand
Welcome to JavaRanch!

First, please UseCodeTags when posting code.

Second, you mention a regex, but I can't find one in your post.

1. The program throws a PatternSyntaxException for conjuncts having 3 or more Devanagari chars combined, e.g. ध्वं, ल्ल्य
Am I doing something wrong in the program for this to happen?

We need to see the regex you used for this.
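
To narrow this down, it may help to compile the conjunct on its own, once without flags and once with Pattern.CANON_EQ, and to catch the PatternSyntaxException so that its description is visible. The sketch below is only an illustration: the class name ConjunctPatternCheck, its two helper methods and the sample text are made up for this example. The string it uses is ध्वं written as Unicode escapes. The CANON_EQ result may depend on the JDK version, so no output is shown for it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class ConjunctPatternCheck {

    /** Compiles the regex with the given flags; prints the reason and returns null on a syntax error. */
    public static Pattern tryCompile(String regex, int flags) {
        try {
            return Pattern.compile(regex, flags);
        } catch (PatternSyntaxException e) {
            System.out.println("Could not compile (flags=" + flags + "): " + e.getDescription());
            return null;
        }
    }

    /** Counts how many times an already compiled pattern is found in the text. */
    public static int countMatches(Pattern pattern, CharSequence text) {
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // dha + virama + va + anusvara, i.e. the conjunct from the question
        String conjunct = "\u0927\u094D\u0935\u0902";
        // hypothetical sample text: the conjunct surrounded by ASCII letters
        String text = "x" + conjunct + "y" + conjunct;

        Pattern plain = tryCompile(conjunct, 0);
        Pattern canon = tryCompile(conjunct, Pattern.CANON_EQ);

        if (plain != null) {
            System.out.println("No flags: " + countMatches(plain, text) + " match(es)");
        }
        if (canon != null) {
            System.out.println("CANON_EQ: " + countMatches(canon, text) + " match(es)");
        }
    }
}
```

If the version without flags compiles and the CANON_EQ version does not, the flag rather than the conjunct itself would be the first thing to look at.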

2. Combined chars represented by multiple code points do not match certain regexes. For example, the expression स.र does not match प्रसार, as सा is a combination letter. Is there a way to handle these kinds of cases?

Why? सा and स are two different letters, right? Why do you want to match them?
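
For what it is worth, in Unicode text सा is stored as two code points, the consonant स (U+0938) followed by the vowel sign ा (U+093E), and a plain "." in java.util.regex matches one code point. The sketch below shows this with a pattern compiled without any flags, and also shows the common \p{M}* idiom for allowing optional combining marks after a base letter. The class name CombiningMarkDemo and its helpers are made up for this example, and it does not try to reproduce whatever CANON_EQ or the file decoding do in the posted program.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CombiningMarkDemo {

    static final String SA = "\u0938";                                     // consonant sa
    static final String RA = "\u0930";                                     // consonant ra
    static final String PRASAR = "\u092A\u094D\u0930\u0938\u093E\u0930";   // the word from the question

    /** Returns true if the regex, compiled without flags, is found anywhere in the text. */
    public static boolean finds(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    /** Lists the code points of a string as U+XXXX values separated by spaces. */
    public static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // sa followed by the vowel sign aa: one visible letter, two code points
        System.out.println(describe(SA + "\u093E"));

        // "." consumes exactly one code point, here the vowel sign between sa and ra
        System.out.println(finds(SA + "." + RA, PRASAR));

        // \p{M}* allows zero or more combining marks after sa,
        // so the same pattern covers sa+ra and sa+aa+ra
        String saThenMarksThenRa = SA + "\\p{M}*" + RA;
        System.out.println(finds(saThenMarksThenRa, PRASAR));
        System.out.println(finds(saThenMarksThenRa, SA + RA));
    }
}
```

On Java 9 and later the regex construct \X (one extended grapheme cluster) is another option, though how it groups Devanagari conjuncts may vary with the Unicode version the JDK implements.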

any suggestions for related reading material appreciated.


RE & Unicode
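
On the file-reading part of the posted code: Java's "UTF-16" charset uses a byte-order mark if the file has one and otherwise assumes big-endian, so it is worth confirming how Kon.txt was actually saved. A minimal sketch of a strict read follows, assuming the whole file fits in memory. The class name StrictFileText, the method readStrict and the temporary file are made up for this example. Because a fresh decoder reports malformed input, a wrong charset guess tends to fail with an exception instead of silently producing garbage.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StrictFileText {

    /**
     * Reads the whole file and decodes it with the given charset.
     * A new decoder reports malformed or unmappable input, so bad bytes
     * raise a CharacterCodingException (a kind of IOException).
     */
    public static String readStrict(Path path, Charset charset) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return charset.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws IOException {
        // hypothetical sample file standing in for Kon.txt
        Path tmp = Files.createTempFile("kon-sample", ".txt");
        try {
            String sample = "\u092A\u094D\u0930\u0938\u093E\u0930";
            // encoding with UTF_16 writes a byte-order mark followed by big-endian data
            Files.write(tmp, sample.getBytes(StandardCharsets.UTF_16));

            String decoded = readStrict(tmp, StandardCharsets.UTF_16);
            System.out.println("Round trip ok: " + decoded.equals(sample));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

The returned String can be passed to Pattern.matcher directly, the same way the CharBuffer is in the original program.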
 