Regular expressions - Split text using any chars except letters

 
Ranch Hand
Posts: 66
Hello ,

Suppose I have a JTextArea where a user enters some text in any language, and I want to grab each word the user entered. So I want to split that text on any character that is not a letter, like this:
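The snippet from this post did not survive. Judging by the later replies about a-zA-Z, it was probably an ASCII-only split along the lines of the sketch below; the class name, method name, and exact pattern are assumptions for illustration, not the poster's code.

```java
import java.util.Arrays;

class AsciiSplitSketch {
    // Splits on any run of characters outside a-z and A-Z.
    // The pattern is an assumed stand-in for the missing snippet.
    static String[] words(String text) {
        return text.split("[^a-zA-Z]+");
    }

    public static void main(String[] args) {
        // English text splits as expected: [Hello, world, times]
        System.out.println(Arrays.toString(words("Hello, world! 42 times")));

        // "Grüße aus Köln", written with escapes so the source encoding does not matter.
        // The non-ASCII letters are treated as separators: [Gr, e, aus, K, ln]
        System.out.println(Arrays.toString(words("Gr\u00fc\u00dfe aus K\u00f6ln")));
    }
}
```

The second call shows the problem: ü, ß and ö are letters, but the ASCII range treats them as delimiters.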

This is okay for English text, but I also want it to work if the user enters German, Turkish, Arabic, or other text.
Is this possible?

Thanks.
 
Bartender
Posts: 1561
Why not just split on whitespace?
i.e.,


This will still leave punctuation marks present though.
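The example from this reply is missing, so here is a minimal sketch of a whitespace split, assuming the usual `\\s+` pattern; the class and method names are made up for illustration.

```java
import java.util.Arrays;

class WhitespaceSplitSketch {
    // Trims the text, then splits on runs of whitespace.
    // Punctuation stays attached to the neighbouring word.
    static String[] words(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Prints [Hello,, world!, How, are, you?]
        System.out.println(Arrays.toString(words("Hello, world!  How are you?")));
    }
}
```

Note that "Hello," and "world!" keep their punctuation, which is the limitation mentioned above.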
 
Hesham Gneady
Ranch Hand
Posts: 66
Yes, right ... but each language has its own punctuation marks, which I want to use in splitting the text too.
But if there is no other solution, then that's my second option.
 
Marshal
Posts: 74725
Have a look at the Java™ Tutorials section, particularly about the predefined character classes. You might be able to create a class for "not something" which might help.
 
Hesham Gneady
Ranch Hand
Posts: 66
Thanks, Campbell ... I've read it.
But sorry, I don't get it. What's the difference between what you're suggesting and the code example I posted:

 
Campbell Ritchie
Marshal
Posts: 74725
What about \\w or \\W? a-zA-Z only works for English; other languages use different alphabets.
 
Sheriff
Posts: 22575

Campbell Ritchie wrote:What about \\w or \\W? a-zA-Z only works for English; other languages use different alphabets.


\w is explicitly specified as "A word character: [a-zA-Z_0-9]". I tried it with é, but the text was split on that character.
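A small sketch of that experiment, with hypothetical class and method names. The second method uses `Pattern.UNICODE_CHARACTER_CLASS` (available since Java 7), which is not mentioned in this thread; with that flag \w follows the Unicode definition, but it still includes digits and underscores, so it is not a letters-only class.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class WordClassSketch {
    // Default \W: anything outside [a-zA-Z_0-9], so é counts as a separator.
    static String[] asciiWords(String text) {
        return text.split("\\W+");
    }

    // Same pattern, but with Unicode-aware predefined character classes.
    static String[] unicodeWords(String text) {
        return Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS).split(text);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 au lait"; // "café au lait", é as U+00E9
        System.out.println(Arrays.toString(asciiWords(text)));   // [caf, au, lait]
        System.out.println(Arrays.toString(unicodeWords(text))); // café stays in one piece
    }
}
```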
 
Hesham Gneady
Ranch Hand
Posts: 66
Well ... temporarily I did this:

So I'm splitting the text on any English non-letter character. But as I said, this is not nice: if the user enters a non-letter character from another language, such as German or Arabic, the text won't be split there.

Any ideas?
 
Rancher
Posts: 3742
As I very rarely use them, I don't know much about regexes, but I was amazed that they didn't support matching of non-English characters.
So I had a little browse around the web and I found this.
I don't know if it will solve your problem, but from a non-expert's reading of it, the Unicode Character Properties section looks like it may be useful.
 
Hesham Gneady
Ranch Hand
Posts: 66
Thanks, Joanne.
I guess we were unfair to regular expressions; this code does the job:
\p{L} matches any Unicode letter.
For more info: Regular Expressions
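The code itself is not preserved in this post. A minimal sketch of a split built on \p{L}, assuming the negated form `\P{L}` is used as the delimiter (class and method names are illustrative):

```java
import java.util.Arrays;

class UnicodeLetterSplitSketch {
    // \P{L} (capital P) matches any code point that is NOT a Unicode letter,
    // so the text is split on runs of non-letters in any script.
    static String[] words(String text) {
        return text.split("\\P{L}+");
    }

    public static void main(String[] args) {
        // German "Grüße aus Köln, 2 mal!" written with escapes:
        // the digit and punctuation are dropped, the umlauts stay inside the words.
        System.out.println(Arrays.toString(words("Gr\u00fc\u00dfe aus K\u00f6ln, 2 mal!")));

        // Two Arabic words separated by a space give two tokens.
        System.out.println(words("\u0645\u0631\u062d\u0628\u0627 \u0628\u0643").length); // 2
    }
}
```

If the text starts with a non-letter, `split` returns an empty first element, so callers may want to skip empty strings.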

Thanks
 
Campbell Ritchie
Marshal
Posts: 74725
Damn! \\w wouldn't work. But well done, Joanne.
 
Joanne Neal
Rancher
Posts: 3742
Actually, reading through the Unicode Character Properties section of that link, it appears accented characters can be either one or two Unicode code points.
To be absolutely sure of a match, I think you would also need to allow for a letter followed by a non-spacing mark.
 
Hesham Gneady
Ranch Hand
Posts: 66

To be absolutely sure of a match, I think you would also need to allow for a letter followed by a non-spacing mark.


Let me check that I understand you correctly. Do you mean a word like this:
"tt/" or "tté3"

Right?
 
Joanne Neal
Rancher
Posts: 3742

Again, "character" really means "Unicode code point". \p{L} matches a single code point in the category "letter". If your input string is à encoded as U+0061 U+0300, it matches a without the accent. If the input is à encoded as U+00E0, it matches à with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category "letter", while U+0300 is in the category "mark".



So, if the à is encoded as U+0061 U+0300, then the first character is the letter a and the second character is not a letter (it is a mark), and so it could break any regex that is looking for a string of letters.
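A short sketch of that effect, assuming the text is split on `\P{L}+` (runs of non-letters); the class and method names are hypothetical.

```java
import java.util.Arrays;

class CombiningMarkSketch {
    // Splits on runs of code points that are not in the Unicode "letter" category.
    static String[] lettersOnly(String text) {
        return text.split("\\P{L}+");
    }

    public static void main(String[] args) {
        // à as two code points: U+0061 (letter a) followed by U+0300 (combining grave, a mark).
        String decomposed = "e\u0061\u0300e";
        // à as one precomposed code point: U+00E0 (a letter).
        String precomposed = "e\u00e0e";

        // The mark is treated as a delimiter, so the word breaks in two: [ea, e]
        System.out.println(Arrays.toString(lettersOnly(decomposed)));

        // The precomposed form stays one word.
        System.out.println(lettersOnly(precomposed).length); // 1
    }
}
```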
 
Hesham Gneady
Ranch Hand
Posts: 66
Sorry for the late reply, Joanne.
I think I understand you ... if we have a word "eàe" where à = U+0061 U+0300, then it will be split into the words "ea" and "e" ... right?

If so, I think this will do the job:
I haven't tried this one, but I think it'll do the job. What do you think, Joanne?
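The pattern proposed in this post is not preserved. Two possible approaches are sketched below, both assumptions rather than the poster's code: treating marks (\p{M}) as part of a word, or normalizing the text to NFC with `java.text.Normalizer` before splitting on non-letters. The class and method names are illustrative.

```java
import java.text.Normalizer;
import java.util.Arrays;

class LetterAndMarkSplitSketch {
    // Option 1: split on anything that is neither a letter nor a mark,
    // so a combining accent stays attached to its base letter.
    static String[] words(String text) {
        return text.split("[^\\p{L}\\p{M}]+");
    }

    // Option 2: compose base letter + mark into one code point where a
    // precomposed form exists (NFC), then split on non-letters.
    static String[] normalizedWords(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC).split("\\P{L}+");
    }

    public static void main(String[] args) {
        String text = "e\u0061\u0300e two"; // à as U+0061 U+0300

        // Both calls return two words; the first word is no longer broken at the accent.
        System.out.println(words(text).length);           // 2
        System.out.println(normalizedWords(text).length); // 2
        System.out.println(Arrays.toString(normalizedWords(text)));
    }
}
```

Option 2 only helps where Unicode defines a precomposed character; for other letter and mark combinations, Option 1 is the safer choice.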
 
Joanne Neal
Rancher
Posts: 3742

Hesham Gneady wrote: Sorry for the late reply, Joanne.
I think I understand you ... if we have a word "eàe" where à = U+0061 U+0300, then it will be split into the words "ea" and "e" ... right?



Yes

Hesham Gneady wrote: If so, I think this will do the job:
I haven't tried this one, but I think it'll do the job. What do you think, Joanne?



As I said earlier, I'm not a regex expert. Best way to find out if it's right is to test it.
 