
Regular expression pattern for Non-Ascii characters

 
Raghu Sha
Ranch Hand
Posts: 122
How do I write a regex pattern to find non-ASCII characters in input?
This includes TAB, "", punctuation, etc.
 
Jesper de Jong
Java Cowboy
Saloon Keeper
Pie
Posts: 15364
Tab and punctuation are certainly ASCII characters. So, what do you really mean by "non-ASCII" characters? You have to be precise if you want a good answer.
 
Raghu Sha
Ranch Hand
Posts: 122
Thanks.
First we can write a regex for the allowable characters below.
The remaining ones are non-ASCII.

Ascii characters
Char >= 32 && Char <= 255

Country specific allowable characters
0x15E,0x15F,0x162,0x163,0x102,0x103,0xCE,0xEE,0xC2,0xE2
 
Richard Tookey
Bartender
Posts: 1166
Raghu Sha wrote:
Char >= 32 && Char <= 255


The ASCII character set does not include values greater than 127, and it does include characters less than 32, so it sounds like you don't actually mean ASCII.

Also, your last post seems to contradict your first post. Do you want to extract from a String the characters that are in your specified set, or to remove from a String the characters that are in your specified set?
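
For reference, the strict ASCII range described above (code points 0 to 127) can be written as a negated character class. This is only a minimal sketch; the class and method names (`AsciiCheck`, `containsNonAscii`) are made up for illustration and do not come from the thread:

```java
import java.util.regex.Pattern;

final class AsciiCheck {
    // Strict ASCII is code points 0x00-0x7F; anything outside that is non-ASCII.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    // Returns true if the input holds at least one non-ASCII character.
    static boolean containsNonAscii(String input) {
        return NON_ASCII.matcher(input).find();
    }

    public static void main(String[] args) {
        // Tab and punctuation are ASCII, so nothing is flagged here.
        System.out.println(containsNonAscii("tab\there, punctuation!")); // false
        // U+00E9 (e with acute accent) is above 0x7F.
        System.out.println(containsNonAscii("caf\u00E9"));               // true
    }
}
```

Note that tab (0x09) passes this check, which is why the range "32 to 255" in the earlier post cannot be what "ASCII" means.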
 
Raghu Sha
Ranch Hand
Posts: 122
Thanks Richerd.
Sorry for the confusing requirement.

Need to filter non-ASCII characters from user input using a country-specific regex pattern.
The application supports multiple countries.

Could you please tell us your design approach for achieving this?

 
fred rosenberger
lowercase baba
Bartender
Posts: 12146
Raghu Sha wrote: Need to filter non-ASCII characters from user input using a country-specific regex pattern.
The application supports multiple countries.

Your REQUIREMENT is to use a regex? That is not a good requirement. A requirement should tell you what you need to accomplish, not dictate HOW you do it. I would go back to whoever wrote that spec and tell them to try again.
 
Paul Clapham
Sheriff
Posts: 21133
I suspect the whole requirement is bogus, not just the part which requires the use of a regex. I suspect it's going to prevent me from using the characters é or ™ because I'm in an English-language environment and everybody knows that you don't use those characters in English.

But generally we can't control requirements given to us by higher-ups, and if the requirement is actually bogus there's nothing we can do about that. So my approach would be to ignore anything referring to "ASCII", since that appears to be a red herring, and just get a list of permitted characters for each language. It's easy enough to write a regex to match a list of characters -- even a regex klutz like me should be able to do it.
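
One way to sketch the "list of permitted characters for each language" idea is below. Everything here is an assumption for illustration: the class name `PermittedChars`, the country keys, the choice of printable ASCII (0x20 to 0x7E) as the common base, and the use of `Map.of` (Java 9 or later). Only the "RO" code points are taken from this thread.

```java
import java.util.Map;
import java.util.regex.Pattern;

final class PermittedChars {
    // Hypothetical per-country extras; "RO" uses the code points posted in this thread.
    private static final Map<String, String> EXTRA = Map.of(
        "RO", "\u015E\u015F\u0162\u0163\u0102\u0103\u00CE\u00EE\u00C2\u00E2",
        "US", "");

    // Builds a pattern that matches any single character NOT permitted for the country.
    // Assumed base set: printable ASCII, 0x20-0x7E. Unknown countries get the base set only.
    static Pattern disallowed(String country) {
        StringBuilder cls = new StringBuilder("[^\\x20-\\x7E");
        for (char c : EXTRA.getOrDefault(country, "").toCharArray()) {
            // Emit each extra character as a four-digit regex escape.
            cls.append(String.format("\\u%04X", (int) c));
        }
        return Pattern.compile(cls.append(']').toString());
    }
}
```

With this shape, supporting another country means adding one map entry rather than writing a new regex by hand.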
 
Raghu Sha
Ranch Hand
Posts: 122
How do I achieve this?
Please give a design approach.

 
Ivan Jozsef Balazs
Rancher
Posts: 979
Raghu Sha wrote:
Please give a design approach.


You seem to have ignored the design hints already given.


Ascii characters
Char >= 32 && Char <= 255

Country specific allowable characters
0x15E,0x15F,0x162,0x163,0x102,0x103,0xCE,0xEE,0xC2,0xE2


What about this regexp?

^[ -\u00FF\u015E\u015F\u0162\u0163...]*$

That is:
"start of string, then any number of characters, each either in the range from space to 0xFF or in the list of the 'country specific allowable characters', and then the end of the string."
(I was too lazy to write them all out; the three dots stand for the rest of the list.)
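
A sketch of how that pattern could be used for validation in Java, assuming the three dots stand for the remaining code points in the quoted list (0x102 and 0x103; the other four, 0xCE, 0xEE, 0xC2 and 0xE2, already fall inside the space-to-0xFF range). The names `RomanianInputCheck` and `isValid` are hypothetical:

```java
import java.util.regex.Pattern;

final class RomanianInputCheck {
    // Space..U+00FF plus the listed code points above U+00FF.
    // U+00CE, U+00EE, U+00C2 and U+00E2 are already covered by the range.
    private static final Pattern ALLOWED =
        Pattern.compile("^[ -\\u00FF\\u015E\\u015F\\u0162\\u0163\\u0102\\u0103]*$");

    // True when every character of the input is in the allowed set.
    static boolean isValid(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Bucure\u015Fti")); // true
        System.out.println(isValid("tab\there"));      // false, tab is below space
    }
}
```

Because `matches()` already requires the whole input to match, the `^` and `$` anchors are redundant here; they are kept only to mirror the pattern as posted.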

 
Ivan Jozsef Balazs
Rancher
Posts: 979
Country specific allowable characters


Which country is it? Romania?
 
Raghu Sha
Ranch Hand
Posts: 122
Yes it is Romania.
 
Richard Tookey
Bartender
Posts: 1166
Raghu Sha wrote:Yes it is Romania.


But we still don't know your requirement! We don't know whether you want to remove the invalid characters, create a set of the invalid characters contained in your input, or simply report whether or not the input has invalid characters. Obviously the regexes for these three requirements are very closely related, but not necessarily the same.

So, what is your input and what is your desired output?
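
To make the three possible requirements concrete, here is a rough sketch of each one built on a single pattern. The definition of "invalid" used here (anything outside printable ASCII, 0x20 to 0x7E) is purely an assumption for the example, as are the names `InvalidChars`, `remove`, `collect` and `hasInvalid`:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InvalidChars {
    // Assumption for illustration: "valid" means printable ASCII, U+0020..U+007E.
    private static final Pattern INVALID = Pattern.compile("[^\\x20-\\x7E]");

    // 1. Remove the invalid characters.
    static String remove(String input) {
        return INVALID.matcher(input).replaceAll("");
    }

    // 2. Collect the distinct invalid characters, in order of first appearance.
    static Set<String> collect(String input) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = INVALID.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // 3. Report whether any invalid character is present.
    static boolean hasInvalid(String input) {
        return INVALID.matcher(input).find();
    }
}
```

All three share the same character class; what changes is the `Matcher` operation applied to it, which is why the desired output has to be pinned down first.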
 
Ivan Jozsef Balazs
Rancher
Posts: 979
Raghu Sha wrote:Yes it is Romania.


I happen to live in a neighbouring country, and though I do not speak Romanian, I somehow recognized the letters.
Are you sure about the requirement?
In texts, or at least in people's names, other characters might also occur, given that people with other mother tongues
(which use different extension letters to the Latin alphabet) also live there.

Romania used the Cyrillic alphabet for a while and (albeit a country of Orthodox faith) switched to Latin later.
 
Raghu Sha
Ranch Hand
Posts: 122
@Richerd.

It should filter non-ASCII characters from user input.
If the user enters non-ASCII characters, they shouldn't reach the data/service layer (remove those non-ASCII chars).

Thanks
 
Richard Tookey
Bartender
Posts: 1166
Raghu Sha wrote:@Richerd.


Err..... Richard.


It should filter non-ASCII characters from user input.
If the user enters non-ASCII characters, they shouldn't reach the data/service layer (remove those non-ASCII chars).

Thanks


I thought this ASCII requirement had been discarded, since you have agreed that you don't actually mean ASCII! As far as I can see, you still have not defined the actual set of characters you wish to keep or the characters you wish to discard.

You need to use the String.replaceAll() method or the java.util.regex.Matcher.replaceAll() method. You need to spend some time learning about regular expressions in general and regular expressions in Java. Take a look at http://www.regular-expressions.info/tutorial.html and http://docs.oracle.com/javase/tutorial/essential/regex/.

P.S. Once you have defined the character set, the regex you need is trivial.
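
As an illustration of the two methods named above, here is a sketch that assumes the character set to keep is the one posted earlier in this thread (space to 0xFF plus the listed Romanian letters). That set is an assumption, since the thread never settles the definition, and the names `InputFilter`, `stripOnce` and `strip` are hypothetical:

```java
import java.util.regex.Pattern;

final class InputFilter {
    // Negation of the character set posted in this thread:
    // U+0020..U+00FF plus the listed letters above U+00FF.
    private static final String NOT_ALLOWED =
        "[^ -\\u00FF\\u015E\\u015F\\u0162\\u0163\\u0102\\u0103]";
    private static final Pattern NOT_ALLOWED_PATTERN = Pattern.compile(NOT_ALLOWED);

    // One-off call: String.replaceAll compiles the regex on every invocation.
    static String stripOnce(String input) {
        return input.replaceAll(NOT_ALLOWED, "");
    }

    // Repeated calls: reuse the compiled Pattern via Matcher.replaceAll.
    static String strip(String input) {
        return NOT_ALLOWED_PATTERN.matcher(input).replaceAll("");
    }
}
```

Both methods return the same result; the precompiled variant is usually the better fit when every request's input passes through the filter.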
 