How to handle all special characters using regular expressions in Java?

 
Barnabas Jeremiah
Hi,
We have a problem caused by special characters in French and in some other languages, such as Chinese. We extract data from a database into a .txt file and send it to a third-party server. The maximum file size is 928 KB, but the special characters in the data increase the file size and the server rejects the .txt file. So I have decided to handle these special characters, and reduce the file size, using a Java regular expression before creating the .txt file.
Can anyone explain how to handle these special characters using a Java regular expression, or any other method in Java?

Thanks in advance.
Jeremiah.
 
Richard Tookey
I'm a great fan of regular expressions, but I'm not convinced that they are what is needed here. You seem to have decided on a solution without actually understanding the problem. I suspect you need to be more explicit about what you will keep, rather than loosely saying you will reject 'special characters'. By 'special characters', do you mean any character that is not printable ASCII, i.e. not in the range 0x20 to 0x7E inclusive? If so, the regex is trivial, so I don't understand the problem you are having.
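For readers following along, here is a minimal sketch of the kind of regex being described, assuming "special" really does mean "outside 0x20 to 0x7E". The class and method names are made up for illustration. Note that this range also excludes tab and newline, so the pattern would strip line breaks unless they are added to the character class.

```java
import java.util.regex.Pattern;

final class PrintableAsciiRegex {
    // Matches any single char outside printable ASCII (0x20 to 0x7E inclusive).
    // Tab (0x09) and newline (0x0A) are outside that range, so they match too.
    private static final Pattern NON_PRINTABLE_ASCII =
            Pattern.compile("[^\\x20-\\x7E]");

    // Removes every matching char. This is lossy: accented letters vanish.
    static String strip(String input) {
        return NON_PRINTABLE_ASCII.matcher(input).replaceAll("");
    }

    // Reports whether the input contains at least one matching char.
    static boolean containsNonPrintableAscii(String input) {
        return NON_PRINTABLE_ASCII.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(strip("caf\u00e9 au lait"));          // prints "caf au lait"
        System.out.println(containsNonPrintableAscii("plain")); // prints "false"
    }
}
```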
 
Ulf Dittmer
In addition to what Richard said, removing non-ASCII characters (which is what I understand you to mean when you say "handle") would corrupt the data. We don't know enough to say for sure, but a better solution would seem to be a switch to a more inclusive file format (possibly using UTF-8 encoding, which is widely used for non-ASCII data) and upping that (arbitrary?) file size limit.
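As a side note on why the file grows: assuming the file is written as UTF-8 (the thread does not confirm the actual encoding), each non-ASCII character takes more than one byte. A small sketch, with a hypothetical helper name, that measures the encoded size:

```java
import java.nio.charset.StandardCharsets;

final class EncodedSize {
    // Number of bytes the text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("e"));      // ASCII letter: prints 1
        System.out.println(utf8Length("\u00e9")); // e with acute accent: prints 2
        System.out.println(utf8Length("\u4e2d")); // a Chinese character: prints 3
    }
}
```

Measuring the byte count before writing the file at least shows how far over the limit the extract is.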
 
Barnabas Jeremiah
Yes. I meant any character that is not in the printable ASCII range.
 
Richard Tookey
Barnabas Jeremiah wrote: Yes. I meant any character that is not in the printable ASCII range.


The regular expressions for both the printable ASCII set and the non-printable ASCII set are very simple, and a short time spent studying regular expressions should allow you to write them. For Java regular expressions, take a look at http://docs.oracle.com/javase/tutorial/essential/regex/ and the Javadoc for class java.util.regex.Pattern. There is a good general regular expressions tutorial at http://www.regular-expressions.info/tutorial.html . In my view the regular expressions bible is "Mastering Regular Expressions" by Jeffrey Friedl, published by O'Reilly. The Oracle regular expressions tutorial will show you how regular expressions can be used in Java; you just have to choose the method that suits your problem.

Of course if you decide that the use of regular expressions to solve this problem is not worth the effort of learning about regular expressions then the problem is trivial using simple character manipulation.
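A rough sketch of what "simple character manipulation" could look like, with no regex involved. The names are hypothetical, and the same caveat applies as with any filter on 0x20 to 0x7E: tabs, newlines and every non-ASCII letter are dropped.

```java
final class PrintableAsciiFilter {
    // Copies only chars in the printable ASCII range 0x20 to 0x7E (inclusive).
    static String keepPrintableAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= 0x20 && c <= 0x7E) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepPrintableAscii("na\u00efve r\u00e9sum\u00e9")); // prints "nave rsum"
    }
}
```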

 
Winston Gutkowski
Barnabas Jeremiah wrote: Yes. I meant any character that is not in the printable ASCII range.

So presumably that would include 'é'.

And since it's one of the most common "letters" in French, you'd better know exactly HOW to "handle" it before you write one line of Java code, otherwise you'll end up with complete gibberish.
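One possible reading of "handle", offered only as an assumption since the thread never settles it, is to fold accented Latin letters to their base letters instead of deleting them. The sketch below uses java.text.Normalizer for that; the class and method names are illustrative. It does nothing for Chinese text, which has no ASCII equivalent, and whether the result is acceptable is a business decision.

```java
import java.text.Normalizer;

final class AccentFolding {
    // Decomposes accented letters (NFD) and then drops the combining marks,
    // so "\u00e9" becomes "e". Characters with no decomposition are left as-is.
    static String foldAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(foldAccents("\u00e9migr\u00e9")); // prints "emigre"
    }
}
```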

And to be honest I suspect, like Richard, that regexes are not in fact what you want.

Winston
 
Winston Gutkowski
Barnabas Jeremiah wrote: We have a problem caused by special characters in French and in some other languages, such as Chinese. We extract data from a database into a .txt file and send it to a third-party server. The maximum file size is 928 KB.

And, just to be clear, THAT would seem to me to be your problem, not the fact that your database contains multi-lingual characters. It's an arbitrary limit set by someone else that wants to use your text, and they're expecting you to do their work for them.

First off: 928k is a LOT of text. About a short novel's worth, actually.

Second: Have you thought about alternative strategies like splitting it up, or indeed compressing it?
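To illustrate the compression option, here is a minimal sketch using the JDK's java.util.zip classes. It assumes the receiving side could accept gzip data, which the thread does not establish, and the names are hypothetical. `readAllBytes` needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipText {
    // Encodes the text as UTF-8 and gzips it in memory.
    static byte[] compress(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reverses compress(): gunzips and decodes as UTF-8.
    static String decompress(byte[] data) {
        try (GZIPInputStream gzip =
                     new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("Les \u00e9migr\u00e9s arrivent. ");
        }
        String text = sb.toString();
        int rawBytes = text.getBytes(StandardCharsets.UTF_8).length;
        System.out.println("raw bytes:     " + rawBytes);
        System.out.println("gzipped bytes: " + compress(text).length);
    }
}
```

Nothing is lost this way: the round trip returns the original text, accents included.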

Don't decide on a solution before you understand the problem.

Winston
 
Barnabas Jeremiah
Thanks for your reply. No, I don't want to split or compress it. Yes, it is a lot of data for the third-party server, which scans all the data in the file. If it finds any character that takes 2 bytes, it rejects the upload.
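If the server really rejects any multi-byte character, a pre-flight check on the sending side can at least show which records would trip it. The sketch below assumes the file is UTF-8, where exactly the characters above 0x7F take more than one byte; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class AsciiPreflight {
    // True if every char of the text can be encoded as 7-bit US-ASCII.
    static boolean isPureAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    // Index of the first char above 0x7F, or -1 if there is none.
    static int firstNonAsciiIndex(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(isPureAscii("plain text"));          // prints "true"
        System.out.println(firstNonAsciiIndex("caf\u00e9 noir")); // prints 3
    }
}
```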
 
Paul Clapham
So is this server just scanning for size, or is it actively prejudiced against those particular characters?

I can't believe that it's rejecting non-ASCII characters in documents written in languages which use them. Imagine a server which wouldn't let you upload English documents containing the letter "y". Would you use it? I certainly wouldn't. So under the assumption that it doesn't accept anything larger than X bytes, I would suggest just truncating the documents to that size, rather than butchering them.
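A sketch of the truncation idea, assuming UTF-8 output and a byte budget passed in by the caller (the helper name is hypothetical). It cuts only on character boundaries, so a multi-byte character is never split in half.

```java
final class Utf8Truncate {
    // Returns the longest prefix of text whose UTF-8 encoding fits in maxBytes,
    // cutting only between whole code points.
    static String truncate(String text, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            // UTF-8 length of one code point: 1 to 4 bytes.
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break;
            }
            bytes += size;
            i += Character.charCount(cp);
        }
        return text.substring(0, i);
    }

    public static void main(String[] args) {
        // "caf" is 3 bytes; the accented e needs 2 more, which exceeds 4.
        System.out.println(truncate("caf\u00e9", 4)); // prints "caf"
    }
}
```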
 
Winston Gutkowski
Barnabas Jeremiah wrote: Thanks for your reply. No, I don't want to split or compress it. Yes, it is a lot of data for the third-party server, which scans all the data in the file. If it finds any character that takes 2 bytes, it rejects the upload.

Then you have a problem Barnabas, because you've plainly already decided on a solution (which isn't working) and won't accept any other possible way of doing things, so what you're actually asking us is:
How do I get MY solution (and only MY solution) to work?

The problem is NOT with your data, but with somebody else's arbitrary (and, I would say, stupid) requirements. Your data was plainly designed for multi-lingual text and large volumes, so why on earth is someone who is parsing it rejecting (a) multi-lingual characters, and (b) downloads above a certain size? Seems insane to me...

The only other possibility I can think of is to filter the output so that they only receive English documents, but even then, you might run into foreign words like "émigré", which have made it into the English language. Not to mention documents which simply ARE bigger than 928kb.

Like I say, it sounds to me like somebody is trying to get you (or your company) to do their work for them.
If you're not prepared to take any of the suggestions you've already been given, I'm not quite sure how we can help.

Winston
 