
regex Pattern class and spaces

 
Rajagopal Manohar
Ranch Hand
Posts: 183
Hi,

I was using the regex package to format data read from an Excel spreadsheet. I wanted to collapse each run of consecutive whitespace characters into a single space, so I used the pattern "\\s{2,}" and replaced each match with " ".

But I found that it only partially worked. While debugging, I realised that some of the data used the Unicode "no-break space" character, whose int value is 160 and which the \s class does not match.

So I had to do something clumsy like
char space = 160;
pattern = "[\\s" + space + "]{2,}";

Now what I cannot understand is why the \s class does not include this space (char 160). And how do I know that tomorrow, if I read data from another file system on another platform, I will not encounter yet another space character? Does this not make my code platform dependent? (Otherwise \s should have handled all possible white spaces.)

Just a thought; I am sure there is a better explanation.

regards,
Rajagopal
 
Stefan Wagner
Ranch Hand
Posts: 1923
Linux Postgres Database Scala
ASCII(160) isn't always a kind of space.
On DOS it is á, AFAIK.
 
Rajagopal Manohar
Ranch Hand
Posts: 183
Originally posted by Stefan Wagner:
ASCII(160) isn't always a kind of space.
On DOS it is á, AFAIK.


Does that mean that when I see a " " on screen on one platform, save it, and read it on another platform, I will see a "á"? Isn't that strange behaviour?

Does Java not promise platform independence? Is there no way to guarantee a common interpretation on all platforms?

PS: Forgive my ignorance, but I thought that in Java everything was converted to Unicode. Apparently I am wrong.
[ May 16, 2005: Message edited by: Rajagopal Manohar ]
 
Alan Moore
Ranch Hand
Posts: 262
Yes, Java uses Unicode internally, so character 160 (U+00A0) will always be a non-breaking space as far as Java is concerned. To match it, just use the Unicode escape for the character, \u00A0, in the pattern.

If you're normalizing the whitespace, shouldn't you also be converting single linefeeds, tabs, NBSP's, etc. into space characters? That is, the pattern should match any two or more consecutive whitespace characters, or any single whitespace character that isn't a space (ASCII 32).
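
The code snippets that originally accompanied this post are not preserved here, so the following is only a rough sketch of the idea as described, not the original code. The class name WhitespaceNormalizer and the method normalize are made up for illustration. It assumes that in Java's default mode \s covers only space, tab, newline, vertical tab, form feed and carriage return, which is why \u00A0 is listed explicitly.

```java
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
public class WhitespaceNormalizer {

    // First alternative: two or more whitespace characters (including NBSP).
    // Second alternative: one whitespace character that is not a plain space.
    private static final Pattern WHITESPACE = Pattern.compile(
            "[\\s\\u00A0]{2,}|[\\t\\n\\x0B\\f\\r\\u00A0]");

    // Replaces every match with a single ASCII space (char 32).
    public static String normalize(String input) {
        return WHITESPACE.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Two NBSPs, a tab, two spaces and a newline all become single spaces.
        System.out.println(normalize("a\u00A0\u00A0b\tc  d\ne")); // prints "a b c d e"
    }
}
```

A single plain space is left untouched because neither alternative matches it. On Java 7 and later, compiling the pattern with Pattern.UNICODE_CHARACTER_CLASS should make \s follow the Unicode White_Space property, which includes U+00A0; the sketch above does not rely on that flag.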
 
Rajagopal Manohar
Ranch Hand
Posts: 183
If you're normalizing the whitespace, shouldn't you also be converting single linefeeds, tabs, NBSP's, etc. into space characters?


I guess yes.
Thanks
Rajagopal