
How to identify white space characters

 
bob connolly
Ranch Hand
Posts: 204
Hi there!

Is there a class that will read one character at a time and print out the ASCII value of each white space or invisible character it finds, i.e. \n, \r, \t, \f, etc.?

Thanks much
 
James Swan
Ranch Hand
Posts: 403
Yeah it's a specialised class called FunWithChars
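
The code originally attached to this post is not preserved here. As a rough sketch of what a class like that could look like (the method name describeInvisible and the output format are made up for illustration, not taken from the original), it walks a string one character at a time and reports the numeric code of every whitespace or control character:

```java
import java.util.ArrayList;
import java.util.List;

class FunWithChars {
    // Hypothetical helper: returns one entry per whitespace or control
    // character in the text, giving its index and numeric code.
    static List<String> describeInvisible(String text) {
        List<String> report = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c) || Character.isISOControl(c)) {
                // Widening a char to int yields its code (ASCII for 0-127).
                report.add("index " + i + ": code " + (int) c);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Sample input containing tab, carriage return, line feed and form feed.
        for (String line : describeInvisible("a\tb\r\nc\f")) {
            System.out.println(line);
        }
    }
}
```

For the sample string this should report codes 9, 13, 10 and 12 at indexes 1, 3, 4 and 6.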



 
Corey McGlone
Ranch Hand
Posts: 3271
My guess is probably not. Those characters are OS specific (\t is tab in Windows, but I'm not sure what represents tab in UNIX). As Java tries to be "OS independent," I doubt you'll find anything that readily converts a char into something of that form. I'm guessing that, if you want to accomplish this, you're going to have to "roll your own" in one way or another.

Perhaps someone else has an idea, but that would be my guess.
 
bob connolly
Ranch Hand
Posts: 204
Thanks for the ideas!

Well, to be more specific, I'm trying to parse a Word doc, and there is a Unicode character called the currency sign, '\u00A4', which is being used as some kind of paragraph break, in addition to the standard '\n', '\r', etc.!

So I'm trying to figure out how to specify the logic to identify this Unicode character!

Right now I'm using the following statement: if (c=='\n' || c=='\t' || c=='\u00A4') but it doesn't seem to recognize that Unicode specification!

Thanks!
 
Tony Morris
Ranch Hand
Posts: 1608



Corey McGlone wrote:
"My guess is probably not. Those characters are OS specific (\t is tab in Windows, but I'm not sure what represents tab in UNIX)."


Rubbish.
Have a look at an ASCII table and the name of the character 0x09 (9).
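
If you would rather check from Java than from a printed table, a small sketch like the one below prints the codes directly. The class and method names (EscapeCodes, codeOf) are only illustrative; the values are the standard ASCII ones and do not depend on the operating system.

```java
class EscapeCodes {
    // The numeric code of a char is simply its value widened to int.
    static int codeOf(char c) {
        return c;
    }

    public static void main(String[] args) {
        System.out.println("\\t = " + codeOf('\t')); // 9 (0x09), horizontal tab
        System.out.println("\\n = " + codeOf('\n')); // 10 (0x0A), line feed
        System.out.println("\\f = " + codeOf('\f')); // 12 (0x0C), form feed
        System.out.println("\\r = " + codeOf('\r')); // 13 (0x0D), carriage return
    }
}
```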

[ July 27, 2004: Message edited by: Tony Morris ]
[ July 29, 2004: Message edited by: Tony Morris ]
 
Stefan Wagner
Ranch Hand
Posts: 1923
Well, ASCII is a standard across many platforms, and it is the same on Unix and Windows for 0-127 (7-bit).
And of course \t was already a tab on Unix before DOS was invented.

But \u00A4, which is 164 (decimal), is outside of that standard, and it is not whitespace, though it is perhaps invisible in ordinary editors.

164 needs at least 8 bits, so it is a character in the extended (8-bit) ASCII charsets.

Java characters are 16 bits, and \u00A4 is a 16-bit notation too.
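
To make that concrete, here is a small sketch (the class name CurrencySignCheck and the helper isBreak are just illustrative) showing that the comparison from the question works on a correctly decoded char, and that Java does not treat the currency sign as whitespace:

```java
class CurrencySignCheck {
    static final char CURRENCY_SIGN = '\u00A4';

    // The same test as in the question: newline, tab, or the currency sign.
    static boolean isBreak(char c) {
        return c == '\n' || c == '\t' || c == CURRENCY_SIGN;
    }

    public static void main(String[] args) {
        System.out.println((int) CURRENCY_SIGN);                   // 164
        System.out.println(Character.isWhitespace(CURRENCY_SIGN)); // false
        System.out.println(isBreak(CURRENCY_SIGN));                // true
    }
}
```

So if the test never matches, the char reaching it is probably not 164 at all, which points at how the bytes are being decoded.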

Perhaps you could use a hex editor to find the position where the ¤ is printed, and try to find out what Java is reading.
Perhaps you have to tell the InputStream which encoding to use?
Or ask which encoding it is actually using?
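
A sketch of both ideas, assuming the byte in the file really is 0xA4 (I have not checked an actual Word file, and the names EncodingDemo and firstChar are made up): wrap the stream in an InputStreamReader with an explicit charset, and compare what comes out under different encodings.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Decodes the first character of the raw bytes with the given charset
    // and returns its code, or -1 if there is nothing to read.
    static int firstChar(byte[] bytes, Charset charset) {
        try (Reader reader =
                 new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0xA4 }; // assumed raw byte for the currency sign

        // Decoded as ISO-8859-1, the byte maps straight to code 164.
        System.out.println(firstChar(data, StandardCharsets.ISO_8859_1));

        // Decoded as UTF-8, a lone 0xA4 is malformed and becomes the
        // replacement character, code 65533, so a test for 164 never matches.
        System.out.println(firstChar(data, StandardCharsets.UTF_8));

        // The encoding used when no charset is given explicitly.
        System.out.println(Charset.defaultCharset());
    }
}
```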

But I don't know which encoding Word docs use.
There is an open-source Apache API available for reading Excel and Word docs: POI and H?? (poor obfuscating interface / horrible ... ...).
 