ANSI to plain text for regex

Piter Smith
Ranch Hand

Joined: Feb 25, 2009
Posts: 55
For a MUD (Multi User Dungeon, a text-based game) client, ANSI is often used. What is the terminology, and the idiom, for converting what I believe is the encoding?

I've read http://en.wikipedia.org/wiki/ANSI_escape_code and other reference material, but don't find it particularly helpful. The basic idea is to strip out color or other formatting so that the resulting string is plain text. What I'm dealing with is a Queue<Character>:





which processes the InputStream, char by char. (not line by line, because a telnet prompt will not have a CRLF for end of line.) For printing to the screen, I want to keep the encoding, which works fine currently. For doing regex, however, the different formattings are causing difficulties.
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1067
    

I really can't see what you are getting at.

Your code as it stands will read the InputStream a byte at a time, treating each byte as a character. The first byte/character, and only the first byte/character, will be put into the StringBuilder, where it sits until the thread exits: the StringBuilder is declared locally and never used again inside the run() method, so it cannot be accessed outside the run() method. All the rest of the bytes/characters are pushed to the queue.

Now that is what it does but you don't really say what it is supposed to do and I really don't understand where regex comes into this!

P.S. Since characters are unsigned the test "ch >= 0" will always be true so does nothing.
Piter Smith
Ranch Hand

Joined: Feb 25, 2009
Posts: 55
Richard Tookey wrote:
I really can't see what you are getting at.

Your code as it stands will read the InputStream a byte at a time treating each byte as a character.


Exactly as designed.

Richard Tookey wrote:
The first byte/character and only the first byte/character will be put into the StringBuilder where it sits until the thread exits since the StringBuilder is declared locally and never used again inside the run() method so cannot be accessed outside the run() method. All the rest of the bytes/characters are pushed to the queue.


Good

Richard Tookey wrote:
Now that is what it does but you don't really say what it is supposed to do and I really don't understand where regex comes into this!


The regex is in a different class and works only on Strings, this was just to show where the characters come from (Apache TelnetClient InputStream).

Richard Tookey wrote:
P.S. Since characters are unsigned the test "ch >= 0" will always be true so does nothing.


Good, it's for a MUD client, and the InputStream runs indefinitely.

The question is how to get those characters from telnet ANSI to plain text. The dilemma is that regular expressions need to run on the char data (which another class processes into a String), but that data is not plain text.

The code was just to show how the InputStream was being processed. Another class processes the character data, etc.

Here's the type of text I'm working with:


http://www.mudpedia.org/mediawiki/index.php/ANSI_colors


Pardon me for not including this link earlier.





-Thufir
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1067
    

Piter Smith wrote:
Richard Tookey wrote:
I really can't see what you are getting at.

Your code as it stands will read the InputStream a byte at a time treating each byte as a character.


Exactly as designed.

I can't see why you need to treat the byte as a character at this early stage. It is obvious from your ANSI citation that you are going to have to filter out the ANSI control characters, so why not do it as bytes, since at that point they are bytes and not characters?

Richard Tookey wrote:
The first byte/character and only the first byte/character will be put into the StringBuilder where it sits until the thread exits since the StringBuilder is declared locally and never used again inside the run() method so cannot be accessed outside the run() method. All the rest of the bytes/characters are pushed to the queue.


Good

Are you being sarcastic? If you want to ignore the first byte then just don't use it! There is no point in storing it.

Richard Tookey wrote:
Now that is what it does but you don't really say what it is supposed to do and I really don't understand where regex comes into this!


The regex is in a different class and works only on Strings, this was just to show where the characters come from (Apache TelnetClient InputStream).

Sorry, but that does not make sense! Are you saying that you only want to know how to convert the queue of Character to a String? If so, then why show this code fragment at all, since how you obtained the queue is irrelevant in the context of converting it to strings?


Richard Tookey wrote:
P.S. Since characters are unsigned the test "ch >= 0" will always be true so does nothing.


Good, it's for a MUD client, and the InputStream runs indefinitely.

I understood that ! I was pointing out that
could be replaced by

The question is, how to get those characters from telnet ANSI to plain text. The dilemma is, to run regular expressions on the char data (which another class processes into a String), the encoding is not plain text.


The fact that it is 'telnet' is irrelevant and just MUDdies the water. It is just a stream of bytes representing characters interspersed with ANSI codes. I don't understand the cited ANSI colour codes, since there is nothing in the citation to say how code values greater than 32 may be differentiated from the printable characters with values greater than 32. Presumably there is some lead-in code sequence similar to the good old VT100 escape sequences. If so, then you will have to write a simple parser that recognises the sequences.

I don't understand the dilemma ! Using a parser or something similar, remove anything from the stream of bytes that is not bytes and then convert to a String. The problem I see is that Java regex work on whole strings and not on sequences so you will need to break your character queue into strings based on some sensible end-of-record.
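As a rough illustration of the "sensible end-of-record" idea, here is a minimal sketch that drains a Queue<Character> into complete lines. The class and method names (RecordSplitter, drain, pendingText) are made up for this example and are not from the original code, and it assumes '\n' marks the end of a record. Each returned String could then be handed to a java.util.regex.Matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical helper for illustration only; not part of the original post.
final class RecordSplitter {
    // Characters of a line that has not been terminated yet (e.g. a prompt).
    private final StringBuilder pending = new StringBuilder();

    // Drains the queue and returns every complete line seen so far.
    // Assumption: '\n' ends a record; a '\r' directly before it is dropped.
    List<String> drain(Queue<Character> queue) {
        List<String> lines = new ArrayList<>();
        Character boxed;
        while ((boxed = queue.poll()) != null) {
            char ch = boxed;
            if (ch == '\n') {
                int len = pending.length();
                if (len > 0 && pending.charAt(len - 1) == '\r') {
                    pending.setLength(len - 1);
                }
                lines.add(pending.toString());
                pending.setLength(0);
            } else {
                pending.append(ch);
            }
        }
        return lines;
    }

    // Text received after the last line terminator, such as a telnet prompt.
    String pendingText() {
        return pending.toString();
    }
}
```

Because a telnet prompt may never be followed by a line terminator, the unterminated tail is kept separately in pendingText() rather than being lost or held back forever; whether to run a regex on that tail is a design decision for the client.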

Piter Smith
Ranch Hand

Joined: Feb 25, 2009
Posts: 55
Richard Tookey wrote:
I don't understand the dilemma ! Using a parser or something similar, remove anything from the stream of bytes that is not bytes and then convert to a String. The problem I see is that Java regex work on whole strings and not on sequences so you will need to break your character queue into strings based on some sensible end-of-record.



The Queue<Character> is used for two purposes. Perhaps at cross-purposes? Orthogonal? Firstly, it's just used (later) to print to the console, so the ANSI encodings should be kept. However, it's also used, later, for parsing.

That being said, I didn't know that you could remove from the stream of bytes anything that isn't a byte...? This will, for my purposes, leave me with just plain text? Ok, that's very interesting! I'll look into that.

The class that handles regex expects and receives a String, of course. Right now, that String is full of ANSI gibberish. So, if you have a String of ANSI text, how do you get plain text from it? Or, alternately (and I don't know what you mean by this), remove from the stream of bytes anything that is not bytes and convert (what's left?) to a String?
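For the String case, one possible approach is to delete the escape sequences with a regular expression before matching. The sketch below assumes the colour codes are standard CSI sequences (ESC, then '[', then parameters, then a final letter such as 'm'); the class name AnsiText and the method strip are invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the original post.
final class AnsiText {
    // ESC '[' followed by parameter bytes (0x30-0x3F), intermediate bytes
    // (0x20-0x2F) and one final byte (0x40-0x7E), e.g. ESC[1;31m or ESC[0m.
    private static final Pattern CSI =
            Pattern.compile("\\x1B\\[[\\x30-\\x3F]*[\\x20-\\x2F]*[\\x40-\\x7E]");

    // Returns a copy of raw with every CSI sequence removed.
    static String strip(String raw) {
        return CSI.matcher(raw).replaceAll("");
    }
}
```

The original String is left untouched, so the coloured version can still go to the console while only the stripped copy is passed to the regex class. Other kinds of escape sequence, if the MUD sends any, would need their own patterns.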


-Thufir
Tony Docherty
Bartender

Joined: Aug 07, 2007
Posts: 2332
    
That being said, I didn't know that you could remove from the stream of bytes anything that isn't a byte...?

Err, how can a stream of bytes contain something that isn't a byte? If it did, it wouldn't be a stream of bytes.

You need to find out the byte value(s) that ANSI has defined to mean the following bytes/chars represent a colour code. You then scan your input for any such values and when you find one you don't add that byte or the following n bytes to the buffer that is holding the bytes to be converted to a string.
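The scan-and-skip approach described above might look roughly like the following. This is a sketch under stated assumptions: the lead-in is taken to be ESC (0x1B) followed by '[', and the number of bytes to skip is found by reading up to the sequence's final byte (0x40-0x7E) rather than using a fixed n. The names AnsiByteFilter and toPlainText are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of the original post.
final class AnsiByteFilter {
    private static final int ESC = 0x1B;

    private static boolean inRange(byte b, int low, int high) {
        int v = b & 0xFF;
        return v >= low && v <= high;
    }

    // Copies input to a buffer, leaving out ESC '[' ... <final byte> sequences,
    // then decodes what is left. Assumes a sequence is not split across calls.
    static String toPlainText(byte[] input, Charset charset) {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        int i = 0;
        while (i < input.length) {
            boolean startsSequence = (input[i] & 0xFF) == ESC
                    && i + 1 < input.length && input[i + 1] == '[';
            if (startsSequence) {
                i += 2; // skip ESC and '['
                // skip parameter and intermediate bytes
                while (i < input.length && inRange(input[i], 0x20, 0x3F)) {
                    i++;
                }
                // skip the final byte that ends the sequence
                if (i < input.length && inRange(input[i], 0x40, 0x7E)) {
                    i++;
                }
            } else {
                plain.write(input[i] & 0xFF);
                i++;
            }
        }
        return new String(plain.toByteArray(), charset);
    }
}
```

An ESC that is not followed by '[' is copied through unchanged here; a real client would probably also want to handle sequences that arrive split across two reads, which this sketch does not attempt.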
Piter Smith
Ranch Hand

Joined: Feb 25, 2009
Posts: 55
Tony Docherty wrote:
That being said, I didn't know that you could remove from the stream of bytes anything that isn't a byte...?

Err, how can a stream of bytes contain something that isn't a byte? If it did, it wouldn't be a stream of bytes.

You need to find out the byte value(s) that ANSI has defined to mean the following bytes/chars represent a colour code. You then scan your input for any such values and when you find one you don't add that byte or the following n bytes to the buffer that is holding the bytes to be converted to a string.



Huh, ok, I'll look into that.
 