
how to do in regular expressions

 
adithi gudipudi
Greenhorn
Posts: 25
Hello all,
I have a String that contains plain text and XML/HTML tags. I want to keep the text but remove the tags (I know the tag names), and I am planning to use regular expressions. For example, given
String test ="hello <H1 att1 =\"test\" >Heading</H1> ";
I want the output to be "hello Heading ".

So I did
test.replaceAll("</*[Hh]1.*>", "");
Is this the correct way of doing it? Also, if I wanted to replace an & in the text with **, how can I do that, since both characters are used in defining regular expressions?
Thanks much,
Adithi.
 
chi Lin
Ranch Hand
Posts: 348
Hi,
To achieve your goal, you need to use a back reference to capture the part that matches "Heading", then use $1 in the replacement string.
To extract "Heading" from <H1 att1 =\"test\" >Heading</H1>, you need a regex that matches the whole string, with a pair of parentheses around the part of the regex that matches "Heading"; that part is remembered as $1.
The operation could be:
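A minimal sketch of such a call, assuming the tag name is known to be H1. The pattern, the class name StripHeading and the method name stripH1 are illustrative assumptions, not the code from the original post:

```java
class StripHeading {
    // Keeps the text between <H1 ...> and </H1> (capture group 1) and drops the tags.
    static String stripH1(String input) {
        // <[Hh]1[^>]*>   opening tag, with any attributes up to the first '>'
        // (.*?)           group 1: the text between the tags, matched reluctantly
        // </[Hh]1>        closing tag
        // Strings are immutable, so replaceAll returns a new String.
        return input.replaceAll("<[Hh]1[^>]*>(.*?)</[Hh]1>", "$1");
    }

    public static void main(String[] args) {
        String test = "hello <H1 att1 =\"test\" >Heading</H1> ";
        // Brackets make the trailing space visible.
        System.out.println("[" + stripH1(test) + "]"); // prints [hello Heading ]
    }
}
```

Note that [^>]* stops at the first '>' and .*? stops at the first closing tag, so the match does not run on to a later '>' in the string the way a greedy .* would.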

HTH
 
adithi gudipudi
Greenhorn
Posts: 25
Chi,
I did not understand what you did. I don't want to display the attributes either. If I have
String test1 = "hello all <font face= \"verdana\" > regular expression </font>";
I want to display "hello all". Can this be done with a regular expression, or should I use StringTokenizer and scan each token separately?
Thanks
Adithi
 
chi Lin
Ranch Hand
Posts: 348
adithi,
What I did was divide the whole regex into subgroups so we can focus on each one.
In your case, a regex is better than StringTokenizer (I think). The thing is, you need to match the context before you can preserve something in the middle.
That is, we use a regex to match <H1 att1 =\"test\" >Heading</H1>, then preserve "Heading" using the back reference $1.
In my previous post, the things marked "some reg" are the expressions you put in to match each part of the string.
After the match is found, we use the $1 that came from the process to replace the match. So
<H1 att1 =\"test\" >Heading</H1> becomes "Heading" and
test becomes "hello Heading ".
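To make the "replace the match" step concrete for the font example, here is a rough sketch. The pattern and the names ReplaceMatch, keepText and dropElement are assumptions made up for illustration. Replacing the match with $1 keeps the inner text, while replacing it with an empty string drops the whole element, contents included:

```java
class ReplaceMatch {
    // Matches a whole <font ...>...</font> element; group 1 is the text inside it.
    static final String FONT = "<font[^>]*>(.*?)</font>";

    // Tags and attributes are removed, the text between the tags is kept.
    static String keepText(String s) {
        return s.replaceAll(FONT, "$1");
    }

    // The whole element, including its text, is removed.
    static String dropElement(String s) {
        return s.replaceAll(FONT, "");
    }

    public static void main(String[] args) {
        String test1 = "hello all <font face= \"verdana\" > regular expression </font>";
        System.out.println("[" + keepText(test1) + "]");    // prints [hello all  regular expression ]
        System.out.println("[" + dropElement(test1) + "]"); // prints [hello all ]
    }
}
```

The attributes never show up in either result, because they are part of the match but outside the parentheses.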
 
adithi gudipudi
Greenhorn
Posts: 25
Chi,
But when you say $1, doesn't that mean only the last string will be stored?
If I had
<H1 att1 =\"test\" >Heading TTTT </H1>
with the regular expression you suggested, would the output be Heading TTT?
Thanks
 
chi Lin
Ranch Hand
Posts: 348
Yes, the answer would be "Heading TTTT" (4T here).
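One way to check this is to look at what group 1 actually holds, using Pattern and Matcher from java.util.regex. This is only a sketch: the pattern is an assumed one (the regex suggested earlier in the thread is not shown), and GroupCapture and firstGroup are made-up names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCapture {
    // Assumed pattern: group 1 is everything between the opening and closing H1 tags.
    private static final Pattern H1 = Pattern.compile("<[Hh]1[^>]*>(.*?)</[Hh]1>");

    // Returns the text captured by group 1, or null when there is no H1 element.
    static String firstGroup(String input) {
        Matcher m = H1.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "<H1 att1 =\"test\" >Heading TTTT </H1>";
        // The group holds the whole text between the tags, spaces included.
        System.out.println("[" + firstGroup(s) + "]"); // prints [Heading TTTT ]
    }
}
```

So $1 is not "the last word": it is whatever the parenthesized part of the regex matched, which here is all the text between the two tags.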

 