
Removing tags from a string

 
Chris Parkinson
Greenhorn
Posts: 8
Hi, I'm really hoping someone can help me out here!
I have a string that contains HTML tags.
I would like to ignore all the tags (including what's inside them) and end up with just the other characters in a new string.
Is there any way of doing this?
Thanks
 
john smith
Ranch Hand
Posts: 75
If you know your HTML will be well-formed (which unfortunately isn't guaranteed for HTML), you could just parse it as XML.
You also have access to the java.util.regex package as of JDK 1.4; you'd probably be able to knock something up using that.
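A minimal sketch of the regex route, assuming a tag is anything from a '<' up to the next '>'. The class and method names (TagStripper, stripTags) are made up for illustration. It will not cope with a '>' inside an attribute value, and it leaves the text content of script or style elements in place, so treat it as a quick hack rather than an HTML parser:

```java
import java.util.regex.Pattern;

public class TagStripper {
    // '<', then any run of characters that are not '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Hypothetical helper: returns the input with every tag-like span removed.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Prints: Hello world
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

Using [^>]* instead of .* keeps each match inside a single tag, so text between two tags is never swallowed.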
 
Chris Parkinson
Greenhorn
Posts: 8
OK, thanks.
So, for example, if my string is:
s1 = { <body> a number of words </body> }
I want to create a new string from it containing:
s2 = { a number of words }
I think I need to use regular expressions. So when reading down the string, I write all characters until I reach a '<'. Then I carry on reading, without writing, until I reach a '>'. After this I write all characters to the new string until I again reach a '<'.
Could anybody possibly help me implement this? I'm not sure how to read the string in the first place.
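The read-and-skip idea described above does not actually need regular expressions. Here is a rough sketch that walks the string one character at a time with charAt. The names ManualTagSkipper and skipTags are invented for this example, and StringBuilder assumes Java 5 or later (on JDK 1.4, StringBuffer would be the drop-in substitute):

```java
public class ManualTagSkipper {
    // Hypothetical helper: copies every character that is not inside <...>.
    static String skipTags(String s1) {
        StringBuilder s2 = new StringBuilder();
        boolean insideTag = false;
        for (int i = 0; i < s1.length(); i++) {
            char c = s1.charAt(i);
            if (c == '<') {
                insideTag = true;      // start skipping
            } else if (c == '>') {
                insideTag = false;     // stop skipping; the '>' itself is dropped
            } else if (!insideTag) {
                s2.append(c);          // ordinary text: keep it
            }
        }
        return s2.toString();
    }

    public static void main(String[] args) {
        // Prints " a number of words " (the spaces around the words are kept)
        System.out.println("\"" + skipTags("<body> a number of words </body>") + "\"");
    }
}
```

One side effect of this simple version: a stray '>' that appears outside any tag is also dropped.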
 
Chris Parkinson
Greenhorn
Posts: 8
I think someone else has just posted something similar.
I'll try to understand that:
http://www.coderanch.com/forums/
 
Michael Dunn
Ranch Hand
Posts: 4632
Try this
 
Chris Parkinson
Greenhorn
Posts: 8
Thanks for your reply.
Now if I have a string:
s1 = "some words @;word@; some more words"
I want to create a new string:
s2 = "some words some more words"
So in summary I want to ignore the @; markers and whatever is between them.
I've been looking at java.util.regex and I'm still confused.
Would I be correct in trying to match the pattern: @; * @;
I know this is not correct. How do I match the pattern:
@; (0 or many characters) @;
Here are the repetition characters for the package:
? Matches the preceding element zero or one times.
+ Matches the preceding element one or more times.
* Matches the preceding element zero or more times.
{n} Matches the preceding element exactly n times.
{min,max} Matches the preceding element at least min and at most max times (inclusive).
{min,} Matches the preceding element min or more times.
Any help will be gratefully received.
[ March 15, 2004: Message edited by: Chris Parkinson ]
 
Dirk Schreckmann
Sheriff
Posts: 7023
Moving this to the Intermediate forum...
 
Chris Parkinson
Greenhorn
Posts: 8
Hope this might help:
To match any single lowercase letter of the English alphabet, it's possible to specify a pattern such as:
"[abcdefghijklmnopqrstuvwxyz]"
Fortunately, a few "shortcut" regular expression constructs are available. To match a range of characters, the "-" character can be used. So, "[a-z]" describes the same set of characters as above.
The regular expression syntax allows two styles to describe the union of two or more character classes. "[a-c[f-h]]" and "[a-cf-h]" both describe the character class "[abcfgh]".
Other "shortcut" constructs include these often-used predefined character classes:
. matches any single character (line terminators are matched only in DOTALL mode)
\d matches a digit: [0-9]
\D matches a non-digit: [^0-9] *
\s matches a whitespace character: [ \t\n\x0B\f\r]
\S matches a non-whitespace character: [^\s] *
\w matches a word character: [a-zA-Z_0-9]
\W matches a non-word character: [^\w] *
* "^" at the start of a character class is the NOT operator: it negates the class.
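A small sketch for trying these classes out in Java. The class and method names (CharClassDemo, inUnion, maskDigits, underscoreWhitespace) are invented for the example. Note that each backslash has to be doubled inside a Java string literal, so \d is written "\\d":

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // True if s is exactly one character from the union class [a-c[f-h]].
    static boolean inUnion(String s) {
        return Pattern.matches("[a-c[f-h]]", s);
    }

    // Replaces every digit (\d) with '#'.
    static String maskDigits(String s) {
        return s.replaceAll("\\d", "#");
    }

    // Replaces every whitespace character (\s) with '_'.
    static String underscoreWhitespace(String s) {
        return s.replaceAll("\\s", "_");
    }

    public static void main(String[] args) {
        System.out.println(inUnion("g"));                    // true
        System.out.println(inUnion("d"));                    // false
        System.out.println(maskDigits("a1 b22"));            // a# b##
        System.out.println(underscoreWhitespace("a b\tc"));  // a_b_c
    }
}
```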
 
Leslie Chaim
Ranch Hand
Posts: 336
"Would I be correct in trying to match the pattern: @; * @;"
No! A quantifier governs the expression immediately before it, which in your case is a space (\040 in octal). So @; * @; matches @;, then zero or more spaces, then one more space, then @;.
"I know this is not correct, how do I match the pattern:
@; (0 or many characters) @;"

If you want to master regex, then you must always ask: why is this not correct?
Anyway, the pattern you need is:
@;.*?@;
The '.' matches any character, and the '*' quantifier after it says to match the preceding element any number of times. The '?' modifies the '*' so that it is non-greedy (lazy): it matches as little as possible, so it will not gobble up everything to the last @; in the string.
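To see the difference in practice, here is a small sketch using String.replaceAll (available since JDK 1.4). The class and method names (MarkerRemover, removeLazy, removeGreedy) are made up for the example:

```java
public class MarkerRemover {
    // Lazy: each @;...@; pair is removed on its own.
    static String removeLazy(String s) {
        return s.replaceAll("@;.*?@;", "");
    }

    // Greedy: one match runs from the first @; to the last @;.
    static String removeGreedy(String s) {
        return s.replaceAll("@;.*@;", "");
    }

    public static void main(String[] args) {
        String two = "a @;x@; b @;y@; c";
        System.out.println("[" + removeLazy(two) + "]");    // [a  b  c]
        System.out.println("[" + removeGreedy(two) + "]");  // [a  c]
    }
}
```

Note that the spaces on either side of each removed block both survive, so the result has double spaces where a block used to be. Also, '.' does not match line terminators by default, so a block that spans several lines would need the pattern compiled with Pattern.DOTALL.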
HTH
 