get xml's cdata using saxparser
Hi. I'm building an RSS reader and I want to get the text content of a CDATA section. I've managed to get whatever is contained inside the CDATA, but I'm only interested in the text, without, for example, links, image src attributes, or greater-than/less-than signs. I've tried to do that up to a point using regex, but it becomes complex. Is there another way to do this, or is regex the only solution?
Hi Tasos, welcome to the Ranch!

Am I correct in guessing that what you get out of the CDATA is some kind of HTML data? And that it isn't necessarily well-formed HTML?

If so, then regex isn't going to be very useful in extracting the text and discarding the markup. Regular expressions don't work well with languages with recursive grammars like HTML. So what I suggest is that you should get an HTML parser and parse the contents of the string. Then extract only the text nodes from the parsed HTML and discard everything else.
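To make the "parse it, then keep only the text nodes" idea concrete, here is a minimal sketch. It assumes the lenient HTML parser that ships with the JDK in javax.swing.text.html is acceptable; the thread doesn't name a particular parser, so that choice is mine, and the class name HtmlTextExtractor and the sample string are made up for illustration. A dedicated HTML parsing library would follow the same pattern.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: keeps the text nodes of an HTML fragment and drops the markup.
public class HtmlTextExtractor {

    public static String extractText(String html) throws IOException {
        StringBuilder text = new StringBuilder();

        // The parser calls handleText once per run of text it finds.
        // Tags such as <a> and <img> go to other callbacks, which we leave as no-ops.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // A space is appended after every chunk so that neighbouring
                // text nodes don't run together (this also splits a word
                // that has a tag in the middle of it).
                text.append(data).append(' ');
            }
        };

        // true = ignore any charset declared in the markup; we already have a String.
        new ParserDelegator().parse(new StringReader(html), callback, true);

        // Collapse whitespace runs left over from the joining above.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample of what might come out of an RSS item's CDATA section.
        String cdata = "<p>Read <a href=\"http://example.com\">this post</a>"
                     + "<img src=\"pic.png\"> today</p>";
        System.out.println(extractText(cdata));
    }
}
```

You would call extractText on the string your SAX handler has collected from the CDATA section. The parser also decodes entities such as &amp;gt; for you, which is one more thing you don't have to handle with a regex.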

 
Paul Clapham wrote: Hi Tasos, welcome to the Ranch!

Am I correct in guessing that what you get out of the CDATA is some kind of HTML data? And that it isn't necessarily well-formed HTML?

If so, then regex isn't going to be very useful in extracting the text and discarding the markup. Regular expressions don't work well with languages with recursive grammars like HTML. So what I suggest is that you should get an HTML parser and parse the contents of the string. Then extract only the text nodes from the parsed HTML and discard everything else.



I've managed to do it with regex for now. Besides, my CDATA comes from an RSS feed and there are only a couple of links and pictures. Thanks for your help.
Yes, if you're only getting simple and predictable data then you can make a regex work. But later if you find the data is not as predictable, or it is more complex, you may find that you can't make a working regex any more.
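Here is a small illustration of how that can go wrong. The pattern below is only a guess at the kind of regex people typically write for this (the one actually used wasn't posted), and the class name NaiveTagStripper is made up. It handles simple markup, but an attribute value containing a greater-than sign is enough to break it.

```java
// Hypothetical example of a naive "strip the tags" regex and where it fails.
public class NaiveTagStripper {

    // Removes anything that looks like <...>, assuming '>' never appears inside a tag.
    static String strip(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        // Simple, predictable markup: works as intended.
        System.out.println(strip("<p>Hi <b>there</b></p>"));
        // prints: Hi there

        // A '>' inside an attribute value ends the match too early,
        // so part of the tag leaks into the result.
        System.out.println(strip("<a title=\"a > b\">link</a>"));
        // prints:  b">link
    }
}
```

You can keep patching the pattern for each new case, but a parser already knows where a tag really ends.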


