If you want to control very precisely how the HTML is converted, you could use a library that reads HTML and gives you a DOM tree. NekoHTML and JTidy are two such libraries. Alternatively, you could use regular expressions to search and replace angle brackets.
This short article about Visitor Pattern has a reference to the Quiotix HTML parser. The visitor would be a neat way to go through all the nodes in the HTML DOM and write out text or newlines. I just have a bias against the complexity in walking most DOMs.
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
Chinh Tran Nam
posted 15 years ago
I found this library from SourceForge. It works fairly good; however, there is still a problem parsing duplicate tags (e.g more than one <style> blocks in a html document).