This week's book giveaway is in the Java in General forum. We're giving away four copies of Helidon Revealed: A Practical Guide to Oracle’s Microservices Framework and have Michael Redlich on-line! See this thread for details.
I'm looking for sample codes which remove all html tags from a html document and return plain-text content only. That codes should replace <br> or tags with "\n". Please help.
If you want to control very precisely how the HTML is converted, you could use a library that reads HTML and gives you a DOM tree. NekoHTML and JTidy are two such libraries. Alternatively, you could use regular expressions to search and replace angle brackets.
This short article about Visitor Pattern has a reference to the Quiotix HTML parser. The visitor would be a neat way to go through all the nodes in the HTML DOM and write out text or newlines. I just have a bias against the complexity in walking most DOMs.
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
I found this library from SourceForge. It works fairly good; however, there is still a problem parsing duplicate tags (e.g more than one <style> blocks in a html document).