
HTML Codes Parser

Jigar M Gohil

Joined: Dec 14, 2011
Posts: 25
Hi All,
I have a simple String (NOT XML) which might contain HTML character entities (for example &amp;, which stands for &).
I am looking for a parser which can scan the input string for such codes and replace them with the corresponding special characters.

Thanks in advance!!!
Paul Clapham

Joined: Oct 14, 2005
Posts: 19973

Not XML? Let's move this to the not-XML forum, then. Maybe it will get more exposure in a general Java forum.
Rob Spoor

Joined: Oct 27, 2005
Posts: 20279

You can just use java.util.regex.Pattern and java.util.regex.Matcher for this. Create a Pattern for the placeholders (&.+?; where the .+? is a non-greedy catch-all), look for all occurrences (for as long as the Matcher's find() method returns true), inspect each match and, if it's one you're looking for, replace it. You can use Matcher's appendReplacement and appendTail to build the final String. In a bit of pseudocode:
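
The snippet that belonged here is not preserved in this copy of the thread. As a stand-in, here is a minimal sketch of the find() / appendReplacement / appendTail loop described above; it is not the poster's code. The class name EntityReplacer, the method name unescape and the four-entry lookup table are assumptions made for illustration. The pattern is also narrowed from &.+?; to &#?\w+; so that a bare & earlier in the text cannot swallow a later entity; that narrowing is a choice made here, not part of the reply.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityReplacer {

    // Hypothetical lookup table: only the entities we choose to handle.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("&amp;", "&");
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
    }

    // '&', an optional '#', one or more word characters, then ';'.
    private static final Pattern PLACEHOLDER = Pattern.compile("&#?\\w+;");

    static String unescape(String input) {
        Matcher matcher = PLACEHOLDER.matcher(input);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String match = matcher.group();
            String replacement = ENTITIES.get(match);
            if (replacement == null) {
                // Not one we are looking for: put the matched text back unchanged.
                replacement = match;
            }
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        // Copy whatever follows the last match.
        matcher.appendTail(result);
        return result.toString();
    }
}
```

With this sketch, EntityReplacer.unescape("Tom &amp; Jerry") gives "Tom & Jerry", while an entity missing from the table, such as &nbsp;, is left as it is.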

g tsuji
Ranch Hand

Joined: Jan 18, 2011
Posts: 633
In the case where we know beforehand that only a limited number of HTML entities may appear, a regex approach may do just fine. But often, since the complete set of HTML entities is large, turning to some library or utility class seems necessary.

For the functionality sought here, Perl, say, has the HTML::Entities module to help. In Java, we can, for instance, call upon org.apache.commons.lang.StringEscapeUtils. For a quite arbitrary but valid HTML case study it may go like this.
I agree. Here's the link: