
Java HTML parser

 
Tim West
Ranch Hand
Posts: 539
Hi all,
I'm posting this here rather than in the XML or HTML forums because it concerns Java parsing libraries, not XML or HTML structure or syntax.
My question is: what is a good free (ideally open source) HTML parsing library? I want to pull pages off the net and search them for specific data. I've considered:
  • Apache Crimson (behind SAX). This is no good because it's too strict: if tags don't match up (as they so often don't in HTML), it barfs.
  • The HTML parser in javax.swing.text.html.parser, but this isn't suitable. For example, it can't handle lowercase letters in tags (i.e. <a> rather than <A>).
  • Using regexps rather than a parser to find my data. But this quickly becomes absurdly difficult if what I'm looking for is remotely complex.


So, does anyone have tips on either (1) how I can make Crimson more error-tolerant, or (2) which other library I should be using?
Cheers all,
--Tim
     
    Tim West
    Ranch Hand
    Posts: 539
    An addition... I've just found Xerces-J. Does anyone know if it's appropriate? I'm guessing it's good quality, given that it's an Apache project...
    [Update - 5 mins later]
    It's no good (for what I want to do)...to quote their "common problems" section...

    Unfortunately, HTML does not, in general, follow the XML grammar rules. Most HTML files do not meet the XML style guidelines. Therefore, the XML parser generates XML well-formedness errors.
    (...)
    HTML must match the XHTML standard for well-formedness before it can be parsed by Xerces-J or any other XML parser. You can find the XHTML standard on the W3C web site.
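
To make the quoted point concrete, here is a minimal sketch of the failure mode. It assumes whatever JAXP XML parser ships with the JDK rather than a standalone Xerces-J download, and the class name WellFormedCheck and the sample strings are made up for illustration. An unclosed <br>, which browsers accept happily, is a fatal well-formedness error for an XML parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string parses as well-formed XML.
class WellFormedCheck {

    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations surface as SAXException (SAXParseException).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical "tag soup": <br> is never closed.
        System.out.println(isWellFormed("<p>Hello<br>world</p>"));  // false
        // The XHTML-style equivalent closes the empty element.
        System.out.println(isWellFormed("<p>Hello<br/>world</p>")); // true
    }
}
```

So any XML parser used this way will reject most real-world pages unless they are cleaned up into XHTML first, which is why a tolerant HTML-specific parser looks like the better route.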

    Now I'm looking at Jericho (http://sourceforge.net/projects/jerichohtml/)...sounds like there's some potential there.
    -Tim
    [ April 02, 2004: Message edited by: Tim West ]
     
    Stan James
    (instanceof Sidekick)
    Ranch Hand
    Posts: 8791
    I use the Quiotix parser. There are others out there, probably better supported. I liked this one because it supports the Visitor pattern in a way that made my life very easy.
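
For anyone who hasn't used the Visitor pattern with a parse tree, a rough sketch of the idea follows. This is not the parser's real API: every type here (HtmlNode, TagNode, TextNode, HtmlVisitor, LinkCollector, VisitorDemo) is hypothetical, and the tree is built by hand instead of by a parser. The point is only that the data-extraction logic lives in one small visitor class while the tree does the walking.

```java
import java.util.ArrayList;
import java.util.List;

// All types below are hypothetical stand-ins for what a parser might produce.
interface HtmlVisitor {
    void visitTag(TagNode tag);
    void visitText(TextNode text);
}

interface HtmlNode {
    void accept(HtmlVisitor visitor);
}

final class TagNode implements HtmlNode {
    final String name;
    final String href; // null when the tag has no href attribute
    final List<HtmlNode> children = new ArrayList<>();

    TagNode(String name, String href) {
        this.name = name;
        this.href = href;
    }

    TagNode add(HtmlNode child) {
        children.add(child);
        return this;
    }

    @Override
    public void accept(HtmlVisitor visitor) {
        visitor.visitTag(this);          // visit this node first...
        for (HtmlNode child : children) {
            child.accept(visitor);       // ...then each child, depth-first
        }
    }
}

final class TextNode implements HtmlNode {
    final String text;

    TextNode(String text) {
        this.text = text;
    }

    @Override
    public void accept(HtmlVisitor visitor) {
        visitor.visitText(this);
    }
}

// The only class that knows what data we are looking for.
final class LinkCollector implements HtmlVisitor {
    final List<String> links = new ArrayList<>();

    @Override
    public void visitTag(TagNode tag) {
        // Case-insensitive match, so <a> and <A> are treated alike.
        if ("a".equalsIgnoreCase(tag.name) && tag.href != null) {
            links.add(tag.href);
        }
    }

    @Override
    public void visitText(TextNode text) {
        // Text is irrelevant when collecting links.
    }
}

class VisitorDemo {
    static List<String> collectLinks(HtmlNode root) {
        LinkCollector collector = new LinkCollector();
        root.accept(collector);
        return collector.links;
    }

    // Hand-built tree for: <body><a href=...>first</a><p><A href=...>second</A></p></body>
    static HtmlNode sampleTree() {
        return new TagNode("body", null)
                .add(new TagNode("a", "http://example.com/a").add(new TextNode("first")))
                .add(new TagNode("p", null)
                        .add(new TagNode("A", "http://example.com/b").add(new TextNode("second"))));
    }

    public static void main(String[] args) {
        System.out.println(collectLinks(sampleTree()));
    }
}
```

Pulling out different data then means writing another small visitor, with no changes to the tree classes or the traversal.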
     