parsing html

 
Greenhorn
Posts: 5
hello

I seem to be really bad at searching the web so I'm going to throw myself at the mercy of you guys.

Here's what I'd like to get done -
there's a web site that I like; I want to extract information out of certain pages from it.
There are lots of pages, but the content is always embedded in a certain tag.

Now I can certainly do this just with the java.net classes and some string parsing. But I feel I should use a more elegant approach.

Could I get advice on a popular library, and maybe a quick pointer to which calls I need to make?

Thanks a million for any answer!
 
Greenhorn
Posts: 12
copy and paste?

If you want to write your own software, you really do have to walk through every line to look for the opening and closing tags. Good algorithms are rarely elegant. Number crunching and parsing through garbage is always ugly. Just be careful when you write it so you can understand your own code.

Wish I could be more helpful.
 
(instanceof Sidekick)
Posts: 8791
The java.net stuff is the right track for retrieving the page contents. Use a URL to get an HttpURLConnection, then read the content into a string buffer of some kind.
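Here is a minimal sketch of that retrieval step. The class name PageFetcher, the helper names, the UTF-8 charset and the ten-second timeouts are all assumptions made for illustration; a real page may declare a different encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class PageFetcher {

    // Reads a whole stream into a String. UTF-8 is an assumption here.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Opens an HTTP connection to the address and returns the response body.
    static String fetch(String address) throws IOException {
        URL url = URI.create(address).toURL();
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000); // milliseconds, arbitrary choice
        conn.setReadTimeout(10_000);
        try {
            return readAll(conn.getInputStream());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java PageFetcher.java <url>");
            return;
        }
        String html = fetch(args[0]);
        System.out.println(html.length() + " characters read");
    }
}
```

Keeping readAll separate from fetch means the reading logic can be tried on any InputStream, without touching the network.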

Fortunately the tricky work of parsing has largely been done for you. See the javax.swing.text.html.parser.Parser in the JDK. I also like the Quiotix HTML Parser because it has a neat visitor interface and can reproduce the HTML from the DOM. Google for other "java html parser" kits.
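For the JDK route, one way to drive the Swing parser is through ParserDelegator with an HTMLEditorKit.ParserCallback. The sketch below collects the text found inside one kind of tag; the class name TagTextCollector and the choice to join text runs with newlines are assumptions of mine, and the Swing parser only knows an old HTML dialect, so results on modern pages may vary.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that gathers the text inside a given tag.
class TagTextCollector extends HTMLEditorKit.ParserCallback {
    private final HTML.Tag target;
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // how many target tags are currently open

    TagTextCollector(HTML.Tag target) {
        this.target = target;
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t.equals(target)) {
            depth++;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t.equals(target) && depth > 0) {
            depth--;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Only keep text seen while inside the target tag.
        if (depth > 0) {
            text.append(data).append('\n');
        }
    }

    // Parses the HTML string and returns the collected text.
    static String extract(String html, HTML.Tag tag) throws IOException {
        TagTextCollector collector = new TagTextCollector(tag);
        try (Reader reader = new StringReader(html)) {
            // true = ignore charset declarations found in the document
            new ParserDelegator().parse(reader, collector, true);
        }
        return collector.text.toString();
    }
}
```

Something like TagTextCollector.extract(html, HTML.Tag.DIV) would then return the text of every div on the page, one run per line.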

All that may be overkill if you just need to substring the text between <tag> and </tag>.
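A rough sketch of that substring approach, assuming the opening tag carries no attributes and appears exactly as written; TagSubstring and between are made-up names:

```java
// Hypothetical helper, not from any library.
class TagSubstring {

    // Returns the text between the first <tagName> and the next </tagName>,
    // or null if either one is missing.
    static String between(String html, String tagName) {
        String open = "<" + tagName + ">";
        String close = "</" + tagName + ">";
        int start = html.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = html.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end);
    }
}
```

This breaks as soon as the tag has attributes, different casing, or nested copies of itself, which is where a real parser starts to pay off.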
 
Rancher
Posts: 43081
NekoXNI is actively maintained. You can leverage HttpUnit to do the work for you: it can retrieve the page, reformat it into valid HTML, and then give you access to the page elements. It uses NekoXNI internally.
Alternatively, if you know the precise formatting of the page, and it's a particular tag you're looking for, you could extract the interesting part with a regular expression.
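The regular expression idea could look roughly like this with java.util.regex. The class and method names are made up, and the pattern assumes the tag is never nested inside itself:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from any library.
class TagRegex {

    // Returns the contents of every <tagName ...>...</tagName> pair, in order.
    static List<String> extractAll(String html, String tagName) {
        String name = Pattern.quote(tagName);
        Pattern pattern = Pattern.compile(
                "<" + name + "\\b[^>]*>(.*?)</" + name + "\\s*>",
                // Tags may be upper case and content may span several lines.
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        List<String> results = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            results.add(matcher.group(1));
        }
        return results;
    }
}
```

The lazy (.*?) stops at the first closing tag, so nested tags of the same name would be cut short; that is the usual limit of regex-based extraction.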
[ July 20, 2005: Message edited by: Ulf Dittmer ]
 
Ricky Gentry
Greenhorn
Posts: 12
I didn't know that existed.

You learn something new every day.