Hello ranchers,
I am developing an application/web service for RSS feed generation from HTML pages listing news stories. At this stage, I am looking into HTML scraping for extracting the necessary information.
With the help of an online example I can get the links in an HTML document using javax.swing.text.html. For the purposes of this post I used http://www.slashdot.org as a reference news page. Now, looking at the console output of document.dump(System.out), I am getting quite confused when it comes to extracting the news title, content, author, etc. As I understand it, I am looking for a way to get the [beginning, end] values for each news title, story and author. I tried iterating through the DIV tags and their CLASS attributes but I didn't get anywhere. It rather felt like walking in the dark.
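To make the DIV/CLASS idea concrete, here is a minimal sketch of the kind of iteration I have in mind, using HTMLEditorKit.ParserCallback with ParserDelegator from the JDK. The class name DivTextScraper, the helper textOfDivs and the CLASS value passed to it are hypothetical names for illustration only; nothing here assumes Slashdot's actual markup, and I am not sure this is the right approach.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of any library, just a sketch.
final class DivTextScraper {

    /** Collects the text inside every DIV whose CLASS attribute equals targetClass. */
    static List<String> textOfDivs(Reader html, String targetClass) throws IOException {
        List<String> results = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // DIV nesting depth counted from a matching DIV; 0 means "outside".
            private int depth = 0;
            private final StringBuilder current = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.DIV) {
                    return;
                }
                if (depth > 0) {
                    depth++; // nested DIV inside a matching one
                } else if (targetClass.equals(attrs.getAttribute(HTML.Attribute.CLASS))) {
                    depth = 1; // pos is the offset where this DIV starts
                    current.setLength(0);
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    if (current.length() > 0) {
                        current.append(' ');
                    }
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                // pos here would be the end offset of the matching DIV
                if (tag == HTML.Tag.DIV && depth > 0 && --depth == 0) {
                    results.add(current.toString().trim());
                }
            }
        };

        // true = ignore any charset declaration in the document
        new ParserDelegator().parse(html, callback, true);
        return results;
    }
}
```

Calling DivTextScraper.textOfDivs(new StringReader(pageSource), "title") would then return one string per matching DIV, where pageSource and "title" are placeholders. The pos arguments passed to the callbacks look like the [beginning, end] offsets I am after, but what I cannot work out is which DIV/CLASS combinations correspond to the title, story and author.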
Any examples/documentation on how to use the parser to step into the content in such a way would be much appreciated.
NOTE: As I am only just beginning with this, I'm open to suggestions on using other HTML parsers if they are easier to work with.
Thanks in advance.