Parsing HTML using Java

 
Greenhorn
Posts: 11
Hi,

I need to parse an HTML page and pull the text out of a specific HTML tag. This is the first time I am working on this. I am able to read the tags and their ids, and also the complete text of the page, but I have no idea how to read only the text enclosed in a specific tag. I have written my code below. I want to grab only the text within <td id="dept1">Sales</td>, i.e., "Sales" in this case. Please help me.





--
Mazhar

[ October 09, 2008: Message edited by: Mazhar Ismail ]

 
Rancher
Posts: 43081
I would guess that you need to override the "handleText" method.
 
Mazhar Ismail
Greenhorn
Posts: 11
I tried overriding it, but I guess I am doing it wrong. Is there an example of how to do it?

Thanks,
Mazhar
 
Ulf Dittmer
Rancher
Posts: 43081
How did you try (post the relevant code excerpt)? Was the method called? If so, what values did the parameters have?
 
Ranch Hand
Posts: 1179
An HTML page is basically an XML document, so you could try parsing the HTML page using a DOM or SAX parser.

Java API for XML Code Samples
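For illustration, here is a minimal sketch of the DOM route, assuming the page really is well-formed XML (for example XHTML). The class name XhtmlDeptReader and the method textOfId are made up for this example, and the id is assumed not to contain quote characters. It uses only the JDK's built-in JAXP and XPath classes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: reads the text of a <td> with a given id
// from a well-formed (XML-conformant) page.
class XhtmlDeptReader {

    static String textOfId(String xhtml, String id) throws Exception {
        // Build a DOM tree; this throws a SAXException if the markup is not well-formed XML.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // Select the <td> whose id attribute matches and return its string value.
        // Assumption: id contains no quote characters.
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("//td[@id='" + id + "']", doc);
    }

    public static void main(String[] args) throws Exception {
        String page = "<table><tr><td id=\"dept1\">Sales</td></tr></table>";
        System.out.println(textOfId(page, "dept1"));
    }
}
```

This only works when every tag is closed and every attribute is quoted; ordinary HTML pages will often make the XML parser fail.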
 
Sheriff
Posts: 22815
Just remember that handleText is not required to handle all the text in a node in one go. Use StringBuilder to combine it; you can finish it in the handleEndTag method.
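A minimal sketch of that idea, assuming the original code used the Swing HTML parser (javax.swing.text.html.HTMLEditorKit.ParserCallback with ParserDelegator); the original code was not preserved in the thread, so the class name TagTextExtractor and the method extract are hypothetical. The callback starts collecting when it sees a td with the wanted id, appends every handleText chunk to a StringBuilder, and finishes in handleEndTag.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback: collects the text inside <td id="..."> for one id.
class TagTextExtractor extends HTMLEditorKit.ParserCallback {
    private final String targetId;
    private final StringBuilder text = new StringBuilder();
    private boolean inside;   // true while we are between the wanted <td> and its </td>
    private String result;    // stays null if the id never appears

    TagTextExtractor(String targetId) {
        this.targetId = targetId;
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TD && targetId.equals(attrs.getAttribute(HTML.Attribute.ID))) {
            inside = true;
            text.setLength(0);
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // May be called more than once per cell, so append rather than assign.
        if (inside) {
            text.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (inside && tag == HTML.Tag.TD) {
            inside = false;
            result = text.toString();
        }
    }

    static String extract(Reader html, String id) throws IOException {
        TagTextExtractor callback = new TagTextExtractor(id);
        new ParserDelegator().parse(html, callback, true); // true: ignore charset directives
        return callback.result;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><table><tr><td id=\"dept1\">Sales</td></tr></table></body></html>";
        System.out.println(extract(new StringReader(page), "dept1"));
    }
}
```

The sketch does not handle a table nested inside the wanted cell; that would need a depth counter instead of a boolean flag.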
 
Rob Spoor
Sheriff
Posts: 22815

Originally posted by Rene Larsen:
A HTML page is basically a XML document


If you're lucky. HTML tolerates improperly nested tags, missing end tags, missing quotes around attributes, and much more that is not allowed in XML.

That's why XHTML was invented. It's basically HTML that truly is XML. For instance, it requires <br> to be closed: <br />.
[ October 10, 2008: Message edited by: Rob Prime ]
 