Retrieve HTML text element HTMLCleaner

 
Lex van Rijswijk
Greenhorn
Posts: 10
I'm trying to get some text from a website to show in an Android app. I'm using HTMLCleaner to parse the HTML. I don't have much knowledge of HTML, but the page's markup seems a bit messy to me. I've read quite a few examples and other topics, but I just can't get it to work. My code:



The part of the HTML code I'm trying to retrieve is "Welkom" and "Havana staat voor eten, drinken en dansen in een gezellige sfeer." (Dutch). Part of the HTML code:



I've tried many different values for XPATH_HOME, but every time my string "HomeTitle" comes back empty. I've tried starting XPATH_HOME from <div id="content">, but I've read that it's best to stay as close as possible to the element you want to retrieve, so that code updates and site adjustments are less likely to break the expression. So what should the XPath to the desired text be?

Hopefully you can help me out!
Thanks
 
Mario Alcantara
Greenhorn
Posts: 16
What is the complete HTML content of the page? Does it contain an <html> or <body> tag?
 
Lex van Rijswijk
Greenhorn
Posts: 10
Hello Mario,

It has a <body> tag which is inside <html xmlns="http://www.w3.org/1999/xhtml"> and </html>
What I found out is that when you load the www.havana-tilburg.nl page, it first shows an intro. After that it goes to the home page, but both have the same address. I'm not sure whether that is going to be a problem.

I also read that HTMLCleaner can't go that deep into a tree, so I've tried Jsoup and HTML Parser as well. They give the same result, though, and that is nothing.

Regards
 
Mario Alcantara
Greenhorn
Posts: 16
I've tested your XPath expression and it's correct, so I suppose the problem is the content of the page. You can inspect the content of the root TagNode using root.getText().toString(); the result must contain the following structure:

 
Lex van Rijswijk
Greenhorn
Posts: 10
Thank you for looking into it. I tried putting the content in a string before, but I don't know a good way to display the text.
Is it possible to put the string with the content into an XML file in Eclipse? Or how do I make it readable, like what you showed in your last reply?
If I put it in a string I can obviously show it in the Android app, but that isn't very useful, I guess.

Can you also please explain why you asked me whether the content was inside an HTML tag or a BODY tag?
Regards

 
Mario Alcantara
Greenhorn
Posts: 16
Hi,

I asked about the <body> and <html> tags because I thought your XPath expression was wrong, but I checked it and it's fine. The reason to print the HTML content of the requested page is to verify that it contains the expected structure. If the HTML you actually receive is very different from what you expect, the expression will never find the tag you want.
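
To make that check concrete, here is a minimal sketch. It uses the JDK's built-in javax.xml.xpath classes rather than HTMLCleaner, only so that it runs without extra libraries. The two page strings and the XPATH_TITLE expression are hypothetical stand-ins, since the real markup and your real expression are not shown in this thread. It runs the same expression against two different documents and shows that the result is empty when the expected structure is missing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathCheck {

    // Hypothetical stand-in for the home page; the real markup is not shown in this thread.
    static final String HOME_PAGE =
        "<html><body><div id=\"content\"><h1>Welkom</h1>"
        + "<p>Havana staat voor eten, drinken en dansen in een gezellige sfeer.</p>"
        + "</div></body></html>";

    // Hypothetical stand-in for an intro page served from the same address.
    static final String INTRO_PAGE =
        "<html><body><div id=\"intro\"><p>Intro</p></div></body></html>";

    // Assumed expression, for illustration only.
    static final String XPATH_TITLE = "//div[@id='content']/h1";

    // Parses well-formed XHTML and returns the text of the first match,
    // or an empty string when the expression matches nothing.
    static String evaluate(String xhtml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        return XPathFactory.newInstance().newXPath().evaluate(xpath, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        for (String page : new String[] {HOME_PAGE, INTRO_PAGE}) {
            // Step 1: look at what was actually received.
            System.out.println("Fetched content: " + page);
            // Step 2: only then judge the XPath result.
            String title = evaluate(page, XPATH_TITLE);
            System.out.println(title.isEmpty()
                    ? "XPath matched nothing"
                    : "XPath matched: " + title);
        }
    }
}
```

If the dump of what your app really downloads looks like the second string rather than the first, the expression is not the problem; the app is simply receiving a different document than the one you see in the browser.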
 
Lex van Rijswijk
Greenhorn
Posts: 10
Mario, thanks for your help. I think I'm going to try some different websites and see if I get the right information from those. I'll probably be back in a few days!
 