Extract the content (text) of a URL

 
Raihan Jamal
Ranch Hand
Posts: 86
How can I extract the text from a URL? My code is extracting the page's source code instead of just the text.




Any suggestions?
 
Ulf Dittmer
Rancher
Posts: 42972
Something like NekoHTML, TagSoup, HtmlCleaner or jTidy would be well suited for this, as would be HtmlUnit on a higher level.
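To illustrate the general idea of separating text from markup, here is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html) rather than any of the libraries named above. It is a swapped-in stand-in so that the example needs no extra dependencies; the class name HtmlTextExtractor and the method extractText are made up for this illustration, and the JDK parser is considerably less forgiving of broken markup than the dedicated libraries.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of any library mentioned in this thread.
public class HtmlTextExtractor {

    // Reads HTML from the given reader and returns only its text content.
    public static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Greater than zero while inside <script> or <style>.
            private int skipDepth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE) {
                    skipDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if ((tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE) && skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Collect text chunks, separated by a space.
                if (skipDepth == 0) {
                    text.append(data).append(' ');
                }
            }
        };

        // 'true' tells the parser to ignore charset declarations in the document.
        new ParserDelegator().parse(html, callback, true);
        return text.toString().trim();
    }
}
```

Calling HtmlTextExtractor.extractText(new StringReader(pageSource)) on the source you already download should give back the text without the tags. For real-world pages, one of the libraries listed above is likely to cope better with malformed HTML.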
 
Raihan Jamal
Ranch Hand
Posts: 86
@Ulf Dittmer, is there any way to do this with Tika? I was able to do it with a URL connection, but I am not using a URL here because the URL needs authentication.
 
Ulf Dittmer
Rancher
Posts: 42972
Not following. If the URL needs authentication, then ALL methods of access need to deal with that, so what's the difference?
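As a rough sketch of what dealing with authentication can look like, the following assumes the site uses HTTP Basic authentication, which this thread does not actually state. The class AuthenticatedFetch and its methods are hypothetical names for illustration. The resulting stream could then be handed to whichever parser or extraction library is in use.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper; assumes HTTP Basic authentication.
public class AuthenticatedFetch {

    // Builds the value of the "Authorization" header for Basic authentication.
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    // Opens the URL with credentials attached and returns the response body stream.
    // The caller is responsible for closing the stream.
    static InputStream open(String url, String user, String password) throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        conn.setRequestProperty("Authorization", basicAuthHeader(user, password));
        return conn.getInputStream();
    }
}
```

If the site uses a login form, cookies, or another scheme instead, the request setup would differ, but the point stands: once you hold an authenticated stream, the extraction step is the same.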
 
Raihan Jamal
Ranch Hand
Posts: 86
Previously, in another project, I did it this way and was able to get the content back. How can I apply the same strategy in my code?


 
Ulf Dittmer
Rancher
Posts: 42972
Using any one of the libraries I mentioned before makes it comparatively easy to access different parts of an HTML page.
 