
Help with selecting elements/data from an HTML document

 
Ranch Hand
Posts: 31
2
What am I trying to do?

I am building a project that gets data from a Wikipedia page and uses it to instantiate a Country class object, which I will then use to insert data into a database table of Countries.

What's the problem?

I have used jsoup to extract the list of country names from Wikipedia, and with simple string concatenation (https://en.wikipedia.org/wiki/ + countryName) I iterate over the country pages and extract data.
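Plain concatenation works for one-word names, but names with spaces or non-ASCII characters need some care. A minimal sketch of building the page URL, assuming a hypothetical helper class WikiUrl (not part of my project) and relying on Wikipedia page titles using underscores in place of spaces:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class WikiUrl {
    // Base URL taken from the concatenation described above.
    static final String BASE = "https://en.wikipedia.org/wiki/";

    /** Builds the page URL for a country name (hypothetical helper). */
    static String pageUrl(String countryName) {
        // Wikipedia titles use underscores where the display name has spaces.
        String title = countryName.trim().replace(' ', '_');
        // Percent-encode anything that is not URL-safe (accents, apostrophes).
        // Requires Java 10+ for the Charset overload.
        String encoded = URLEncoder.encode(title, StandardCharsets.UTF_8);
        return BASE + encoded;
    }
}
```

The resulting string can then be passed to whatever fetches the page, for example jsoup's Jsoup.connect(url).get().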

The problem I have is that the layout of the pages varies, so the data is sometimes not where I expect it to be. My idea was to use WebDriver and select elements with By.xpath, but this will not work out, since I have already run into a problem:

Religion in France can be found in the side table (the class="infobox geography vcard" one), but for Algeria it is not so easy: the same XPath gives me the type of government instead.

But if I go to the religion section of the wiki page instead, for France I end up with France being a secular country rather than the list of religions in France. For Algeria, that section is where I find the religions, since they are not listed in the side table (the class="infobox geography vcard" one).



Any idea how I should go about this? Should I give up on WebDriver and use a library like jsoup to extract the data? Even then, I do not see a way around the differences between the pages for different countries.


 
Rancher
Posts: 2240
28
"I do not see a way around the difference between pages for different countries."

Yes, that will be a problem. The authors of the pages were not consistent with the layouts, which makes it difficult for poor programmers to scrape the pages.
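One way to cope with the inconsistency, offered only as a sketch, is to try several lookups in order and keep the first one that returns something: for example, look for a row labelled "Religion" in the infobox (by its label rather than by its position), and fall back to the religion section if that row is missing. The class FallbackExtractor and the method firstMatch below are hypothetical names, and the page type is left generic so the strategies could wrap jsoup or WebDriver lookups:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

public class FallbackExtractor {
    /**
     * Tries each strategy in order and returns the first non-blank result.
     * T is whatever represents a loaded page (hypothetical; for instance a
     * parsed document object from the scraping library in use).
     */
    static <T> Optional<String> firstMatch(
            T page, List<Function<T, Optional<String>>> strategies) {
        for (Function<T, Optional<String>> strategy : strategies) {
            Optional<String> result = strategy.apply(page);
            // Skip strategies that found nothing or only whitespace.
            if (result.isPresent() && !result.get().isBlank()) {
                return result;
            }
        }
        // No layout matched; the caller can log the page for manual review.
        return Optional.empty();
    }
}
```

This does not remove the problem. It only makes each page layout a separate, testable strategy, and pages that match none of them can be collected and handled by hand.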
 