Help with selecting elements/data from an HTML document

 
Ranch Hand
Posts: 31
2
What am I trying to do?

Make a project that gets data from a Wikipedia page and uses it to instantiate a Country class object, which I will then use to insert data into a database table of Countries.

What's the problem?

I have used jsoup to extract the list of countryNames from Wikipedia. With simple string concatenation (https://en.wikipedia.org/wiki/ + countryName) I build the URL of each country page, then iterate over the pages and extract data.

The problem is that the layout of the data changes from page to page; sometimes the data is not where I expected it to be. My idea was to use WebDriver and select elements with By.xpath, but this will not work, since I have already run into a problem:

Religion in France can be found in the side table (the class="infobox geography vcard" one), but for Algeria it is not so easy: the same XPath gives me the type of government.

If I go to the religion section of the wiki page for France instead, I end up with France being a secular country rather than a list of religions in France. For Algeria, though, that section is where I find the religions, since they are not listed in the side table (the class="infobox geography vcard" one).



Any idea how I should go about this? Should I give up on WebDriver and use a library like jsoup to extract the data? Even then, I still do not see a way around the difference between pages for different countries.


 
Master Rancher
Posts: 3276
33

do not see a way around the difference between pages for different countries. 


Yes, that will be a problem. The authors of the pages were not consistent with the layouts, making it difficult for poor programmers to scrape the pages.
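
One thing that may soften the problem is to select the infobox row by its label instead of by its position, and to treat "row not found" as the signal to fall back to another part of the page. Below is a minimal sketch of that idea using only the JDK's XML and XPath classes. The two HTML strings are hypothetical, heavily simplified stand-ins for the France and Algeria pages (real Wikipedia markup is much messier and is not guaranteed to be well-formed XML), and the names InfoboxLookup and findByLabel are invented for illustration.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class InfoboxLookup {

    // Hypothetical sample: an infobox that has a "Religion" row.
    static final String FRANCE =
        "<html><body><table class=\"infobox geography vcard\">"
        + "<tr><th>Capital</th><td>placeholder capital</td></tr>"
        + "<tr><th>Religion</th><td>placeholder religion data</td></tr>"
        + "<tr><th>Government</th><td>placeholder government data</td></tr>"
        + "</table></body></html>";

    // Hypothetical sample: an infobox with no "Religion" row, so a
    // positional XPath tuned for FRANCE would land on "Government".
    static final String ALGERIA =
        "<html><body><table class=\"infobox geography vcard\">"
        + "<tr><th>Capital</th><td>placeholder capital</td></tr>"
        + "<tr><th>Government</th><td>placeholder government data</td></tr>"
        + "</table></body></html>";

    // Returns the text of the first infobox cell whose row header contains
    // the given label, or Optional.empty() if there is no such row.
    // Assumption: label contains no single quote (it is pasted into the XPath).
    static Optional<String> findByLabel(String xhtml, String label) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Match on the header text, not on the row index.
            String expr = "//table[contains(@class,'infobox')]"
                + "//tr[th[contains(normalize-space(.),'" + label + "')]]/td";
            Node cell = (Node) xpath.evaluate(expr, doc, XPathConstants.NODE);
            return cell == null
                ? Optional.empty()
                : Optional.of(cell.getTextContent().trim());
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the page", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("France:  " + findByLabel(FRANCE, "Religion"));
        // An empty result means: try the article body instead of the infobox.
        System.out.println("Algeria: " + findByLabel(ALGERIA, "Religion"));
    }
}
```

The XPath string itself does not depend on the JDK parser, so the same expression could be handed to By.xpath in WebDriver. It does not remove the inconsistency between pages; it only makes a missing row detectable, so the code can decide per country where to look next instead of silently storing the wrong field.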
 