Help with selecting elements/data from a html document.

 
Ranch Hand
Posts: 31
What am I trying to do?

Make a project that gets data from a Wikipedia page and uses it to instantiate a Country class object, which I will then use to insert data into a database table of countries.

What's the problem?

I have used jsoup to extract the list of country names from Wikipedia, and with simple string concatenation (https://en.wikipedia.org/wiki/ + countryName) I iterate over the country pages and extract data.

The problem is that the layout of the pages varies, so sometimes the data is not where I expected it to be. My idea was to use WebDriver and select elements with By.xpath, but this will not work out, since I have already run into a problem:

Religion in France can be found in the side table (the class="infobox geography vcard" one), but for Algeria it is not so easy, as the same XPath would give me the type of government.

But if I go to the religion section of the wiki page for France, I end up with France being a secular country instead of getting the list of religions in France. For Algeria, that section is where I will find the religions, since they are not listed in the side table (the class="infobox geography vcard" one).



Any idea how I should go about this? Should I give up on WebDriver and find some library like jsoup to extract the data? Still, I do not see a way around the differences between pages for different countries.


 
Rancher
Posts: 3445

"do not see a way around the difference between pages for different countries."


Yes, that will be a problem. The authors of the pages were not consistent with the layouts, making it difficult for poor programmers to scrape the pages.
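
One thing that may soften the problem, though it will not remove it: select the infobox row by its header text instead of by its position. The sketch below uses only the JDK's XPath API on two small made-up, well-formed snippets (not real Wikipedia markup). The class name InfoboxLookup, the method findByLabel, and the placeholder cell values are all hypothetical. The same XPath expression could probably be passed to By.xpath in WebDriver, but that is an assumption to check against the real pages.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class InfoboxLookup {

    // Hypothetical helper: returns the text of the <td> that shares a row with
    // a <th> containing `label`, or Optional.empty() if no such row exists.
    // Assumes `xhtml` is well-formed XML and `label` contains no single quote.
    static Optional<String> findByLabel(String xhtml, String label) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Match on the header text, not on the row's index in the table.
            String expr = "//table[contains(@class,'infobox')]"
                    + "//tr[th[contains(normalize-space(.), '" + label + "')]]/td";
            String value = XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc).trim();
            return value.isEmpty() ? Optional.empty() : Optional.of(value);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query snippet", e);
        }
    }

    public static void main(String[] args) {
        // Made-up snippets: one infobox has a Religion row, the other does not.
        String withReligion = "<table class='infobox geography vcard'>"
                + "<tr><th>Government</th><td>placeholder government text</td></tr>"
                + "<tr><th>Religion</th><td>placeholder religion text</td></tr>"
                + "</table>";
        String withoutReligion = "<table class='infobox geography vcard'>"
                + "<tr><th>Capital</th><td>placeholder capital text</td></tr>"
                + "<tr><th>Government</th><td>placeholder government text</td></tr>"
                + "</table>";

        System.out.println(findByLabel(withReligion, "Religion"));
        System.out.println(findByLabel(withoutReligion, "Religion"));
    }
}
```

When the label is missing, the lookup returns an empty Optional rather than whatever happens to sit in that row, so the caller can tell the infobox has no such entry and fall back to another part of the page (for example the religion section of the article). It does not help when pages use different wording for the same header.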
 