Help with selecting elements/data from an HTML document

 
Ranch Hand
What am I trying to do?

Build a project that pulls data from a Wikipedia page and uses it to instantiate a Country object, which I will then insert into a database table of Countries.

What's the problem?

I have used jsoup to extract the list of country names from Wikipedia, and I build each page URL with simple string concatenation (https://en.wikipedia.org/wiki/ + countryName) so that I can iterate over the country pages and extract data.
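The URL-building step described above could be sketched roughly as follows. This is only an illustration: the class name `WikiUrls` and the helper `pageUrl` are made up for the example, and it assumes that replacing spaces with underscores and percent-encoding the rest is enough to reach the right article, which may not hold for every country name.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class WikiUrls {
    static final String BASE = "https://en.wikipedia.org/wiki/";

    // Hypothetical helper: builds the article URL for one country name.
    static String pageUrl(String countryName) {
        // Wikipedia titles use underscores where the display name has spaces.
        String title = countryName.trim().replace(' ', '_');
        // Percent-encode whatever is left (accents, apostrophes) as UTF-8.
        return BASE + URLEncoder.encode(title, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(pageUrl("France"));
        System.out.println(pageUrl("United States"));
    }
}
```

Plain concatenation works for single-word names such as France; the extra steps only matter once a name contains spaces or non-ASCII characters.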

The problem is that the layout of the pages varies, so the data is sometimes not where I expect it to be. My idea was to use WebDriver and select elements with By.xpath, but that will not work as it stands, since I have already run into a problem:

Religion in France can be found in the side table (the class="infobox geography vcard" one), but for Algeria it is not so easy, as the same XPath gives me the type of government.

But if I go to the religion section of the wiki page for France, I end up with France being a secular country instead of getting the list of religions in France. For Algeria, on the other hand, that section is where I find the religions, since they are not listed in the side table (the class="infobox geography vcard" one).



Any idea how I should go about this? Should I give up on WebDriver and find some library like jsoup to extract the data? Even then, I do not see a way around the differences between pages for different countries.


 
Rancher

do not see a way around the difference between pages for different countries.  


Yes, that will be a problem. The authors of the pages were not consistent with the layouts, which makes it difficult for poor programmers to scrape the pages.
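One way to soften the inconsistency, offered only as a sketch, is to look up an infobox row by its label text instead of by its position, and to treat a missing row as a signal to fall back to another part of the page. The example below uses the JDK's built-in XPath on a small well-formed snippet; the class `InfoboxLookup`, its methods, and the sample markup are all made up for illustration, and it assumes the infobox rows have the shape `<tr><th>Label</th><td>value</td></tr>`. Real Wikipedia pages are HTML rather than XML, so in practice a similar expression would have to go through By.xpath in WebDriver or an HTML parser such as jsoup.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class InfoboxLookup {
    // Matches the <td> of the infobox row whose <th> contains the label,
    // wherever that row sits in the table. %s is filled with the label.
    static final String ROW_BY_LABEL =
        "//table[contains(@class,'infobox')]//tr[th[contains(normalize-space(.),'%s')]]/td";

    // Returns the row's text, or empty if the infobox has no such row.
    // The label must not contain a single quote (it is pasted into the XPath).
    static Optional<String> valueFor(Document page, String label) {
        try {
            String text = XPathFactory.newInstance().newXPath()
                .evaluate(String.format(ROW_BY_LABEL, label), page).trim();
            return text.isEmpty() ? Optional.empty() : Optional.of(text);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad label: " + label, e);
        }
    }

    // Parses a well-formed markup string (stand-in for a fetched page).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed markup", e);
        }
    }

    public static void main(String[] args) {
        // Made-up snippets: one infobox with a Religion row, one without.
        Document withRow = parse("<html><body><table class=\"infobox geography vcard\">"
            + "<tr><th>Government</th><td>placeholder government</td></tr>"
            + "<tr><th>Religion</th><td>placeholder religions</td></tr>"
            + "</table></body></html>");
        Document withoutRow = parse("<html><body><table class=\"infobox geography vcard\">"
            + "<tr><th>Government</th><td>placeholder government</td></tr>"
            + "</table></body></html>");
        System.out.println(valueFor(withRow, "Religion"));
        // Empty result: the caller would try the article's religion section next.
        System.out.println(valueFor(withoutRow, "Religion"));
    }
}
```

An empty result does not solve the inconsistency by itself, but it at least makes the "not in the side table" case explicit, so a second strategy (for example, reading the religion section of the article) can be tried per country instead of silently picking up the wrong row.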
 