Help with selecting elements/data from an HTML document

 
Ranch Hand
What am I trying to do?

Build a project that scrapes data from a Wikipedia page and uses it to instantiate a Country object, which I will then insert into a Countries database table.

What's the problem?

I have used jsoup to extract the list of country names from Wikipedia, and I build each country page's URL by simple string concatenation (https://en.wikipedia.org/wiki/ + countryName) so that I can iterate over the country pages and extract data.
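For reference, here is a minimal sketch of that URL-building step. The class and method names (WikiUrls, pageUrl) are made up for illustration. It assumes the country names come out of the list as plain display text, so spaces are turned into underscores (the form Wikipedia uses in page titles) and anything else that is unsafe in a URL is percent-encoded. URLEncoder is really meant for form data, so treat this as a rough approach rather than the definitive way to build the link.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

class WikiUrls {
    private static final String BASE = "https://en.wikipedia.org/wiki/";

    // Hypothetical helper: turns a display name into a page URL.
    static String pageUrl(String countryName) {
        // Wikipedia page titles use underscores where the display name has spaces.
        String title = countryName.trim().replace(' ', '_');
        // Percent-encode whatever is left (accents, apostrophes, ...).
        // Requires Java 10+ for the Charset overload.
        return BASE + URLEncoder.encode(title, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        for (String name : List.of("France", "Algeria", "New Zealand")) {
            System.out.println(pageUrl(name));
        }
    }
}
```

Plain concatenation works for names like France, but it produces a URL with a raw space for a name like New Zealand, which is why the sketch normalizes the name first.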

The problem is that the layout of the pages varies: sometimes the data is not where I expected it to be. My idea was to use WebDriver and select elements with By.xpath, but that will not work out, since I have already run into a problem:

The religion data for France can be found in the side table (the class="infobox geography vcard" one), but for Algeria it is not so easy, as the same XPath gives me the type of government.

But if I go to the religion section of the wiki page for France instead, I end up with "France is a secular country" rather than the list of religions in France. For Algeria, that section is where I find the religions, since they are not listed in the side table.

Any idea how I should go about this? Should I give up on WebDriver and find some library like jsoup to extract the data? Even then, I do not see a way around the differences between the pages for different countries.


 
Rancher

"... do not see a way around the difference between pages for different countries."


Yes, that will be a problem. The authors of the pages were not consistent with their layouts, which makes it difficult for us poor programmers to scrape them.
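One thing that might soften the problem, offered only as a sketch: select the infobox row by its label text instead of by its position, and treat "no such row" as a signal to fall back to another part of the page. The class, method, and constant names below (InfoboxLookup, infoboxValue, PAGE_A, PAGE_B) are hypothetical, and the two sample pages are made-up, well-formed markup with placeholder values, not real Wikipedia HTML. Real pages are not well-formed XML, so the JDK's XML parser is used here only to keep the demo self-contained. The expression itself is plain XPath 1.0, so it should also be usable with By.xpath.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class InfoboxLookup {
    // Made-up sample pages: PAGE_A has a Religion row in the infobox, PAGE_B does not.
    static final String PAGE_A = "<html><body><table class=\"infobox geography vcard\">"
            + "<tr><th>Capital</th><td>sample capital</td></tr>"
            + "<tr><th>Religion</th><td>sample religion list</td></tr>"
            + "<tr><th>Government</th><td>sample government type</td></tr>"
            + "</table></body></html>";
    static final String PAGE_B = "<html><body><table class=\"infobox geography vcard\">"
            + "<tr><th>Capital</th><td>sample capital</td></tr>"
            + "<tr><th>Government</th><td>sample government type</td></tr>"
            + "</table></body></html>";

    // Evaluates an XPath expression and returns the text of the first match ("" if none).
    static String eval(String xhtml, String expr) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expr, new InputSource(new StringReader(xhtml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or unparsable page: " + expr, e);
        }
    }

    // Label-based lookup: find the row whose header contains the label (case-sensitive).
    // Assumes the label contains no single quote, since it is pasted into the expression.
    static Optional<String> infoboxValue(String xhtml, String label) {
        String expr = "//table[contains(@class,'infobox')]//tr[th[contains(.,'" + label + "')]]/td";
        String value = eval(xhtml, expr);
        return value.isEmpty() ? Optional.empty() : Optional.of(value);
    }

    public static void main(String[] args) {
        // Position-based lookup: on PAGE_B the second row is the Government row,
        // so this silently returns the wrong kind of data.
        String byPosition = "(//table[contains(@class,'infobox')]//tr)[2]/td";
        System.out.println(eval(PAGE_A, byPosition));
        System.out.println(eval(PAGE_B, byPosition));

        // Label-based lookup: an empty result tells the caller to try a fallback,
        // such as the religion section in the article body.
        System.out.println(infoboxValue(PAGE_A, "Religion").orElse("(not in infobox, use fallback)"));
        System.out.println(infoboxValue(PAGE_B, "Religion").orElse("(not in infobox, use fallback)"));
    }
}
```

This does not remove the inconsistency between pages; it only makes a missing field show up as "missing" instead of as the wrong field, so each fallback can be handled explicitly.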
 