How to do web-scraping when fetch API is involved?

 
Greenhorn
Posts: 4
Using Java, I am trying to scrape the following website: https://pubchem.ncbi.nlm.nih.gov/compound/1869 and obtain from it the target string CNC1=C2C(=NC=N1)N(C=N2)C3C(C(C(O3)CO)O)O.  This string appears under a section called "Canonical SMILES", but neither that section nor the target string shows up when simply viewing the HTML source, because the content is loaded dynamically (JavaScript-driven, I believe) as you scroll down to it.  I tried HtmlUnit, and for a couple of days it worked, until the webmasters introduced the fetch API; now HtmlUnit fails with an error saying it doesn't recognize that API, even after I upgraded to the most recent version.  I next thought to use jsoup for this operation, but I have learned that it cannot execute JavaScript, so my question is:

How can I scrape the above website to collect the string I am looking for?  Furthermore, I will need to fetch additional items from another section of the same page, called "Computed Properties", which appears to be in table format.  Are there any recommendations for specific tools that can handle this scenario?  They must be headless (no GUI).
 
Tim Moores
Saloon Keeper
Posts: 6389
The site has a bunch of download options for various formats - does none of those contain the data you're looking for in an easier-to-handle format?
 
Oscar Bastidas
Greenhorn
Posts: 4
I frankly did not know there was a downloadable version of the page, but I see the download button now.  The problem is that I can't even load the page with the aforementioned HtmlUnit, and jsoup doesn't handle the JavaScript side of it, so I can't summon the page programmatically to click the "Download" button.  Sorry if the question seems naive; my background in Java is number crunching, and this is my first foray into the web side of things.  Thanks for the reply.
 
Oscar Bastidas
Greenhorn
Posts: 4

Tim Moores wrote:The site has a bunch of download options for various formats - does none of those contain the data you're looking for in an easier-to-handle format?



I do indeed see how I can append a string to the end of the original URL to obtain a plain HTML page containing the XML.  Thanks for pointing out that the page is downloadable; what I need is exactly there.  My only remaining challenge is doing all of this programmatically: searching the database through their search textbox for the chemical entry (the search takes a chemical name in English and returns a number in the URL that uniquely corresponds to that name), and then parsing the resulting HTML page for the data I want (the latter feels like the easier part with my present knowledge of Java).  With this development, I'll try again to see whether I can execute the search using HtmlUnit.  Thanks again!
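For what it's worth, PubChem also exposes a plain REST interface (PUG REST) that skips the HTML page entirely: one URL resolves an English chemical name to its compound ID, another returns a single computed property as plain text.  The URL patterns below are my reading of that interface, so treat them as assumptions to verify against PubChem's docs.  A minimal sketch in plain Java, no HtmlUnit or jsoup required:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PubChemLookup {

    // Assumed PUG REST pattern: resolve an English chemical name to its CID.
    static String cidUrl(String name) throws Exception {
        return "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
                + URLEncoder.encode(name, "UTF-8") + "/cids/TXT";
    }

    // Assumed PUG REST pattern: fetch one computed property for a CID as text.
    static String propertyUrl(long cid, String property) {
        return "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"
                + cid + "/property/" + property + "/TXT";
    }

    // Plain HTTP GET with the JDK; these endpoints return one value per line.
    static String fetch(String url) throws Exception {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(url).openStream(), StandardCharsets.UTF_8))) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws Exception {
        // e.g. fetch(cidUrl("adenine")), then fetch(propertyUrl(cid, "CanonicalSMILES"))
        System.out.println(cidUrl("adenine"));
        System.out.println(propertyUrl(1869, "CanonicalSMILES"));
    }
}
```

If those endpoints hold up, the whole name-to-SMILES lookup is two GET requests and no scraping at all.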
 
Tim Holloway
Saloon Keeper
Posts: 22001
Actually, I do my page scraping in Python. There's a package called Beautiful Soup that is very useful for picking apart HTML and XML.

I'm not happy with how they did their download button there. There is a time and a place for clever, and this isn't it. A simple old-style button-shaped hyperlink would be much more appropriate.
 
Oscar Bastidas
Greenhorn
Posts: 4

Tim Holloway wrote:Actually, I do my page scraping in Python. There's a package called Beautiful Soup that is very useful for picking apart HTML and XML.

I'm not happy with how they did their download button there. There is a time and a place for clever, and this isn't it. A simple old-style button-shaped hyperlink would be much more appropriate.

Thanks for your reply!  As it turns out, adding a string of characters in the middle and at the end of the URL summons a straightforward web page with the info I'm chasing.  This means I don't have to fiddle with their "Download" button, but I still need to execute a search (I can only imagine their "Search" button might be just as convoluted).

Out of curiosity, based on how the "Download" button looks, would you judge it to be a relatively straightforward process to submit a download request using Beautiful Soup?
 
Tim Holloway
Saloon Keeper
Posts: 22001
Beautiful Soup doesn't download anything; it relies on Python's stock HTTP download functions. All Beautiful Soup does is digest the downloaded data into a DOM that can be navigated and plundered. You can, of course, do the same with Java's own DOM services, but the Java version is messier and requires the extra compile-and-build steps, which I prefer to skip when trying out different ways to dissect a document.
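To illustrate the Java-side alternative mentioned above, here is a minimal sketch of digesting downloaded XML with the JDK's own DOM services; the record shape in the example is invented for illustration, so a real PubChem download will differ:

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class DomSketch {

    // Parse an in-memory XML string into a DOM tree and return the text
    // content of the first element with the given tag name.
    static String firstTag(String xml, String tag) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical record shape, for illustration only.
        String xml = "<Record><CanonicalSMILES>"
                + "CNC1=C2C(=NC=N1)N(C=N2)C3C(C(C(O3)CO)O)O"
                + "</CanonicalSMILES></Record>";
        System.out.println(firstTag(xml, "CanonicalSMILES"));
    }
}
```

It works, but as noted, each tweak means another compile-and-build round trip, which is where an interpreted language has the edge for this kind of exploratory dissection.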
 