Indexing dynamically created pages (Javascript) and Flash contents

 
Greenhorn
Posts: 5
Hello folks
I need to extract some data from complex websites whose pages are mostly generated by JavaScript, and also to index some Flash content on those sites. What approach do you suggest? I am very comfortable with Java programming: is there any web-scraping framework written in Java? Is it necessary to embed a web browser (Mozilla)?

Best regards =)
 
Sheriff
Posts: 22784
Can you read a single web page? If so, the next step is to take its contents and parse them. Search this forum, Other JSE/JEE APIs, and Java in General for information on how to parse a web page. You need all src and href attributes to start with, and possibly others as well. I once wrote a link checker that recursively checked the links on a web site, so it shares the basic principle: take a web page and retrieve all links from it. Your program just needs to download them all.
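As a rough illustration of the "retrieve all src and href attributes" step, here is a minimal sketch using only the JDK. The class name LinkExtractor and the method extractLinks are made up for this example, and the regular expression is a deliberately naive assumption: it will also match things like data-href and text inside comments or scripts, so a real HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from any library: collects href/src attribute values.
class LinkExtractor {

    // Matches href=... or src=... with double-quoted, single-quoted, or unquoted values.
    private static final Pattern LINK_ATTR = Pattern.compile(
            "(?i)\\b(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))");

    // Returns the attribute values in the order they appear in the markup.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK_ATTR.matcher(html);
        while (m.find()) {
            // Exactly one of the three capture groups is non-null per match.
            String value = m.group(1) != null ? m.group(1)
                    : m.group(2) != null ? m.group(2)
                    : m.group(3);
            links.add(value);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">x</a><img src='b.png'><script src=c.js></script>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

A crawler would then resolve each value against the page's URL and download it. Note that this only sees links present in the downloaded markup; anything a script adds later in the browser will not show up.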
 
Raffaele Sgarro
Greenhorn
Posts: 5
Unfortunately, my problem is not so simple.
My HTML document is simply a bunch of <script> tags and some <object>s. The page is generated by JavaScript in "browser space", so I need some sort of JavaScript engine capable of creating and manipulating the DOM; then I should parse the DOM objects rather than the raw HTML.
Consider a website consisting of a single HTML document, mysite.com/index.php.
All <a href="javascript:void(0)">s in that document are bound to some JavaScript function, so navigation is actually a sequence of asynchronous calls. I need an engine capable of executing that code... Mozilla? Any experience with (XUL? XPCOM?) bindings?
Also, there are pages made of a single Flash (SWF) GUI. How do I interact with it? How do I navigate through its "menus" and finally retrieve the information I need?
 