
Understanding the sockets flow in Java

 
Guy Yafe
Greenhorn
Posts: 24
Hi All,
I am starting to write a small crawler and wanted to consult on a design issue.
More specifically, the crawler is supposed to crawl weather sites and extract weather data. The basic idea is to crawl to a weather site, extract the list of cities from it, and for each city crawl to its page and extract that specific city's weather data.
I am interested in making the crawler as asynchronous as possible, meaning a worker never waits on a blocking call. The basic design is as follows (a rough code sketch appears after the task list):
Have a thread pool of workers.
Each worker handles an async task which never blocks.

First task: "Download the main site page, and put the second task in the queue".
Second task: "If the page has been downloaded, parse the list of cities and for each city put a third task in the queue. If the page has not been downloaded yet, return this task to the queue".
Third task: "Download the city's page, and put a fourth task in the queue".
Fourth task: "If the city's page has been downloaded, parse its weather data and put it in a data structure. If the page hasn't been downloaded yet, return this task to the queue."
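
To make this concrete, here is a minimal sketch of the pipeline. The URLs, fetch, and extractCityUrls are placeholders, and the blocking downloader pool just stands in for whatever non-blocking mechanism ends up doing the actual I/O:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.*;

public class CrawlerPipeline {
    // Worker pool; its internal queue plays the role of the task queue.
    static final ExecutorService workers = Executors.newFixedThreadPool(8);
    // Downloads block a thread in this sketch; the real design would
    // replace this pool with non-blocking I/O.
    static final ExecutorService downloaders = Executors.newFixedThreadPool(8);

    public static void main(String[] args) {
        // First task: start downloading the main page, enqueue the second task.
        Future<String> mainPage = downloaders.submit(() -> fetch("https://weather.example/"));
        workers.submit(() -> parseCityList(mainPage));
    }

    // Second task: parse the city list if the page is ready, else re-queue.
    static void parseCityList(Future<String> page) {
        if (!page.isDone()) {
            workers.submit(() -> parseCityList(page)); // return to queue
            return;
        }
        try {
            for (String cityUrl : extractCityUrls(page.get())) {
                // Third task: start the city download, enqueue the fourth task.
                Future<String> cityPage = downloaders.submit(() -> fetch(cityUrl));
                workers.submit(() -> parseCityWeather(cityPage));
            }
        } catch (InterruptedException | ExecutionException e) {
            e.printStackTrace(); // failure/timeout handling omitted for now
        }
    }

    // Fourth task: parse the weather data if the page is ready, else re-queue.
    static void parseCityWeather(Future<String> page) {
        if (!page.isDone()) {
            workers.submit(() -> parseCityWeather(page)); // return to queue
            return;
        }
        // ... extract the weather data and store it ...
    }

    static String fetch(String url) throws IOException {
        try (InputStream in = new URL(url).openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    static List<String> extractCityUrls(String html) {
        return List.of(); // placeholder parser
    }
}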

Of course, some failure and timeout mechanisms will have to be implemented, but they aren't relevant yet.
This design should yield maximum CPU utilization and as little waiting as possible.

I thought of using the Java NIO package, with a SocketChannel and a Selector that will tell me when the page is ready. But what is happening under the SocketChannel's hood? Where is the downloading mechanism actually carried out?
If the HTTP call is carried out somewhere under the OS's responsibility, everything is fine: the JVM is free for the next task.
But if the JVM itself divides the HTTP request into TCP packets and handles the entire flow at the TCP layer, things are much more complicated. In order to achieve better utilization I would have to handle it myself: dividing the request into packets, carrying out the negotiation part, sending packets, receiving ACKs, receiving data and sending ACKs, reassembling packets, and closing the connection.

So the question is: how exactly does the JVM work here? Is it a good idea to treat the NIO flow, which sits above the TCP layer, as a black box, or should I look at it in finer resolution?
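
For reference, this is roughly the non-blocking flow I have in mind. It is heavily simplified: no partial-write handling, no HTTP parsing, and a single connection only:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

public class NioFetchSketch {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);
        channel.connect(new InetSocketAddress("example.com", 80));
        channel.register(selector, SelectionKey.OP_CONNECT);

        ByteBuffer buf = ByteBuffer.allocate(8192);
        while (selector.select() > 0) {
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isConnectable() && channel.finishConnect()) {
                    // Connected: send the request (a real client would loop
                    // until the buffer is fully drained), then wait for data.
                    channel.write(ByteBuffer.wrap(
                            ("GET / HTTP/1.1\r\nHost: example.com\r\n"
                             + "Connection: close\r\n\r\n")
                                    .getBytes(StandardCharsets.US_ASCII)));
                    key.interestOps(SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    buf.clear();
                    int n = channel.read(buf);
                    if (n == -1) {        // server closed: page is complete
                        channel.close();
                        return;
                    }
                    buf.flip();
                    System.out.print(StandardCharsets.UTF_8.decode(buf));
                }
            }
        }
    }
}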

Thanks,
Guy
 
Ulf Dittmer
Rancher
Posts: 42970
If this were my problem, I'd use a library like HtmlUnit for the web page access and data extraction rather than implementing all that myself.

As to the task separation, it seems that tasks 1 and 2 would be the main program rather than actual tasks, and tasks 3 and 4 would be part of the same task rather than separated into two.
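
Something along these lines (untested; the URL is made up, and method names may differ slightly between HtmlUnit versions):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class CityListFetcher {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            // Fetch the page and walk its links; HtmlUnit parses the HTML for you.
            HtmlPage page = client.getPage("https://weather.example/cities");
            for (HtmlAnchor anchor : page.getAnchors()) {
                System.out.println(anchor.getHrefAttribute() + " -> " + anchor.asText());
            }
        }
    }
}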
 
Guy Yafe
Greenhorn
Posts: 24
Task 1 involves downloading a page and thus must wait until the page is downloaded. This is why I separate it from task 2.
The same goes for the separation between tasks 3 and 4.
Task 2 doesn't involve waiting and therefore could be combined with task 3. The reason I separate them is that there are many cities to crawl, so I want to spread them across several queues.
BTW, the main program crawls many weather sites (one for each country in Europe and North America), so I treat each site as a task.

Regarding HtmlUnit: this is actually the first time I've heard of it, and it looks promising.
There are two issues. First, I wonder how much overhead there is in using an entire browser, including downloading JavaScript and CSS (though if it is GUI-less, maybe it doesn't download CSS), rendering (again, a GUI-less browser may not render the page), and running JS.
The second issue is that I want maximum performance by never waiting for anything, and I don't know if I can accomplish that with HtmlUnit.
 
Ulf Dittmer
Rancher
Posts: 42970
Combining tasks 2 and 3 doesn't make sense if I understand it correctly: only by executing task 2 is the data required to start tasks 3/4 discovered, so there's a dependency, whereas all the task 3/4 instances are independent of one another.

HtmlUnit is highly configurable as to what it does and does not download. If a site does not need JavaScript (unlikely these days, but not impossible), then that can be turned off.
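
For example (again untested; in recent versions these switches live on WebClientOptions):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class LeanClient {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(false); // skip JS download/execution
            client.getOptions().setCssEnabled(false);        // skip CSS processing
            HtmlPage page = client.getPage("https://weather.example/");
            System.out.println(page.getTitleText());
        }
    }
}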

Not sure what you mean by the second issue; maybe you can elaborate.
 
Guy Yafe
Greenhorn
Posts: 24
Exactly, and this is why I don't combine tasks 2 and 3.
Combining them would yield something like:
Take the (already) downloaded page, parse the list of cities from it (task 2), and for each city download its own page (task 3).
The current design parses the list of cities and then creates many tasks, each of which will in turn download one city's page. With this design I can also spread the city tasks across several threads.

Answering your question:
My main goal is to work in an asynchronous programming style and have the program be as concurrent as possible.
Basically, I don't want any thread to wait for any network response. This is why I separate tasks 1 and 2 (and likewise 3 and 4).
Task 1 creates a socket that will download the page and then puts task 2 in the queue. This is the waiting I want to avoid: the socket will wait a long time until the entire page is downloaded, and I want the worker to do other things in the meantime.
While the page is being downloaded, the worker will switch to other tasks, like downloading other pages. Eventually, a worker will pull the above task 2 from the queue. Task 2 will first check whether the page has already been downloaded, and if so, it will start parsing the cities and preparing task(s) 3.

Hope this explanation is better.
 