JDBC - common problems/mistakes

 
olze oli
Ranch Hand
Posts: 187
Hi,

I wrote a crawler which runs fine (now). It stores URLs like http://www.google.com in a HashSet, creates a table named after the host (www.google.com), and inserts the link along with a boolean field "isProcessed".
My problem is that the crawler gets pretty slow once I reach ~1 million links, so I ran a profiler and found that the bottleneck is VisibleBufferedInputStream.readMore(int), which is part of the PostgreSQL JDBC driver.

I use some SELECTs to check whether the link I'm currently processing is already in the database, to prevent duplicates. Maybe someone can give me hints like "don't use SELECT * FROM ..., use SELECT field1, field2 FROM ..."
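The check looks roughly like this (a sketch only - the per-host table name and the column name are simplified, and conn is an open java.sql.Connection):

    // Sketch of the duplicate check; table/column names are assumptions.
    // The table name comes from the host, so it can't be a bind parameter.
    boolean alreadyStored(Connection conn, String host, String link) throws SQLException {
        String table = host.replace('.', '_');       // www.google.com -> www_google_com
        String sql = "SELECT 1 FROM " + table
                   + " WHERE link = '" + link.replace("'", "''") + "'";
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            return rs.next();                        // any row means we already have it
        }
    }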
I have no idea how to tune the database, and it would be really nice if it ran as fast as it does right after the application starts.
 
Ulf Dittmer
Rancher
Posts: 43081
Do you have indexes in place that get used for those SELECTs? You may want to familiarize yourself with Postgres' EXPLAIN functionality, which shows you the execution plan for the query in question.
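You can even run EXPLAIN straight from JDBC and print the plan; a minimal sketch (the table and query here are just examples):

    // Each result row of EXPLAIN is one line of the execution plan.
    try (Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery(
             "EXPLAIN SELECT 1 FROM www_google_com WHERE link = 'http://www.google.com/'")) {
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
    }

If that prints a sequential scan over the whole table for every duplicate check, you've found your bottleneck.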
 
author & internet detective
Posts: 42103
Olze,
It's hard to give tips on a hypothetical situation. Can you share the SQL if you want comments on it? And as Ulf noted, the explain plan will show you where the time is going. I blogged about how to do this in PostgreSQL.
 
olze oli
Ranch Hand
Posts: 187
Thanks for the hint about EXPLAIN.

First I start with a link (either from the db or as a command-line parameter) and check whether the corresponding table exists:
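Something like this (a sketch; the name mangling is how I map hosts to tables):

    // Does a table for this host already exist? (conn is my open Connection)
    boolean tableExists(Connection conn, String host) throws SQLException {
        String table = host.replace('.', '_');
        // Here the table name is a plain value, so a bind parameter works.
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT 1 FROM information_schema.tables WHERE table_name = ?")) {
            ps.setString(1, table);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }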


If it doesn't, I create the table:
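Roughly like this (the exact column set is simplified here; isProcessed is the flag I mentioned):

    // One table per host; link is looked up constantly, the flags track progress.
    try (Statement st = conn.createStatement()) {
        st.executeUpdate("CREATE TABLE " + table + " ("
                + " link text NOT NULL,"
                + " isBusy boolean NOT NULL DEFAULT false,"
                + " isDownloaded boolean NOT NULL DEFAULT false,"
                + " isProcessed boolean NOT NULL DEFAULT false)");
    }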


Then I mark the link as busy:
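That is just an UPDATE along these lines (isBusy is an assumed column name):

    // Claim the link so it isn't picked up twice.
    try (Statement st = conn.createStatement()) {
        st.executeUpdate("UPDATE " + table + " SET isBusy = true"
                + " WHERE link = '" + link.replace("'", "''") + "'");
    }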


Next I download the link and mark it as downloaded:
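A sketch (download(...) stands for my own HTTP fetch, not a real library call):

    String html = download(link);   // fetch the page body
    try (Statement st = conn.createStatement()) {
        st.executeUpdate("UPDATE " + table + " SET isDownloaded = true"
                + " WHERE link = '" + link.replace("'", "''") + "'");
    }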


Then I extract all links from the downloaded HTML and insert them:
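A sketch (extractLinks(...) and hostOf(...) stand for my own parsing helpers):

    for (String found : extractLinks(html)) {
        String t = hostOf(found).replace('.', '_');   // each link goes into its host's table
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("INSERT INTO " + t + " (link) VALUES ('"
                    + found.replace("'", "''") + "')");
        }
    }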


Finally, I set the processed field to true so I can exclude those links from the next query:
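That is simply:

    // Done with this page, so the next SELECT can skip it.
    try (Statement st = conn.createStatement()) {
        st.executeUpdate("UPDATE " + table + " SET isProcessed = true"
                + " WHERE link = '" + link.replace("'", "''") + "'");
    }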


I just read that it's possible to enable time tracking in the Postgres logs - like a verbose mode. I'll try that; maybe I can find the bottleneck.
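If I understood the docs right, something like this should do it (it needs superuser rights; otherwise the setting can go into postgresql.conf):

    // Log every statement together with how long it took (0 = no minimum duration).
    try (Statement st = conn.createStatement()) {
        st.execute("SET log_min_duration_statement = 0");
    }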


Edit: I forgot - I also create an index after creating the table:
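Like this (sketch; the index name is made up):

    // Index the column every duplicate check filters on.
    try (Statement st = conn.createStatement()) {
        st.executeUpdate("CREATE INDEX " + table + "_link_idx"
                + " ON " + table + " (link)");
    }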

 
olze oli
Ranch Hand
Posts: 187
Since I cannot use PreparedStatements (the table name changes with every host, and identifiers can't be bind parameters), I decided to use PL/pgSQL to create procedures which don't have to be parsed all the time. I saw in the logs that a lot of time is wasted parsing the statements, and they only change slightly (table name and link). Or is this a bad idea?
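What I have in mind is roughly this (a sketch - add_link is a made-up function name, and the %I/%L formatting assumes a Postgres version that has format()):

    // Create the function once ...
    try (Statement st = conn.createStatement()) {
        st.execute(
            "CREATE OR REPLACE FUNCTION add_link(p_table text, p_link text)"
          + " RETURNS void AS $$"
          + " BEGIN"
          + "   EXECUTE format('INSERT INTO %I (link) VALUES (%L)', p_table, p_link);"
          + " END;"
          + " $$ LANGUAGE plpgsql");
    }
    // ... then every call CAN be a PreparedStatement, because the table
    // name is now an ordinary parameter value:
    try (PreparedStatement ps = conn.prepareStatement("SELECT add_link(?, ?)")) {
        ps.setString(1, "www_google_com");
        ps.setString(2, "http://www.google.com/imghp");
        ps.execute();
    }

(Though from what I read, EXECUTE re-plans the dynamic statement on every call, so I'm not sure how much parsing time this really saves - hence the question.)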
 
Ranch Hand
Posts: 1491
What is a crawler? Is it an open source tool?
 
Ranch Hand
Posts: 862
Creating a table and index for each host seems inefficient. Also, it seems you could save an update if you marked the processed field true on insert. Another benefit of using one table for all hosts is that you could more easily use a prepared statement.
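For instance, a single-table layout along these lines (a sketch with made-up names) lets you prepare the statement once and reuse it for every host:

    // CREATE TABLE links (host text NOT NULL, link text NOT NULL,
    //                     processed boolean NOT NULL DEFAULT false,
    //                     PRIMARY KEY (host, link));
    PreparedStatement insert = conn.prepareStatement(
            "INSERT INTO links (host, link) VALUES (?, ?)");   // parsed once
    for (String url : extracted) {          // 'extracted' = links pulled from the page
        insert.setString(1, hostOf(url));   // hostOf(...) is a hypothetical helper
        insert.setString(2, url);
        insert.executeUpdate();
    }

The primary key would also take care of duplicates: a second insert of the same (host, link) fails, instead of requiring a SELECT first.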
 
olze oli
Ranch Hand
Posts: 187
The crawler is not open source; if it were, I would have posted the source ;)

The crawler has to handle a very large number of links and hosts - I don't think it's possible to do it any other way. A link counts as processed only once it has been downloaded and parsed, not before those two steps.
 