
jdbc - common problems/mistakes

 
olze oli
Ranch Hand
Posts: 149
Hi,

I wrote a crawler which runs fine (now). It uses a HashSet to store URLs like http://www.google.com, creates a table named after the host (www.google.com), and inserts the link together with a boolean field "isProcessed".
My problem is that the crawler gets pretty slow once it reaches ~1 million links, so I ran the profiler and found that the bottleneck is VisibleBufferedInputStream.readMore(int), which is part of the PostgreSQL JDBC driver.

I use some SELECTs to check whether the link I'm currently processing is already in the DB, to prevent duplicates. Maybe someone can give me some hints like "don't use select * from..., use select field1, field2... from...".
I have no idea how I could tune the database, and it would be really nice if it stayed as fast as it is right after the application starts.
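The duplicate check looks roughly like this (a minimal sketch, not my actual code: conn is an open java.sql.Connection, and tableNameFor() stands in for however the table name is derived from the host):

    String table = tableNameFor(host);   // made-up helper
    boolean exists;
    try (Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery(
             "SELECT 1 FROM " + table + " WHERE link = '" + link + "'")) {
        exists = rs.next();   // any row means the link is already stored
    }
    if (!exists) {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("INSERT INTO " + table
                + " (link, isProcessed) VALUES ('" + link + "', false)");
        }
    }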
 
Ulf Dittmer
Rancher
Posts: 42968
Do you have indexes in place that get used for those SELECTs? You may want to familiarize yourself with Postgres' EXPLAIN functionality, which shows you the execution plan for the query in question.
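You can even run EXPLAIN straight through JDBC and print the plan; something like this (table and column names here are just placeholders):

    try (Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery(
             "EXPLAIN ANALYZE SELECT 1 FROM www_google_com"
             + " WHERE link = 'http://www.google.com/'")) {
        while (rs.next()) {
            System.out.println(rs.getString(1)); // one plan line per row
        }
    }

If the plan shows a "Seq Scan" instead of an "Index Scan", the query isn't using an index.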
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 34973
Olze,
It's hard to give tips on a hypothetical situation. Can you share the SQL if you want comments on it? And as Ulf noted, the explain plan will show you where the time is going. I blogged about how to do this in PostgreSQL.
 
olze oli
Ranch Hand
Posts: 149
Thanks for the hint about EXPLAIN.

First I start with a link (taken from the DB or given as a command line parameter) and check whether the corresponding table exists:
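Roughly like this (a sketch; tableNameFor() is a made-up helper, and DatabaseMetaData is just one way to do the check):

    boolean tableExists;
    try (ResultSet rs = conn.getMetaData().getTables(
            null, null, tableNameFor(host), new String[] {"TABLE"})) {
        tableExists = rs.next();
    }
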
If it doesn't, I create the table:
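Roughly (the column names here are guesses based on what I described, not the exact DDL):

    try (Statement st = conn.createStatement()) {
        st.executeUpdate("CREATE TABLE " + table + " ("
            + "link text, "
            + "isBusy boolean DEFAULT false, "
            + "isDownloaded boolean DEFAULT false, "
            + "isProcessed boolean DEFAULT false)");
    }
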
Then I set the link as busy:
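Something like this (assuming the isBusy column from the sketch above):

    try (Statement st = conn.createStatement()) {
        st.executeUpdate("UPDATE " + table + " SET isBusy = true"
            + " WHERE link = '" + link + "'");
    }
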
Next I download the link and mark it as downloaded:
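Along these lines (the HTTP download itself is omitted; only the flag update is shown):

    // after the download succeeds
    try (Statement st = conn.createStatement()) {
        st.executeUpdate("UPDATE " + table + " SET isDownloaded = true"
            + " WHERE link = '" + link + "'");
    }
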
Then I extract all links from the downloaded HTML and insert them:
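Roughly, batched per page (extractedLinks, hostOf() and tableNameFor() are stand-ins, not my real names):

    try (Statement st = conn.createStatement()) {
        for (String found : extractedLinks) {
            st.addBatch("INSERT INTO " + tableNameFor(hostOf(found))
                + " (link, isProcessed) VALUES ('" + found + "', false)");
        }
        st.executeBatch(); // send the whole page's inserts as one batch
    }
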
Last, I set the processed field to true so I can exclude those links from my next query:
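Again just a sketch:

    try (Statement st = conn.createStatement()) {
        st.executeUpdate("UPDATE " + table + " SET isProcessed = true"
            + " WHERE link = '" + link + "'");
    }
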
I just read that it's possible to enable time tracking in the Postgres logs, like a verbose mode. I will try this; maybe I can find the bottleneck.
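If I understood it correctly, this is the relevant postgresql.conf setting (assuming a reasonably recent Postgres):

    # postgresql.conf
    log_min_duration_statement = 0   # log every statement with its duration in ms
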
Edit: I forgot, I also create an index after creating the table:
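Something like this (index and column names are placeholders):

    try (Statement st = conn.createStatement()) {
        st.executeUpdate("CREATE INDEX " + table + "_link_idx"
            + " ON " + table + " (link)");
    }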
 
olze oli
Ranch Hand
Posts: 149
As I cannot use PreparedStatements (the table name changes for every host), I decided to use PL/pgSQL to create procedures that don't have to be parsed every time, because I saw in the logs that a lot of time gets wasted parsing the statements, and they only change slightly (table name and link). Or is this a bad idea?
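The idea looks roughly like this (a sketch, not my exact function; quote_ident/quote_literal guard the dynamic parts, and the outer call never changes, so it can be prepared):

    String ddl =
        "CREATE OR REPLACE FUNCTION insert_link(tbl text, lnk text)"
        + " RETURNS void AS $$"
        + " BEGIN"
        + "   EXECUTE 'INSERT INTO ' || quote_ident(tbl)"
        + "     || ' (link, isProcessed) VALUES ('"
        + "     || quote_literal(lnk) || ', false)';"
        + " END; $$ LANGUAGE plpgsql";
    try (Statement st = conn.createStatement()) {
        st.executeUpdate(ddl);   // create the function once
    }
    // the call itself is static SQL, so it can be a PreparedStatement
    try (PreparedStatement ps = conn.prepareStatement("SELECT insert_link(?, ?)")) {
        ps.setString(1, table);
        ps.setString(2, link);
        ps.execute();
    }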
 
kri shan
Ranch Hand
Posts: 1478
What is the crawler? Is it an open source tool?
 
steve souza
Ranch Hand
Posts: 862
Creating a table and an index for each host seems inefficient. Also, it seems you could save an UPDATE if you marked the processed field true on insert. Another benefit of using one table for all hosts is that you could more easily use a PreparedStatement.
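For illustration, the single-table variant could look like this (schema and names are made up; note the flag is set directly on insert):

    // CREATE TABLE links (host text, link text, isProcessed boolean);
    // the statement text is the same for every host, so it can be prepared
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO links (host, link, isProcessed) VALUES (?, ?, true)")) {
        ps.setString(1, host);
        ps.setString(2, link);
        ps.executeUpdate();
    }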
 
olze oli
Ranch Hand
Posts: 149
The crawler is not open source; if it were, I would have posted the source ;)

The crawler has to handle a really large number of links and hosts. I don't think it's possible to do it any other way.
A link counts as processed when it has been downloaded and processed, not before both of these steps.
 