At molescrape I am currently storing all crawling results in a PostgreSQL database. This was a conscious decision after I had experienced the mess you can get into when you pick the supposedly best tool for every part of your application and end up with ten different services you have to maintain and connect (which is also a form of premature optimization). Thus, I decided to store everything in PostgreSQL until I find that some part cannot be handled in PostgreSQL anymore. Only at that point would I pull out this one component into another solution.

With this approach I am currently working with 100 GB of crawling data in one table, and it is growing each day. I am planning to move old data out at some point, but my assumption is that I will be able to work with up to 1 TB of data in PostgreSQL without any trouble.

I expect that during this journey I will learn a few things worth knowing about working with large tables, which I want to share in several blog posts. In this first one we will have a look at PostgreSQL's EXPLAIN command and at server side cursors. Partitioning of large tables will probably also become a topic in the future.

The EXPLAIN command

PostgreSQL has an EXPLAIN command which you can put in front of a query. It will then show you the execution plan for this query.

For example, in one of my post-processing scripts I was querying for the latest data from all spiders of one specific project. Even without retrieving all the data (thanks to server side cursors, which we will look at later in this post) my query was quite slow. I used EXPLAIN to check how the query was executed:

EXPLAIN SELECT payload FROM skyscraper_spiders_results
WHERE spider_id IN
(
    SELECT spider_id FROM skyscraper_spiders s
    JOIN projects p ON s.project_id = p.project_id
    WHERE p.name = 'my-project'
)
AND crawl_date > NOW() at time zone 'utc' - interval '30 day'

PostgreSQL showed me this execution plan:

                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=7234.29..29119.88 rows=9304 width=415)
   ->  HashAggregate  (cost=23.31..23.32 rows=1 width=4)
         Group Key: s.spider_id
         ->  Nested Loop  (cost=4.31..23.30 rows=1 width=4)
               ->  Index Scan using uq_name on projects p  (cost=0.14..8.16 rows=1 width=4)
                     Index Cond: ((name)::text = 'my-project'::text)
               ->  Bitmap Heap Scan on skyscraper_spiders s  (cost=4.17..15.10 rows=4 width=8)
                     Recheck Cond: (project_id = p.project_id)
                     ->  Bitmap Index Scan on skyscraper_spiders_project_id_idx  (cost=0.00..4.17 rows=4 width=0)
                           Index Cond: (project_id = p.project_id)
   ->  Bitmap Heap Scan on skyscraper_spiders_results  (cost=7210.98..29003.53 rows=9304 width=419)
         Recheck Cond: ((spider_id = s.spider_id) AND (crawl_date > (timezone('utc'::text, now()) - '30 days'::interval)))
         ->  BitmapAnd  (cost=7210.98..7210.98 rows=9304 width=0)
               ->  Bitmap Index Scan on skyscraper_spiders_results_spider_id_idx  (cost=0.00..2681.34 rows=180939 width=0)
                     Index Cond: (spider_id = s.spider_id)
               ->  Bitmap Index Scan on idx_spiders_results_crawl_date  (cost=0.00..4468.92 rows=232597 width=0)
                     Index Cond: (crawl_date > (timezone('utc'::text, now()) - '30 days'::interval))

You don’t have to understand the whole query plan for now, but for each action in this query plan you see PostgreSQL’s estimations in parentheses. The first value of the cost is the start-up cost estimation and the second value is the total cost estimation (to retrieve all rows). rows is the estimated number of rows returned and width is the estimated average size of a row in bytes. The costs are not given in seconds, but in “arbitrary units determined by the planner’s cost parameters” (docs).

We can see that there is a relatively high start-up cost for our query (7234.29), which is caused by the BitmapAnd (start-up cost 7210.98). This BitmapAnd is required because the table does not have a combined index over spider_id and crawl_date. Each of the two fields has its own index, so PostgreSQL has to combine the two index scans with a BitmapAnd operation.
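We can mitigate this by creating the missing combined index. As a minimal sketch, this is how it could be done from a psycopg2 script; the connection string is a placeholder and the exact index definition is my assumption based on the index name that appears in the query plan below:

import psycopg2

# Placeholder connection string, adjust to your environment
conn = psycopg2.connect("dbname=skyscraper")

with conn.cursor() as cursor:
    # Combined index over both columns used in the WHERE clause, so that
    # PostgreSQL no longer has to merge two separate indexes with a BitmapAnd
    cursor.execute("""
        CREATE INDEX skyscraper_spiders_results_spider_id_crawl_date_idx
            ON skyscraper_spiders_results (spider_id, crawl_date)
    """)

conn.commit()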

With this combined index in place, the query plan is:

                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=258.97..22144.57 rows=9304 width=415)
   ->  HashAggregate  (cost=23.31..23.32 rows=1 width=4)
         Group Key: s.spider_id
         ->  Nested Loop  (cost=4.31..23.30 rows=1 width=4)
               ->  Index Scan using uq_name on projects p  (cost=0.14..8.16 rows=1 width=4)
                     Index Cond: ((name)::text = 'my-project'::text)
               ->  Bitmap Heap Scan on skyscraper_spiders s  (cost=4.17..15.10 rows=4 width=8)
                     Recheck Cond: (project_id = p.project_id)
                     ->  Bitmap Index Scan on skyscraper_spiders_project_id_idx  (cost=0.00..4.17 rows=4 width=0)
                           Index Cond: (project_id = p.project_id)
   ->  Bitmap Heap Scan on skyscraper_spiders_results  (cost=235.67..22028.22 rows=9304 width=419)
         Recheck Cond: ((spider_id = s.spider_id) AND (crawl_date > (timezone('utc'::text, now()) - '30 days'::interval)))
         ->  Bitmap Index Scan on skyscraper_spiders_results_spider_id_crawl_date_idx  (cost=0.00..233.34 rows=9304 width=0)
               Index Cond: ((spider_id = s.spider_id) AND (crawl_date > (timezone('utc'::text, now()) - '30 days'::interval)))

This gives us a much better start-up cost (258.97 instead of 7234.29), while the total cost stays in the same range. This is fine for me, because I will then iterate over the results with a PostgreSQL cursor and perform some processing on each row anyway.

Server Side Cursors

With PostgreSQL server side cursors you can iterate over results in batches of a fixed number of rows per batch, instead of fetching everything at once. If you are using Python with psycopg2, you can use server side cursors directly by creating a named cursor instead of an unnamed one. The psycopg2 documentation describes it like this:

If name is specified, the returned cursor will be a server side cursor (also known as named cursor). Otherwise it will be a regular client side cursor.

With server side cursors you can effectively go through a large number of results without loading them all at once from the database. However, you are not allowed to commit on the connection of your server side cursor while you are still iterating over it. Otherwise, psycopg2 will complain that the server side cursor is not valid anymore. This means that you have to postpone commits on that connection until you have iterated the whole cursor.
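As a minimal sketch of how this could look (the connection string, the cursor name and the process() function are placeholders, and the query is a simplified version of the one above):

import psycopg2

# Placeholder connection string, adjust to your environment
conn = psycopg2.connect("dbname=skyscraper")

# Passing a name turns this into a server side (named) cursor
cursor = conn.cursor(name="results_cursor")
# Number of rows psycopg2 fetches from the server per round trip
cursor.itersize = 1000

cursor.execute("""
    SELECT payload FROM skyscraper_spiders_results
    WHERE crawl_date > NOW() at time zone 'utc' - interval '30 day'
""")

for (payload,) in cursor:
    process(payload)  # placeholder for the per-row processing

cursor.close()
# Commit only after the whole cursor has been iterated (and closed);
# a commit during iteration would invalidate the named cursor
conn.commit()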

I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to blog@stefan-koch.name.