Lecture 08¶

In [1]:
# Run this cell to set up imports
import numpy as np
import pandas as pd

Load in the IMDB Performance database¶

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! So if you run the below lines, please also remember to delete the imdb_perf_lecture afterwards to save space on your limited postgreSQL server.

We assume you have the associated lecture folder lec06 pulled into your repo already. The below commands create a symbolic link (i.e., shortcut/redirect with ln) to this lecture data directory, allowing some space saving, and unzip the database file.

In [2]:
!ln -sf ../../lec/lec06/data .
!unzip -u data/imdb_perf_lecture.zip -d data/
Archive:  data/imdb_perf_lecture.zip
In [3]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f data/imdb_perf_lecture.sql
DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 845888
COPY 2211936
COPY 656453
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE

Start jupysql¶

In [4]:
%reload_ext sql
In [6]:
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture
Connecting to 'postgresql://127.0.0.1:5432/imdb_perf_lecture'

If you're having trouble seeing the entirety of query plans, you can run the following cell to set the limit on displayed rows to 20. Careful: Do not set this to None and run the actual queries; SQL will return millions of rows and crash your kernel!

In [5]:
# run this cell to remove 10-row limit on display
%config SqlMagic.displaylimit = 20

Matching¶

It is much easier to see query plans in psql!
jupysql dataframe visualization removes any whitespace.

You can also run (after each cell):

result = __.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
In [40]:
def printplans(x):
    result = x.DataFrame()
    result.style.set_properties(**{'text-align': 'left'})
    print(result)
In [41]:
%%sql
/* 1 */
EXPLAIN ANALYZE SELECT id FROM actors
WHERE id > 4000000 AND
name='Tom Hanks';
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
8 rows affected.
Out[41]:
QUERY PLAN
Gather (cost=1000.00..11512.90 rows=1 width=4) (actual time=24.124..26.703 rows=0 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -> Parallel Seq Scan on actors (cost=0.00..10512.80 rows=1 width=4) (actual time=21.578..21.579 rows=0 loops=3)
        Filter: ((id > 4000000) AND (name = 'Tom Hanks'::text))
        Rows Removed by Filter: 281963
Planning Time: 0.065 ms
Execution Time: 26.756 ms
In [42]:
printplans(__)
                                          QUERY PLAN
0  Gather  (cost=1000.00..11512.90 rows=1 width=4...
1                                 Workers Planned: 2
2                                Workers Launched: 2
3    ->  Parallel Seq Scan on actors  (cost=0.00....
4          Filter: ((id > 4000000) AND (name = 'T...
5                     Rows Removed by Filter: 281963
6                            Planning Time: 0.065 ms
7                          Execution Time: 26.756 ms
In [43]:
%%sql
/* 2 */
EXPLAIN ANALYZE
SELECT id FROM actors
ORDER BY name;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
6 rows affected.
Out[43]:
QUERY PLAN
Sort (cost=114312.51..116427.23 rows=845888 width=18) (actual time=1947.844..2768.135 rows=845888 loops=1)
  Sort Key: name
  Sort Method: external merge Disk: 23672kB
  -> Seq Scan on actors (cost=0.00..13684.88 rows=845888 width=18) (actual time=0.056..55.089 rows=845888 loops=1)
Planning Time: 0.051 ms
Execution Time: 2815.284 ms
In [44]:
printplans(__)
                                          QUERY PLAN
0  Sort  (cost=114312.51..116427.23 rows=845888 w...
1                                     Sort Key: name
2         Sort Method: external merge  Disk: 23672kB
3    ->  Seq Scan on actors  (cost=0.00..13684.88...
4                            Planning Time: 0.051 ms
5                        Execution Time: 2815.284 ms
In [45]:
%%sql
/* 3 */
EXPLAIN ANALYZE
SELECT id FROM actors
ORDER BY id
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
5 rows affected.
Out[45]:
QUERY PLAN
Limit (cost=0.42..0.68 rows=10 width=4) (actual time=0.022..0.024 rows=10 loops=1)
  -> Index Only Scan using actor_pkey on actors (cost=0.42..21980.74 rows=845888 width=4) (actual time=0.021..0.022 rows=10 loops=1)
        Heap Fetches: 0
Planning Time: 0.076 ms
Execution Time: 0.037 ms
In [46]:
printplans(__)
                                          QUERY PLAN
0  Limit  (cost=0.42..0.68 rows=10 width=4) (actu...
1    ->  Index Only Scan using actor_pkey on acto...
2                                    Heap Fetches: 0
3                            Planning Time: 0.076 ms
4                           Execution Time: 0.037 ms

Two-table demo: LIMIT¶

Let's join two tables, actors and cast_info. The query planner selects a hash join:

In [47]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info
WHERE actors.id = cast_info.person_id;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
8 rows affected.
Out[47]:
QUERY PLAN
Hash Join (cost=29215.48..89168.21 rows=2211936 width=26) (actual time=171.285..1446.736 rows=2211936 loops=1)
  Hash Cond: (cast_info.person_id = actors.id)
  -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.069..139.423 rows=2211936 loops=1)
  -> Hash (cost=13684.88..13684.88 rows=845888 width=18) (actual time=171.049..171.050 rows=845888 loops=1)
        Buckets: 65536 Batches: 16 Memory Usage: 3114kB
        -> Seq Scan on actors (cost=0.00..13684.88 rows=845888 width=18) (actual time=0.014..52.583 rows=845888 loops=1)
Planning Time: 0.152 ms
Execution Time: 1513.363 ms
In [48]:
printplans(__)
                                          QUERY PLAN
0  Hash Join  (cost=29215.48..89168.21 rows=22119...
1       Hash Cond: (cast_info.person_id = actors.id)
2    ->  Seq Scan on cast_info  (cost=0.00..31907...
3    ->  Hash  (cost=13684.88..13684.88 rows=8458...
4          Buckets: 65536  Batches: 16  Memory Us...
5          ->  Seq Scan on actors  (cost=0.00..13...
6                            Planning Time: 0.152 ms
7                        Execution Time: 1513.363 ms



Below, we add LIMIT. Note the query planner switches to a nested loop join, using an index scan to match cast_info.person_id to the indexed attribute actors.id! This results in a 10,000x speedup!

In [49]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info
WHERE actors.id = cast_info.person_id
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
11 rows affected.
Out[49]:
QUERY PLAN
Limit (cost=0.43..4.49 rows=10 width=26) (actual time=0.085..0.227 rows=10 loops=1)
  -> Nested Loop (cost=0.43..895896.13 rows=2211936 width=26) (actual time=0.084..0.225 rows=10 loops=1)
        -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.051..0.052 rows=10 loops=1)
        -> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.017..0.017 rows=1 loops=10)
              Cache Key: cast_info.person_id
              Cache Mode: logical
              Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB
              -> Index Scan using actor_pkey on actors (cost=0.42..0.46 rows=1 width=18) (actual time=0.020..0.020 rows=1 loops=8)
                    Index Cond: (id = cast_info.person_id)
Planning Time: 0.166 ms
Execution Time: 0.497 ms
In [50]:
printplans(__)
                                           QUERY PLAN
0   Limit  (cost=0.43..4.49 rows=10 width=26) (act...
1     ->  Nested Loop  (cost=0.43..895896.13 rows=...
2           ->  Seq Scan on cast_info  (cost=0.00....
3           ->  Memoize  (cost=0.43..0.47 rows=1 w...
4                      Cache Key: cast_info.person_id
5                                 Cache Mode: logical
6                 Hits: 2  Misses: 8  Evictions: 0...
7                 ->  Index Scan using actor_pkey ...
8                       Index Cond: (id = cast_inf...
9                             Planning Time: 0.166 ms
10                           Execution Time: 0.497 ms

Two-table demo: Projection¶

It is much easier to see query plans in psql!
jupysql dataframe visualization removes any whitespace.
In [51]:
%%sql
EXPLAIN ANALYZE
SELECT name, movie_id
FROM actors, cast_info
WHERE actors.id = cast_info.person_id;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
8 rows affected.
Out[51]:
QUERY PLAN
Hash Join (cost=29215.48..89168.21 rows=2211936 width=18) (actual time=236.467..1300.556 rows=2211936 loops=1)
  Hash Cond: (cast_info.person_id = actors.id)
  -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.009..205.463 rows=2211936 loops=1)
  -> Hash (cost=13684.88..13684.88 rows=845888 width=18) (actual time=236.305..236.307 rows=845888 loops=1)
        Buckets: 65536 Batches: 16 Memory Usage: 3191kB
        -> Seq Scan on actors (cost=0.00..13684.88 rows=845888 width=18) (actual time=0.063..83.934 rows=845888 loops=1)
Planning Time: 0.145 ms
Execution Time: 1363.600 ms
In [52]:
printplans(__)
                                          QUERY PLAN
0  Hash Join  (cost=29215.48..89168.21 rows=22119...
1       Hash Cond: (cast_info.person_id = actors.id)
2    ->  Seq Scan on cast_info  (cost=0.00..31907...
3    ->  Hash  (cost=13684.88..13684.88 rows=8458...
4          Buckets: 65536  Batches: 16  Memory Us...
5          ->  Seq Scan on actors  (cost=0.00..13...
6                            Planning Time: 0.145 ms
7                        Execution Time: 1363.600 ms


The below is not as substantial a reduction, but still about a quarter speed-up.

  • Notice that projection was pushed down below the join “at source”.
  • If we waited until join was done, would be at least as expensive.
In [53]:
%%sql
EXPLAIN ANALYZE
SELECT name, movie_id
FROM actors, cast_info
WHERE actors.id = cast_info.person_id AND actors.id > 4000000;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
10 rows affected.
Out[53]:
QUERY PLAN
Hash Join (cost=23988.61..81598.34 rows=1166260 width=18) (actual time=156.138..1079.895 rows=634763 loops=1)
  Hash Cond: (cast_info.person_id = actors.id)
  -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.009..208.164 rows=2211936 loops=1)
  -> Hash (cost=15799.60..15799.60 rows=446001 width=18) (actual time=155.629..155.631 rows=444781 loops=1)
        Buckets: 65536 Batches: 8 Memory Usage: 3335kB
        -> Seq Scan on actors (cost=0.00..15799.60 rows=446001 width=18) (actual time=0.265..76.297 rows=444781 loops=1)
              Filter: (id > 4000000)
              Rows Removed by Filter: 401107
Planning Time: 0.151 ms
Execution Time: 1098.684 ms
In [54]:
printplans(__)
                                          QUERY PLAN
0  Hash Join  (cost=23988.61..81598.34 rows=11662...
1       Hash Cond: (cast_info.person_id = actors.id)
2    ->  Seq Scan on cast_info  (cost=0.00..31907...
3    ->  Hash  (cost=15799.60..15799.60 rows=4460...
4          Buckets: 65536  Batches: 8  Memory Usa...
5          ->  Seq Scan on actors  (cost=0.00..15...
6                             Filter: (id > 4000000)
7                     Rows Removed by Filter: 401107
8                            Planning Time: 0.151 ms
9                        Execution Time: 1098.684 ms

Three-way joins¶

In [55]:
%%sql 
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
14 rows affected.
Out[55]:
QUERY PLAN
Limit (cost=0.86..9.46 rows=10 width=56) (actual time=0.071..0.152 rows=10 loops=1)
  -> Nested Loop (cost=0.86..1901329.72 rows=2211936 width=56) (actual time=0.071..0.149 rows=10 loops=1)
        -> Nested Loop (cost=0.43..895896.13 rows=2211936 width=26) (actual time=0.023..0.077 rows=10 loops=1)
              -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.008..0.010 rows=10 loops=1)
              -> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.006..0.006 rows=1 loops=10)
                    Cache Key: cast_info.person_id
                    Cache Mode: logical
                    Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB
                    -> Index Scan using actor_pkey on actors (cost=0.42..0.46 rows=1 width=18) (actual time=0.006..0.006 rows=1 loops=8)
                          Index Cond: (id = cast_info.person_id)
        -> Index Scan using movie_pkey on movies (cost=0.42..0.45 rows=1 width=30) (actual time=0.006..0.007 rows=1 loops=10)
              Index Cond: (id = cast_info.movie_id)
Planning Time: 0.533 ms
Execution Time: 0.442 ms
In [56]:
printplans(__)
                                           QUERY PLAN
0   Limit  (cost=0.86..9.46 rows=10 width=56) (act...
1     ->  Nested Loop  (cost=0.86..1901329.72 rows...
2           ->  Nested Loop  (cost=0.43..895896.13...
3                 ->  Seq Scan on cast_info  (cost...
4                 ->  Memoize  (cost=0.43..0.47 ro...
5                       Cache Key: cast_info.perso...
6                                 Cache Mode: logical
7                       Hits: 2  Misses: 8  Evicti...
8                       ->  Index Scan using actor...
9                             Index Cond: (id = ca...
10          ->  Index Scan using movie_pkey on mov...
11                Index Cond: (id = cast_info.movi...
12                            Planning Time: 0.533 ms
13                           Execution Time: 0.442 ms



Below, note the predicate pushdown in the sequential scan on actors! Again, copy-paste into psql if you can't see the whitespace formatting.

In [58]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
    AND name = 'Tom Hanks';
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
16 rows affected.
Out[58]:
QUERY PLAN
Gather (cost=10632.10..32056.15 rows=3 width=56) (actual time=175.633..295.328 rows=65 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -> Nested Loop (cost=9632.10..31055.85 rows=1 width=56) (actual time=173.023..288.885 rows=22 loops=3)
        -> Parallel Hash Join (cost=9631.68..31055.40 rows=1 width=26) (actual time=172.939..288.279 rows=22 loops=3)
              Hash Cond: (cast_info.person_id = actors.id)
              -> Parallel Seq Scan on cast_info (cost=0.00..19004.40 rows=921640 width=8) (actual time=0.030..104.277 rows=737312 loops=3)
              -> Parallel Hash (cost=9631.67..9631.67 rows=1 width=18) (actual time=21.469..21.470 rows=0 loops=3)
                    Buckets: 1024 Batches: 1 Memory Usage: 40kB
                    -> Parallel Seq Scan on actors (cost=0.00..9631.67 rows=1 width=18) (actual time=13.520..21.411 rows=0 loops=3)
                          Filter: (name = 'Tom Hanks'::text)
                          Rows Removed by Filter: 281962
        -> Index Scan using movie_pkey on movies (cost=0.42..0.45 rows=1 width=30) (actual time=0.027..0.027 rows=1 loops=65)
              Index Cond: (id = cast_info.movie_id)
Planning Time: 0.257 ms
Execution Time: 295.372 ms
In [59]:
printplans(__)
                                           QUERY PLAN
0   Gather  (cost=10632.10..32056.15 rows=3 width=...
1                                  Workers Planned: 2
2                                 Workers Launched: 2
3     ->  Nested Loop  (cost=9632.10..31055.85 row...
4           ->  Parallel Hash Join  (cost=9631.68....
5                 Hash Cond: (cast_info.person_id ...
6                 ->  Parallel Seq Scan on cast_in...
7                 ->  Parallel Hash  (cost=9631.67...
8                       Buckets: 1024  Batches: 1 ...
9                       ->  Parallel Seq Scan on a...
10                            Filter: (name = 'Tom...
11                            Rows Removed by Filt...
12          ->  Index Scan using movie_pkey on mov...
13                Index Cond: (id = cast_info.movi...
14                            Planning Time: 0.257 ms
15                         Execution Time: 295.372 ms



Compare with the below predicate pushdown, where the filter is now on movie titles:

In [60]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
    AND title LIKE 'Snakes on a Plane';
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
16 rows affected.
Out[60]:
QUERY PLAN
Gather (cost=9279.46..30704.82 rows=7 width=56) (actual time=325.931..329.236 rows=4 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -> Nested Loop (cost=8279.46..29704.12 rows=3 width=56) (actual time=222.206..323.329 rows=1 loops=3)
        -> Parallel Hash Join (cost=8279.04..29702.75 rows=3 width=38) (actual time=222.144..323.241 rows=1 loops=3)
              Hash Cond: (cast_info.movie_id = movies.id)
              -> Parallel Seq Scan on cast_info (cost=0.00..19004.40 rows=921640 width=8) (actual time=0.032..154.349 rows=737312 loops=3)
              -> Parallel Hash (cost=8279.03..8279.03 rows=1 width=30) (actual time=22.467..22.468 rows=0 loops=3)
                    Buckets: 1024 Batches: 1 Memory Usage: 40kB
                    -> Parallel Seq Scan on movies (cost=0.00..8279.03 rows=1 width=30) (actual time=17.099..22.391 rows=0 loops=3)
                          Filter: (title ~~ 'Snakes on a Plane'::text)
                          Rows Removed by Filter: 218817
        -> Index Scan using actor_pkey on actors (cost=0.42..0.46 rows=1 width=18) (actual time=0.058..0.058 rows=1 loops=4)
              Index Cond: (id = cast_info.person_id)
Planning Time: 0.270 ms
Execution Time: 329.274 ms
In [61]:
printplans(__)
                                           QUERY PLAN
0   Gather  (cost=9279.46..30704.82 rows=7 width=5...
1                                  Workers Planned: 2
2                                 Workers Launched: 2
3     ->  Nested Loop  (cost=8279.46..29704.12 row...
4           ->  Parallel Hash Join  (cost=8279.04....
5                 Hash Cond: (cast_info.movie_id =...
6                 ->  Parallel Seq Scan on cast_in...
7                 ->  Parallel Hash  (cost=8279.03...
8                       Buckets: 1024  Batches: 1 ...
9                       ->  Parallel Seq Scan on m...
10                            Filter: (title ~~ 'S...
11                            Rows Removed by Filt...
12          ->  Index Scan using actor_pkey on act...
13                Index Cond: (id = cast_info.pers...
14                            Planning Time: 0.270 ms
15                         Execution Time: 329.274 ms

Three-way joins with Indexes¶

In [62]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
14 rows affected.
Out[62]:
QUERY PLAN
Limit (cost=0.86..9.46 rows=10 width=56) (actual time=0.026..0.075 rows=10 loops=1)
  -> Nested Loop (cost=0.86..1901329.72 rows=2211936 width=56) (actual time=0.025..0.073 rows=10 loops=1)
        -> Nested Loop (cost=0.43..895896.13 rows=2211936 width=26) (actual time=0.020..0.053 rows=10 loops=1)
              -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.007..0.008 rows=10 loops=1)
              -> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.004..0.004 rows=1 loops=10)
                    Cache Key: cast_info.person_id
                    Cache Mode: logical
                    Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB
                    -> Index Scan using actor_pkey on actors (cost=0.42..0.46 rows=1 width=18) (actual time=0.004..0.004 rows=1 loops=8)
                          Index Cond: (id = cast_info.person_id)
        -> Index Scan using movie_pkey on movies (cost=0.42..0.45 rows=1 width=30) (actual time=0.002..0.002 rows=1 loops=10)
              Index Cond: (id = cast_info.movie_id)
Planning Time: 0.272 ms
Execution Time: 0.333 ms
In [63]:
printplans(__)
                                           QUERY PLAN
0   Limit  (cost=0.86..9.46 rows=10 width=56) (act...
1     ->  Nested Loop  (cost=0.86..1901329.72 rows...
2           ->  Nested Loop  (cost=0.43..895896.13...
3                 ->  Seq Scan on cast_info  (cost...
4                 ->  Memoize  (cost=0.43..0.47 ro...
5                       Cache Key: cast_info.perso...
6                                 Cache Mode: logical
7                       Hits: 2  Misses: 8  Evicti...
8                       ->  Index Scan using actor...
9                             Index Cond: (id = ca...
10          ->  Index Scan using movie_pkey on mov...
11                Index Cond: (id = cast_info.movi...
12                            Planning Time: 0.272 ms
13                           Execution Time: 0.333 ms



What if we dropped one of the indexes?

To do so we must drop the primary key constraint on actors.id:

In [64]:
%sql ALTER TABLE actors DROP CONSTRAINT actor_pkey CASCADE;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
Out[64]:
In [65]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
15 rows affected.
Out[65]:
QUERY PLAN
Limit (cost=16222.62..16228.33 rows=10 width=56) (actual time=594.208..846.259 rows=10 loops=1)
  -> Nested Loop (cost=16222.62..1278418.10 rows=2211936 width=56) (actual time=594.206..846.256 rows=10 loops=1)
        -> Gather (cost=16222.20..272984.51 rows=2211936 width=26) (actual time=594.154..846.183 rows=10 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              -> Parallel Hash Join (cost=15222.20..50790.91 rows=921640 width=26) (actual time=564.550..564.972 rows=810 loops=3)
                    Hash Cond: (cast_info.person_id = actors.id)
                    -> Parallel Seq Scan on cast_info (cost=0.00..19004.40 rows=921640 width=8) (actual time=0.032..203.254 rows=737312 loops=3)
                    -> Parallel Hash (cost=8750.53..8750.53 rows=352453 width=18) (actual time=111.486..111.487 rows=281963 loops=3)
                          Buckets: 65536 Batches: 16 Memory Usage: 0kB
                          -> Parallel Seq Scan on actors (cost=0.00..8750.53 rows=352453 width=18) (actual time=0.024..19.288 rows=281963 loops=3)
        -> Index Scan using movie_pkey on movies (cost=0.42..0.45 rows=1 width=30) (actual time=0.006..0.006 rows=1 loops=10)
              Index Cond: (id = cast_info.movie_id)
Planning Time: 0.286 ms
Execution Time: 846.299 ms
In [66]:
printplans(__)
                                           QUERY PLAN
0   Limit  (cost=16222.62..16228.33 rows=10 width=...
1     ->  Nested Loop  (cost=16222.62..1278418.10 ...
2           ->  Gather  (cost=16222.20..272984.51 ...
3                                  Workers Planned: 2
4                                 Workers Launched: 2
5                 ->  Parallel Hash Join  (cost=15...
6                       Hash Cond: (cast_info.pers...
7                       ->  Parallel Seq Scan on c...
8                       ->  Parallel Hash  (cost=8...
9                             Buckets: 65536  Batc...
10                            ->  Parallel Seq Sca...
11          ->  Index Scan using movie_pkey on mov...
12                Index Cond: (id = cast_info.movi...
13                            Planning Time: 0.286 ms
14                         Execution Time: 846.299 ms



What if we dropped both indexes?

In [67]:
%sql ALTER TABLE movies DROP CONSTRAINT movie_pkey CASCADE;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
Out[67]:
In [68]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
17 rows affected.
Out[68]:
QUERY PLAN
Limit (cost=29106.44..29107.70 rows=10 width=56) (actual time=2159.897..2283.595 rows=10 loops=1)
  -> Gather (cost=29106.44..307637.07 rows=2211936 width=56) (actual time=2159.896..2283.591 rows=10 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        -> Parallel Hash Join (cost=28106.44..85443.47 rows=921640 width=56) (actual time=2155.183..2155.601 rows=4 loops=3)
              Hash Cond: (cast_info.movie_id = movies.id)
              -> Parallel Hash Join (cost=15222.20..50790.91 rows=921640 width=26) (actual time=692.958..1507.575 rows=737312 loops=3)
                    Hash Cond: (cast_info.person_id = actors.id)
                    -> Parallel Seq Scan on cast_info (cost=0.00..19004.40 rows=921640 width=8) (actual time=0.033..169.595 rows=737312 loops=3)
                    -> Parallel Hash (cost=8750.53..8750.53 rows=352453 width=18) (actual time=195.728..195.729 rows=281963 loops=3)
                          Buckets: 65536 Batches: 16 Memory Usage: 3360kB
                          -> Parallel Seq Scan on actors (cost=0.00..8750.53 rows=352453 width=18) (actual time=0.026..41.592 rows=281963 loops=3)
              -> Parallel Hash (cost=7595.22..7595.22 rows=273522 width=30) (actual time=71.091..71.092 rows=218818 loops=3)
                    Buckets: 65536 Batches: 16 Memory Usage: 0kB
                    -> Parallel Seq Scan on movies (cost=0.00..7595.22 rows=273522 width=30) (actual time=0.019..16.267 rows=218818 loops=3)
Planning Time: 0.209 ms
Execution Time: 2283.632 ms
In [69]:
printplans(__)
                                           QUERY PLAN
0   Limit  (cost=29106.44..29107.70 rows=10 width=...
1     ->  Gather  (cost=29106.44..307637.07 rows=2...
2                                  Workers Planned: 2
3                                 Workers Launched: 2
4           ->  Parallel Hash Join  (cost=28106.44...
5                 Hash Cond: (cast_info.movie_id =...
6                 ->  Parallel Hash Join  (cost=15...
7                       Hash Cond: (cast_info.pers...
8                       ->  Parallel Seq Scan on c...
9                       ->  Parallel Hash  (cost=8...
10                            Buckets: 65536  Batc...
11                            ->  Parallel Seq Sca...
12                ->  Parallel Hash  (cost=7595.22...
13                      Buckets: 65536  Batches: 1...
14                      ->  Parallel Seq Scan on m...
15                            Planning Time: 0.209 ms
16                        Execution Time: 2283.632 ms

Cleanup¶

We close the connection, then drop the database:

In [70]:
%sql --close postgresql://127.0.0.1:5432/imdb_perf_lecture
In [71]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
ERROR:  database "imdb_perf_lecture" is being accessed by other users
DETAIL:  There is 1 other session using the database.