Lecture 08¶

In [1]:
# Run this cell to set up imports
import numpy as np
import pandas as pd

Load in the IMDB Performance database¶

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! So if you run the below lines, please also remember to delete the imdb_perf_lecture afterwards to save space on your limited postgreSQL server.

We assume you have the associated lecture folder lec06 pulled into your repo already. The below commands create a symbolic link (i.e., shortcut/redirect with ln) to this lecture data directory, allowing some space saving, and unzip the database file.

In [2]:
!ln -sf ../../lec/lec06/data .
!unzip -u data/imdb_perf_lecture.zip -d data/
Archive:  data/imdb_perf_lecture.zip
In [3]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f data/imdb_perf_lecture.sql
DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 845888
COPY 2211936
COPY 656453
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE

Start jupysql¶

In [4]:
%reload_ext sql
In [5]:
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture

If you're having trouble seeing the entirety of query plans, you can run the following cell to set the limit on displayed rows to 20. Careful: Do not set this to None and run the actual queries; SQL will return millions of rows and crash your kernel!

In [6]:
# run this cell to remove 10-row limit on display
%config SqlMagic.displaylimit = 20

Matching¶

It is much easier to see query plans in psql!
jupysql dataframe visualization removes any whitespace.

You can also run (after each cell):

result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
In [7]:
%%sql
/* 1 */
EXPLAIN ANALYZE
SELECT id FROM actors
WHERE id > 4000000 AND
name='Tom Hanks';
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
8 rows affected.
Out[7]:
QUERY PLAN
Gather (cost=1000.00..11653.80 rows=1410 width=4) (actual time=27.728..30.058 rows=0 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -> Parallel Seq Scan on actors (cost=0.00..10512.80 rows=588 width=4) (actual time=24.968..24.969 rows=0 loops=3)
        Filter: ((id > 4000000) AND (name = 'Tom Hanks'::text))
        Rows Removed by Filter: 281963
Planning Time: 0.147 ms
Execution Time: 30.076 ms
In [8]:
%%sql
/* 2 */
EXPLAIN ANALYZE
SELECT id FROM actors
ORDER BY name
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
12 rows affected.
Out[8]:
QUERY PLAN
Limit (cost=17366.94..17368.11 rows=10 width=36) (actual time=140.249..142.642 rows=10 loops=1)
  -> Gather Merge (cost=17366.94..99611.72 rows=704906 width=36) (actual time=140.248..142.637 rows=10 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        -> Sort (cost=16366.92..17248.05 rows=352453 width=36) (actual time=137.363..137.365 rows=8 loops=3)
              Sort Key: name
              Sort Method: top-N heapsort Memory: 26kB
              Worker 0: Sort Method: top-N heapsort Memory: 26kB
              Worker 1: Sort Method: top-N heapsort Memory: 26kB
              -> Parallel Seq Scan on actors (cost=0.00..8750.53 rows=352453 width=36) (actual time=0.021..49.143 rows=281963 loops=3)
Planning Time: 0.108 ms
Execution Time: 142.692 ms
In [9]:
%%sql
/* 3 */
EXPLAIN ANALYZE
SELECT id FROM actors
ORDER BY id
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
5 rows affected.
Out[9]:
QUERY PLAN
Limit (cost=0.42..0.93 rows=10 width=4) (actual time=0.089..0.093 rows=10 loops=1)
  -> Index Only Scan using actor_pkey on actors (cost=0.42..42884.75 rows=845888 width=4) (actual time=0.088..0.091 rows=10 loops=1)
        Heap Fetches: 10
Planning Time: 0.081 ms
Execution Time: 0.106 ms

Two-table demo: LIMIT¶

Let's join two tables, actors and cast_info. The query planner selects a hash join:

In [10]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info
WHERE actors.id = cast_info.person_id;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
8 rows affected.
Out[10]:
QUERY PLAN
Hash Join (cost=30867.48..92474.12 rows=2212088 width=44) (actual time=190.552..1385.646 rows=2211936 loops=1)
  Hash Cond: (cast_info.person_id = actors.id)
  -> Seq Scan on cast_info (cost=0.00..31908.88 rows=2212088 width=8) (actual time=0.021..160.925 rows=2211936 loops=1)
  -> Hash (cost=13684.88..13684.88 rows=845888 width=36) (actual time=190.117..190.118 rows=845888 loops=1)
        Buckets: 65536 Batches: 32 Memory Usage: 1810kB
        -> Seq Scan on actors (cost=0.00..13684.88 rows=845888 width=36) (actual time=0.006..68.439 rows=845888 loops=1)
Planning Time: 0.221 ms
Execution Time: 1449.252 ms
In [11]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
                                          QUERY PLAN
0  Hash Join  (cost=30867.48..92474.12 rows=22120...
1       Hash Cond: (cast_info.person_id = actors.id)
2    ->  Seq Scan on cast_info  (cost=0.00..31908...
3    ->  Hash  (cost=13684.88..13684.88 rows=8458...
4          Buckets: 65536  Batches: 32  Memory Us...
5          ->  Seq Scan on actors  (cost=0.00..13...
6                            Planning Time: 0.221 ms
7                        Execution Time: 1449.252 ms



Below, we add LIMIT. Note the query planner switches to a nested loop join, using an index scan to match cast_info.person_id to the indexed attribute actors.id! This results in a 10,000x speedup!

In [12]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info
WHERE actors.id = cast_info.person_id
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
7 rows affected.
Out[12]:
QUERY PLAN
Limit (cost=0.42..5.13 rows=10 width=44) (actual time=0.052..0.165 rows=10 loops=1)
  -> Nested Loop (cost=0.42..1040953.73 rows=2212088 width=44) (actual time=0.051..0.163 rows=10 loops=1)
        -> Seq Scan on cast_info (cost=0.00..31908.88 rows=2212088 width=8) (actual time=0.025..0.026 rows=10 loops=1)
        -> Index Scan using actor_pkey on actors (cost=0.42..0.46 rows=1 width=36) (actual time=0.013..0.013 rows=1 loops=10)
              Index Cond: (id = cast_info.person_id)
Planning Time: 0.112 ms
Execution Time: 0.185 ms
In [13]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
                                          QUERY PLAN
0  Limit  (cost=0.42..5.13 rows=10 width=44) (act...
1    ->  Nested Loop  (cost=0.42..1040953.73 rows...
2          ->  Seq Scan on cast_info  (cost=0.00....
3          ->  Index Scan using actor_pkey on act...
4                Index Cond: (id = cast_info.pers...
5                            Planning Time: 0.112 ms
6                           Execution Time: 0.185 ms

Two-table demo: Projection¶

It is much easier to see query plans in psql!
jupysql dataframe visualization removes any whitespace.
In [14]:
%%sql
EXPLAIN ANALYZE
SELECT name, movie_id
FROM actors, cast_info
WHERE actors.id = cast_info.person_id;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
8 rows affected.
Out[14]:
QUERY PLAN
Hash Join (cost=30867.48..92474.12 rows=2212088 width=36) (actual time=258.110..1695.561 rows=2211936 loops=1)
  Hash Cond: (cast_info.person_id = actors.id)
  -> Seq Scan on cast_info (cost=0.00..31908.88 rows=2212088 width=8) (actual time=0.012..293.502 rows=2211936 loops=1)
  -> Hash (cost=13684.88..13684.88 rows=845888 width=36) (actual time=257.724..257.726 rows=845888 loops=1)
        Buckets: 65536 Batches: 32 Memory Usage: 1848kB
        -> Seq Scan on actors (cost=0.00..13684.88 rows=845888 width=36) (actual time=0.007..102.280 rows=845888 loops=1)
Planning Time: 0.092 ms
Execution Time: 1761.153 ms
In [15]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
                                          QUERY PLAN
0  Hash Join  (cost=30867.48..92474.12 rows=22120...
1       Hash Cond: (cast_info.person_id = actors.id)
2    ->  Seq Scan on cast_info  (cost=0.00..31908...
3    ->  Hash  (cost=13684.88..13684.88 rows=8458...
4          Buckets: 65536  Batches: 32  Memory Us...
5          ->  Seq Scan on actors  (cost=0.00..13...
6                            Planning Time: 0.092 ms
7                        Execution Time: 1761.153 ms


The below is not as substantial a reduction, but still about a quarter speed-up.

  • Notice that projection was pushed down below the join “at source”.
  • If we waited until join was done, would be at least as expensive.
In [16]:
%%sql
EXPLAIN ANALYZE
SELECT name, movie_id
FROM actors, cast_info
WHERE actors.id = cast_info.person_id AND actors.id > 4000000;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
12 rows affected.
Out[16]:
QUERY PLAN
Hash Join (cost=19763.71..76964.36 rows=737364 width=36) (actual time=169.979..1279.023 rows=634763 loops=1)
  Hash Cond: (cast_info.person_id = actors.id)
  -> Seq Scan on cast_info (cost=0.00..31908.88 rows=2212088 width=8) (actual time=0.009..253.033 rows=2211936 loops=1)
  -> Hash (cost=14036.18..14036.18 rows=281963 width=36) (actual time=169.158..169.164 rows=444781 loops=1)
        Buckets: 65536 Batches: 8 Memory Usage: 3335kB
        -> Bitmap Heap Scan on actors (cost=5285.64..14036.18 rows=281963 width=36) (actual time=25.041..87.701 rows=444781 loops=1)
              Recheck Cond: (id > 4000000)
              Heap Blocks: exact=3088
              -> Bitmap Index Scan on actor_pkey (cost=0.00..5215.15 rows=281963 width=0) (actual time=24.660..24.660 rows=444781 loops=1)
                    Index Cond: (id > 4000000)
Planning Time: 0.123 ms
Execution Time: 1330.343 ms
In [17]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
                                           QUERY PLAN
0   Hash Join  (cost=19763.71..76964.36 rows=73736...
1        Hash Cond: (cast_info.person_id = actors.id)
2     ->  Seq Scan on cast_info  (cost=0.00..31908...
3     ->  Hash  (cost=14036.18..14036.18 rows=2819...
4           Buckets: 65536  Batches: 8  Memory Usa...
5           ->  Bitmap Heap Scan on actors  (cost=...
6                        Recheck Cond: (id > 4000000)
7                             Heap Blocks: exact=3088
8                 ->  Bitmap Index Scan on actor_p...
9                          Index Cond: (id > 4000000)
10                            Planning Time: 0.123 ms
11                        Execution Time: 1330.343 ms

Three-way joins¶

In [18]:
%%sql 
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
10 rows affected.
Out[18]:
QUERY PLAN
Limit (cost=0.85..10.10 rows=10 width=74) (actual time=0.066..0.116 rows=10 loops=1)
  -> Nested Loop (cost=0.85..2046318.54 rows=2211936 width=74) (actual time=0.065..0.114 rows=10 loops=1)
        -> Nested Loop (cost=0.42..1040884.95 rows=2211936 width=44) (actual time=0.017..0.052 rows=10 loops=1)
              -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.007..0.008 rows=10 loops=1)
              -> Index Scan using actor_pkey on actors (cost=0.42..0.46 rows=1 width=36) (actual time=0.004..0.004 rows=1 loops=10)
                    Index Cond: (id = cast_info.person_id)
        -> Index Scan using movie_pkey on movies (cost=0.42..0.45 rows=1 width=30) (actual time=0.006..0.006 rows=1 loops=10)
              Index Cond: (id = cast_info.movie_id)
Planning Time: 0.336 ms
Execution Time: 0.138 ms
In [19]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
                                          QUERY PLAN
0  Limit  (cost=0.85..10.10 rows=10 width=74) (ac...
1    ->  Nested Loop  (cost=0.85..2046318.54 rows...
2          ->  Nested Loop  (cost=0.42..1040884.9...
3                ->  Seq Scan on cast_info  (cost...
4                ->  Index Scan using actor_pkey ...
5                      Index Cond: (id = cast_inf...
6          ->  Index Scan using movie_pkey on mov...
7                Index Cond: (id = cast_info.movi...
8                            Planning Time: 0.336 ms
9                           Execution Time: 0.138 ms



Below, note the predicate pushdown in the sequential scan on actors! Again, copy-paste into psql if you can't see the whitespace formatting.

In [20]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
    AND name = 'Tom Hanks';
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
16 rows affected.
Out[20]:
QUERY PLAN
Gather (cost=10654.12..35277.87 rows=11059 width=74) (actual time=199.597..394.324 rows=65 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -> Nested Loop (cost=9654.12..33171.97 rows=4608 width=74) (actual time=222.089..387.662 rows=22 loops=3)
        -> Parallel Hash Join (cost=9653.69..31077.41 rows=4608 width=44) (actual time=221.993..387.051 rows=22 loops=3)
              Hash Cond: (cast_info.person_id = actors.id)
              -> Parallel Seq Scan on cast_info (cost=0.00..19004.40 rows=921640 width=8) (actual time=0.041..141.876 rows=737312 loops=3)
              -> Parallel Hash (cost=9631.67..9631.67 rows=1762 width=36) (actual time=56.761..56.762 rows=0 loops=3)
                    Buckets: 8192 Batches: 1 Memory Usage: 96kB
                    -> Parallel Seq Scan on actors (cost=0.00..9631.67 rows=1762 width=36) (actual time=16.331..25.688 rows=0 loops=3)
                          Filter: (name = 'Tom Hanks'::text)
                          Rows Removed by Filter: 281962
        -> Index Scan using movie_pkey on movies (cost=0.42..0.45 rows=1 width=30) (actual time=0.027..0.027 rows=1 loops=65)
              Index Cond: (id = cast_info.movie_id)
Planning Time: 0.155 ms
Execution Time: 394.368 ms
In [21]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
                                           QUERY PLAN
0   Gather  (cost=10654.12..35277.87 rows=11059 wi...
1                                  Workers Planned: 2
2                                 Workers Launched: 2
3     ->  Nested Loop  (cost=9654.12..33171.97 row...
4           ->  Parallel Hash Join  (cost=9653.69....
5                 Hash Cond: (cast_info.person_id ...
6                 ->  Parallel Seq Scan on cast_in...
7                 ->  Parallel Hash  (cost=9631.67...
8                       Buckets: 8192  Batches: 1 ...
9                       ->  Parallel Seq Scan on a...
10                            Filter: (name = 'Tom...
11                            Rows Removed by Filt...
12          ->  Index Scan using movie_pkey on mov...
13                Index Cond: (id = cast_info.movi...
14                            Planning Time: 0.155 ms
15                         Execution Time: 394.368 ms



Compare with the below predicate pushdown, where the filter is now on movie titles:

In [22]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
    AND title LIKE 'Snakes on a Plane';
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
16 rows affected.
Out[22]:
QUERY PLAN
Gather (cost=9279.46..30704.82 rows=7 width=74) (actual time=402.117..405.350 rows=4 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -> Nested Loop (cost=8279.46..29704.12 rows=3 width=74) (actual time=273.025..399.542 rows=1 loops=3)
        -> Parallel Hash Join (cost=8279.04..29702.75 rows=3 width=38) (actual time=272.947..399.439 rows=1 loops=3)
              Hash Cond: (cast_info.movie_id = movies.id)
              -> Parallel Seq Scan on cast_info (cost=0.00..19004.40 rows=921640 width=8) (actual time=0.030..212.064 rows=737312 loops=3)
              -> Parallel Hash (cost=8279.03..8279.03 rows=1 width=30) (actual time=22.482..22.483 rows=0 loops=3)
                    Buckets: 1024 Batches: 1 Memory Usage: 40kB
                    -> Parallel Seq Scan on movies (cost=0.00..8279.03 rows=1 width=30) (actual time=17.106..22.422 rows=0 loops=3)
                          Filter: (title ~~ 'Snakes on a Plane'::text)
                          Rows Removed by Filter: 218817
        -> Index Scan using actor_pkey on actors (cost=0.42..0.46 rows=1 width=36) (actual time=0.070..0.070 rows=1 loops=4)
              Index Cond: (id = cast_info.person_id)
Planning Time: 0.278 ms
Execution Time: 405.404 ms
In [23]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
                                           QUERY PLAN
0   Gather  (cost=9279.46..30704.82 rows=7 width=7...
1                                  Workers Planned: 2
2                                 Workers Launched: 2
3     ->  Nested Loop  (cost=8279.46..29704.12 row...
4           ->  Parallel Hash Join  (cost=8279.04....
5                 Hash Cond: (cast_info.movie_id =...
6                 ->  Parallel Seq Scan on cast_in...
7                 ->  Parallel Hash  (cost=8279.03...
8                       Buckets: 1024  Batches: 1 ...
9                       ->  Parallel Seq Scan on m...
10                            Filter: (title ~~ 'S...
11                            Rows Removed by Filt...
12          ->  Index Scan using actor_pkey on act...
13                Index Cond: (id = cast_info.pers...
14                            Planning Time: 0.278 ms
15                         Execution Time: 405.404 ms

Three-way joins with Indexes¶

In [24]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
14 rows affected.
Out[24]:
QUERY PLAN
Limit (cost=0.86..9.60 rows=10 width=74) (actual time=0.042..0.119 rows=10 loops=1)
  -> Nested Loop (cost=0.86..1932285.82 rows=2211936 width=74) (actual time=0.041..0.117 rows=10 loops=1)
        -> Nested Loop (cost=0.43..926852.23 rows=2211936 width=44) (actual time=0.030..0.082 rows=10 loops=1)
              -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.011..0.012 rows=10 loops=1)
              -> Memoize (cost=0.43..0.47 rows=1 width=36) (actual time=0.006..0.006 rows=1 loops=10)
                    Cache Key: cast_info.person_id
                    Cache Mode: logical
                    Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB
                    -> Index Scan using actor_pkey on actors (cost=0.42..0.46 rows=1 width=36) (actual time=0.006..0.006 rows=1 loops=8)
                          Index Cond: (id = cast_info.person_id)
        -> Index Scan using movie_pkey on movies (cost=0.42..0.45 rows=1 width=30) (actual time=0.003..0.003 rows=1 loops=10)
              Index Cond: (id = cast_info.movie_id)
Planning Time: 0.232 ms
Execution Time: 1.233 ms
In [25]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
                                           QUERY PLAN
0   Limit  (cost=0.86..9.60 rows=10 width=74) (act...
1     ->  Nested Loop  (cost=0.86..1932285.82 rows...
2           ->  Nested Loop  (cost=0.43..926852.23...
3                 ->  Seq Scan on cast_info  (cost...
4                 ->  Memoize  (cost=0.43..0.47 ro...
5                       Cache Key: cast_info.perso...
6                                 Cache Mode: logical
7                       Hits: 2  Misses: 8  Evicti...
8                       ->  Index Scan using actor...
9                             Index Cond: (id = ca...
10          ->  Index Scan using movie_pkey on mov...
11                Index Cond: (id = cast_info.movi...
12                            Planning Time: 0.232 ms
13                           Execution Time: 1.233 ms



What if we dropped one of the indexes?

To do so we must drop the primary key constraint on actors.id:

In [26]:
%sql ALTER TABLE actors DROP CONSTRAINT actor_pkey CASCADE;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
Out[26]:
In [27]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
12 rows affected.
Out[27]:
QUERY PLAN
Limit (cost=0.42..25271.13 rows=10 width=74) (actual time=295.628..296.522 rows=10 loops=1)
  -> Nested Loop (cost=0.42..35381485254.95 rows=14000989 width=74) (actual time=295.627..296.519 rows=10 loops=1)
        -> Nested Loop (cost=0.00..35375121117.60 rows=14000989 width=44) (actual time=295.591..296.409 rows=10 loops=1)
              Join Filter: (actors.id = cast_info.person_id)
              Rows Removed by Join Filter: 1118182
              -> Seq Scan on actors (cost=0.00..13684.88 rows=845888 width=36) (actual time=0.009..0.010 rows=1 loops=1)
              -> Materialize (cost=0.00..51608.04 rows=2211936 width=8) (actual time=0.007..230.778 rows=1118192 loops=1)
                    -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.005..75.815 rows=1118192 loops=1)
        -> Index Scan using movie_pkey on movies (cost=0.42..0.45 rows=1 width=30) (actual time=0.010..0.010 rows=1 loops=10)
              Index Cond: (id = cast_info.movie_id)
Planning Time: 0.401 ms
Execution Time: 300.628 ms
In [28]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
                                           QUERY PLAN
0   Limit  (cost=0.42..25271.13 rows=10 width=74) ...
1     ->  Nested Loop  (cost=0.42..35381485254.95 ...
2           ->  Nested Loop  (cost=0.00..353751211...
3                 Join Filter: (actors.id = cast_i...
4                 Rows Removed by Join Filter: 111...
5                 ->  Seq Scan on actors  (cost=0....
6                 ->  Materialize  (cost=0.00..516...
7                       ->  Seq Scan on cast_info ...
8           ->  Index Scan using movie_pkey on mov...
9                 Index Cond: (id = cast_info.movi...
10                            Planning Time: 0.401 ms
11                         Execution Time: 300.628 ms



What if we dropped both indexes?

In [29]:
%sql ALTER TABLE movies DROP CONSTRAINT movie_pkey CASCADE;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
Out[29]:
In [30]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
17 rows affected.
Out[30]:
QUERY PLAN
Limit (cost=13884.25..44371.17 rows=10 width=74) (actual time=628.635..1815.969 rows=10 loops=1)
  -> Nested Loop (cost=13884.25..42684723061.69 rows=14000989 width=74) (actual time=628.634..1815.952 rows=10 loops=1)
        Join Filter: (cast_info.person_id = actors.id)
        Rows Removed by Join Filter: 7743513
        -> Gather (cost=13884.25..270450.57 rows=2211936 width=38) (actual time=608.599..639.809 rows=10 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              -> Parallel Hash Join (cost=12884.25..48256.97 rows=921640 width=38) (actual time=603.671..604.109 rows=673 loops=3)
                    Hash Cond: (cast_info.movie_id = movies.id)
                    -> Parallel Seq Scan on cast_info (cost=0.00..19004.40 rows=921640 width=8) (actual time=0.022..189.338 rows=737312 loops=3)
                    -> Parallel Hash (cost=7595.22..7595.22 rows=273522 width=30) (actual time=115.192..115.193 rows=218818 loops=3)
                          Buckets: 65536 Batches: 16 Memory Usage: 0kB
                          -> Parallel Seq Scan on movies (cost=0.00..7595.22 rows=273522 width=30) (actual time=0.019..17.209 rows=218818 loops=3)
        -> Materialize (cost=0.00..24523.32 rows=845888 width=36) (actual time=0.004..73.982 rows=774352 loops=10)
              -> Seq Scan on actors (cost=0.00..13684.88 rows=845888 width=36) (actual time=0.015..53.748 rows=845888 loops=1)
Planning Time: 0.187 ms
Execution Time: 1822.068 ms
In [31]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
                                           QUERY PLAN
0   Limit  (cost=13884.25..44371.17 rows=10 width=...
1     ->  Nested Loop  (cost=13884.25..42684723061...
2           Join Filter: (cast_info.person_id = ac...
3                Rows Removed by Join Filter: 7743513
4           ->  Gather  (cost=13884.25..270450.57 ...
5                                  Workers Planned: 2
6                                 Workers Launched: 2
7                 ->  Parallel Hash Join  (cost=12...
8                       Hash Cond: (cast_info.movi...
9                       ->  Parallel Seq Scan on c...
10                      ->  Parallel Hash  (cost=7...
11                            Buckets: 65536  Batc...
12                            ->  Parallel Seq Sca...
13          ->  Materialize  (cost=0.00..24523.32 ...
14                ->  Seq Scan on actors  (cost=0....
15                            Planning Time: 0.187 ms
16                        Execution Time: 1822.068 ms

Cleanup¶

We close the connection, then drop the database:

In [32]:
%sql --close postgresql://127.0.0.1:5432/imdb_perf_lecture
In [33]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
DROP DATABASE