Lecture 09: Query Optimization II¶

New IMDB Performance database¶

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! So if you run the below lines, please also remember to delete the imdb_perf_lecture afterwards to save space on your limited postgreSQL server.

If you didn't load it in with a previous lecture, load in the imdb_perf_lecture database.

In [3]:
!unzip -u ../lec07/data/imdb_perf_lecture.zip -d ../lec07/data/
Archive:  ../lec07/data/imdb_perf_lecture.zip
In [4]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f ../lec07/data/imdb_perf_lecture.sql
DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
psql:../lec07/data/imdb_perf_lecture.sql:33: ERROR:  role "yanlisa" does not exist
CREATE TABLE
psql:../lec07/data/imdb_perf_lecture.sql:45: ERROR:  role "yanlisa" does not exist
CREATE TABLE
psql:../lec07/data/imdb_perf_lecture.sql:59: ERROR:  role "yanlisa" does not exist
COPY 845888
COPY 2211936
COPY 656453
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
In [2]:
%reload_ext sql
There's a new jupysql version available (0.10.1), you're running 0.10.0. To upgrade: pip install jupysql --upgrade
In [7]:
%config SqlMagic.displaylimit = None
displaylimit: Value None will be treated as 0 (no limit)
In [3]:
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture

Demo¶

In [4]:
%%sql
explain analyze
select *
from actor, cast_info
where actor.id = cast_info.person_id;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
8 rows affected.
Out[4]:
QUERY PLAN
Hash Join (cost=29215.48..89168.21 rows=2211936 width=26) (actual time=1204.592..2899.130 rows=2211936 loops=1)
  Hash Cond: (cast_info.person_id = actor.id)
  -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=32.664..576.889 rows=2211936 loops=1)
  -> Hash (cost=13684.88..13684.88 rows=845888 width=18) (actual time=1171.392..1171.394 rows=845888 loops=1)
        Buckets: 65536 Batches: 16 Memory Usage: 3114kB
        -> Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=18) (actual time=21.178..1044.234 rows=845888 loops=1)
Planning Time: 219.930 ms
Execution Time: 2965.872 ms

note hash join

In [8]:
%%sql
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id
limit 10 ;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
11 rows affected.
Out[8]:
QUERY PLAN
Limit (cost=0.43..4.45 rows=10 width=18) (actual time=0.038..0.096 rows=10 loops=1)
  -> Nested Loop (cost=0.43..888554.34 rows=2211936 width=18) (actual time=0.036..0.093 rows=10 loops=1)
        -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.014..0.016 rows=10 loops=1)
        -> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.007..0.007 rows=1 loops=10)
              Cache Key: cast_info.person_id
              Cache Mode: logical
              Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB
              -> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.007..0.007 rows=1 loops=8)
                    Index Cond: (id = cast_info.person_id)
Planning Time: 0.241 ms
Execution Time: 0.470 ms

notice reduced time by 6000x!

In [10]:
%%sql
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
8 rows affected.
Out[10]:
QUERY PLAN
Hash Join (cost=29215.48..89168.21 rows=2211936 width=18) (actual time=287.396..1435.852 rows=2211936 loops=1)
  Hash Cond: (cast_info.person_id = actor.id)
  -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.013..223.331 rows=2211936 loops=1)
  -> Hash (cost=13684.88..13684.88 rows=845888 width=18) (actual time=287.183..287.185 rows=845888 loops=1)
        Buckets: 65536 Batches: 16 Memory Usage: 3191kB
        -> Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=18) (actual time=0.067..108.863 rows=845888 loops=1)
Planning Time: 0.212 ms
Execution Time: 1502.998 ms

notice not as substantial a reduction but still 2x.



In [11]:
%%sql
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id and actor.id > 4000000;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
10 rows affected.
Out[11]:
QUERY PLAN
Hash Join (cost=23971.10..81574.83 rows=1163852 width=18) (actual time=173.505..1135.131 rows=634763 loops=1)
  Hash Cond: (cast_info.person_id = actor.id)
  -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.008..210.031 rows=2211936 loops=1)
  -> Hash (cost=15799.60..15799.60 rows=445080 width=18) (actual time=172.984..172.985 rows=444781 loops=1)
        Buckets: 65536 Batches: 8 Memory Usage: 3335kB
        -> Seq Scan on actor (cost=0.00..15799.60 rows=445080 width=18) (actual time=0.293..90.924 rows=444781 loops=1)
              Filter: (id > 4000000)
              Rows Removed by Filter: 401107
Planning Time: 0.163 ms
Execution Time: 1155.587 ms

notice that projection was pushed down below the join “at source”. If we waited until join was done, would be at least as exp



In [12]:
%%sql
-- repeat previous query
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id
limit 10 ;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
11 rows affected.
Out[12]:
QUERY PLAN
Limit (cost=0.43..4.45 rows=10 width=18) (actual time=0.023..0.061 rows=10 loops=1)
  -> Nested Loop (cost=0.43..888554.34 rows=2211936 width=18) (actual time=0.023..0.059 rows=10 loops=1)
        -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.008..0.009 rows=10 loops=1)
        -> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.005..0.005 rows=1 loops=10)
              Cache Key: cast_info.person_id
              Cache Mode: logical
              Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB
              -> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.005..0.005 rows=1 loops=8)
                    Index Cond: (id = cast_info.person_id)
Planning Time: 0.234 ms
Execution Time: 0.367 ms
In [19]:
%%sql
explain analyze
select actor.name,movie_id,actor.id
from actor, cast_info
where actor.id = cast_info.person_id
limit 10 ;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
11 rows affected.
Out[19]:
QUERY PLAN
Limit (cost=0.43..4.45 rows=10 width=22) (actual time=0.075..0.137 rows=10 loops=1)
  -> Nested Loop (cost=0.43..888554.34 rows=2211936 width=22) (actual time=0.074..0.134 rows=10 loops=1)
        -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.011..0.013 rows=10 loops=1)
        -> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.011..0.011 rows=1 loops=10)
              Cache Key: cast_info.person_id
              Cache Mode: logical
              Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB
              -> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.007..0.007 rows=1 loops=8)
                    Index Cond: (id = cast_info.person_id)
Planning Time: 0.265 ms
Execution Time: 0.445 ms

^ width of the seq scan is same as in the previous query, while the width after the hash join is smaller in the former – this is because actor.id is already extracted for the predicate

In [15]:
%%sql
explain analyze
select *
from actor, cast_info, movie
where actor.id = cast_info.person_id and movie.id = cast_info.movie_id
limit 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
14 rows affected.
Out[15]:
QUERY PLAN
Limit (cost=0.86..9.42 rows=10 width=56) (actual time=71.868..71.960 rows=10 loops=1)
  -> Nested Loop (cost=0.86..1893991.93 rows=2211936 width=56) (actual time=71.867..71.957 rows=10 loops=1)
        -> Nested Loop (cost=0.43..888554.34 rows=2211936 width=26) (actual time=0.021..0.090 rows=10 loops=1)
              -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.007..0.009 rows=10 loops=1)
              -> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.007..0.007 rows=1 loops=10)
                    Cache Key: cast_info.person_id
                    Cache Mode: logical
                    Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB
                    -> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.006..0.007 rows=1 loops=8)
                          Index Cond: (id = cast_info.person_id)
        -> Index Scan using movie_pkey on movie (cost=0.42..0.45 rows=1 width=30) (actual time=7.186..7.186 rows=1 loops=10)
              Index Cond: (id = cast_info.movie_id)
Planning Time: 0.300 ms
Execution Time: 72.262 ms
In [ ]:
# ^ note 2 hash joins where actor and cast_info are joined, followed by a join with movie
In [14]:
%%sql
explain analyze
select *
from actor, cast_info, movie
where actor.id = cast_info.person_id and movie.id = cast_info.movie_id and name='Hanks, Tom' ;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
16 rows affected.
Out[14]:
QUERY PLAN
Gather (cost=10632.10..32056.15 rows=3 width=56) (actual time=60.662..62.390 rows=0 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -> Nested Loop (cost=9632.10..31055.85 rows=1 width=56) (actual time=25.098..25.102 rows=0 loops=3)
        -> Parallel Hash Join (cost=9631.68..31055.40 rows=1 width=26) (actual time=25.097..25.100 rows=0 loops=3)
              Hash Cond: (cast_info.person_id = actor.id)
              -> Parallel Seq Scan on cast_info (cost=0.00..19004.40 rows=921640 width=8) (never executed)
              -> Parallel Hash (cost=9631.67..9631.67 rows=1 width=18) (actual time=24.962..24.963 rows=0 loops=3)
                    Buckets: 1024 Batches: 1 Memory Usage: 0kB
                    -> Parallel Seq Scan on actor (cost=0.00..9631.67 rows=1 width=18) (actual time=24.891..24.891 rows=0 loops=3)
                          Filter: (name = 'Hanks, Tom'::text)
                          Rows Removed by Filter: 281963
        -> Index Scan using movie_pkey on movie (cost=0.42..0.45 rows=1 width=30) (never executed)
              Index Cond: (id = cast_info.movie_id)
Planning Time: 138.613 ms
Execution Time: 62.427 ms
In [ ]:
# ^ got rid of limit; same as before, except notice the push down of seq scan 
In [ ]:
%%sql explain analyze
select *
from actor, cast_info, movie
where actor.id = cast_info.person_id and movie.id = cast_info.movie_id and title='Snakes on a Plane';

Multi-table: impact of indexes¶

In [17]:
# repeat

# note 2 hash joins where actor and cast_info are joined, followed by a join with movie
In [18]:
%%sql
explain analyze
select *
from actor, cast_info, movie
where actor.id = cast_info.person_id and movie.id = cast_info.movie_id
limit 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
14 rows affected.
Out[18]:
QUERY PLAN
Limit (cost=0.86..9.42 rows=10 width=56) (actual time=0.041..0.119 rows=10 loops=1)
  -> Nested Loop (cost=0.86..1893991.93 rows=2211936 width=56) (actual time=0.040..0.117 rows=10 loops=1)
        -> Nested Loop (cost=0.43..888554.34 rows=2211936 width=26) (actual time=0.030..0.080 rows=10 loops=1)
              -> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.011..0.013 rows=10 loops=1)
              -> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.006..0.006 rows=1 loops=10)
                    Cache Key: cast_info.person_id
                    Cache Mode: logical
                    Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB
                    -> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.006..0.006 rows=1 loops=8)
                          Index Cond: (id = cast_info.person_id)
        -> Index Scan using movie_pkey on movie (cost=0.42..0.45 rows=1 width=30) (actual time=0.003..0.003 rows=1 loops=10)
              Index Cond: (id = cast_info.movie_id)
Planning Time: 0.418 ms
Execution Time: 0.424 ms
In [ ]:
%sql create index actoridindex on actor(id);
In [ ]:
%%sql
explain analyze
select *
from actor, cast_info, movie
where actor.id = cast_info.person_id and movie.id = cast_info.movie_id
limit 10;
In [ ]:
%sql create index movieid_castinfoindex on cast_info(movie_id);
In [ ]:
%%sql
explain analyze
select *
from actor, cast_info, movie
where actor.id = cast_info.person_id and movie.id = cast_info.movie_id
limit 10;
In [20]:
%sql drop index actoridindex;
%sql drop index movieid_castinfoindex;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
RuntimeError: (psycopg2.errors.UndefinedObject) index "actoridindex" does not exist

[SQL: drop index actoridindex;]
(Background on this error at: https://sqlalche.me/e/20/f405)
If you need help solving this issue, send us a message: https://ploomber.io/community
In [ ]: