lec10

Lecture 10: Query Optimization II¶

New IMDB Performance database¶

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! So if you run the below lines, please also remember to delete the imdb_perf_lecture afterwards to save space on your limited postgreSQL server.

If you didn't load it in with a previous lecture, load in the imdb_perf_lecture database.

In [1]:

!unzip -u ../lec07/data/imdb_perf_lecture.zip -d ../lec07/data/

Archive:  ../lec07/data/imdb_perf_lecture.zip

In [2]:

!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f ../lec07/data/imdb_perf_lecture.sql

DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 845888
COPY 2211936
COPY 656453
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE

In [4]:

%reload_ext sql

In [5]:

%config SqlMagic.displaylimit = None

displaylimit: Value None will be treated as 0 (no limit)

In [6]:

%sql postgresql://127.0.0.1:5432/imdb_perf_lecture

Demo¶

In [8]:

%%sql
explain analyze
select *
from actor, cast_info
where actor.id = cast_info.person_id;

Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'

8 rows affected.

Out[8]:

QUERY PLAN
Hash Join (cost=30867.48..92474.12 rows=2212088 width=44) (actual time=188.708..1324.353 rows=2211936 loops=1)
Hash Cond: (cast_info.person_id = actor.id)
-> Seq Scan on cast_info (cost=0.00..31908.88 rows=2212088 width=8) (actual time=0.032..162.140 rows=2211936 loops=1)
-> Hash (cost=13684.88..13684.88 rows=845888 width=36) (actual time=188.530..188.531 rows=845888 loops=1)
Buckets: 65536 Batches: 32 Memory Usage: 1810kB
-> Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=36) (actual time=0.012..71.312 rows=845888 loops=1)
Planning Time: 0.144 ms
Execution Time: 1411.625 ms

Note Hash Join being used

In [10]:

%%sql
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id
limit 10 ;

Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'

11 rows affected.

Out[10]:

QUERY PLAN
Limit (cost=0.43..4.45 rows=10 width=18) (actual time=0.040..0.319 rows=10 loops=1)
-> Nested Loop (cost=0.43..887774.90 rows=2211936 width=18) (actual time=0.040..0.317 rows=10 loops=1)
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.014..0.015 rows=10 loops=1)
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.029..0.030 rows=1 loops=10)
Cache Key: cast_info.person_id
Cache Mode: logical
Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.035..0.035 rows=1 loops=8)
Index Cond: (id = cast_info.person_id)
Planning Time: 0.140 ms
Execution Time: 0.557 ms

LIMIT considerably speeds things up by >1000x!

Now, back to the original query...

In [12]:

%%sql
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id;

Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'

8 rows affected.

Out[12]:

QUERY PLAN
Hash Join (cost=29215.48..89168.21 rows=2211936 width=18) (actual time=228.786..1242.297 rows=2211936 loops=1)
Hash Cond: (cast_info.person_id = actor.id)
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.008..206.976 rows=2211936 loops=1)
-> Hash (cost=13684.88..13684.88 rows=845888 width=18) (actual time=228.625..228.626 rows=845888 loops=1)
Buckets: 65536 Batches: 16 Memory Usage: 3191kB
-> Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=18) (actual time=0.048..84.107 rows=845888 loops=1)
Planning Time: 0.160 ms
Execution Time: 1331.759 ms

Let's add in a WHERE clause.

In [13]:

%%sql
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id and actor.id > 4000000;

Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'

10 rows affected.

Out[13]:

QUERY PLAN
Hash Join (cost=23966.84..81569.56 rows=1163170 width=18) (actual time=158.393..1003.409 rows=634763 loops=1)
Hash Cond: (cast_info.person_id = actor.id)
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.008..205.828 rows=2211936 loops=1)
-> Hash (cost=15799.60..15799.60 rows=444819 width=18) (actual time=157.922..157.924 rows=444781 loops=1)
Buckets: 65536 Batches: 8 Memory Usage: 3335kB
-> Seq Scan on actor (cost=0.00..15799.60 rows=444819 width=18) (actual time=0.209..80.203 rows=444781 loops=1)
Filter: (id > 4000000)
Rows Removed by Filter: 401107
Planning Time: 0.159 ms
Execution Time: 1028.796 ms

Notice projection pushed down at source! Time reduced by a bit as a result. Otherwise cost would be at least as expensive as previous query.

More Exercises (Not covered in class)¶

Repeat previous LIMIT query

In [14]:

%%sql
-- repeat previous query
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id
limit 10 ;

Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'

11 rows affected.

Out[14]:

QUERY PLAN
Limit (cost=0.43..4.45 rows=10 width=18) (actual time=0.023..0.058 rows=10 loops=1)
-> Nested Loop (cost=0.43..887774.90 rows=2211936 width=18) (actual time=0.022..0.056 rows=10 loops=1)
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.008..0.009 rows=10 loops=1)
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.004..0.004 rows=1 loops=10)
Cache Key: cast_info.person_id
Cache Mode: logical
Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.004..0.004 rows=1 loops=8)
Index Cond: (id = cast_info.person_id)
Planning Time: 0.166 ms
Execution Time: 0.313 ms

Notice width of Index Scan on actor: 18. This is the size of the projected tuple.

What if we add another actor attribute to SELECT? Would this change the width during Index Scan?

In [15]:

%%sql
explain analyze
select actor.name,movie_id,actor.id
from actor, cast_info
where actor.id = cast_info.person_id
limit 10 ;

Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'

11 rows affected.

Out[15]:

QUERY PLAN
Limit (cost=0.43..4.45 rows=10 width=22) (actual time=0.027..0.066 rows=10 loops=1)
-> Nested Loop (cost=0.43..887774.90 rows=2211936 width=22) (actual time=0.027..0.064 rows=10 loops=1)
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.010..0.011 rows=10 loops=1)
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.005..0.005 rows=1 loops=10)
Cache Key: cast_info.person_id
Cache Mode: logical
Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.005..0.005 rows=1 loops=8)
Index Cond: (id = cast_info.person_id)
Planning Time: 0.152 ms
Execution Time: 0.370 ms

Width stays the same! This is because the newly added actor.id attribute in the SELECT is needed as part of the join condition, so it is already included (and not projected out).

But note the final width after join is now 22 instead of 18.

More In-Class Demo (Three Tables)¶

In [17]:

%%sql
EXPLAIN ANALYZE
SELECT *
FROM actor, cast_info, movie
WHERE actor.id = cast_info.person_id
  AND movie.id = cast_info.movie_id;

Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'

17 rows affected.

Out[17]:

QUERY PLAN
Hash Join (cost=53334.67..153823.76 rows=2211936 width=56) (actual time=333.075..2550.472 rows=2211936 loops=1)
Hash Cond: (cast_info.movie_id = movie.id)
-> Hash Join (cost=29215.48..89168.21 rows=2211936 width=26) (actual time=176.047..1296.051 rows=2211936 loops=1)
Hash Cond: (cast_info.person_id = actor.id)
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.007..146.379 rows=2211936 loops=1)
-> Hash (cost=13684.88..13684.88 rows=845888 width=18) (actual time=175.614..175.616 rows=845888 loops=1)
Buckets: 65536 Batches: 16 Memory Usage: 3114kB
-> Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=18) (actual time=0.022..60.767 rows=845888 loops=1)
-> Hash (cost=11425.53..11425.53 rows=656453 width=30) (actual time=156.876..156.877 rows=656453 loops=1)
Buckets: 65536 Batches: 16 Memory Usage: 3006kB
-> Seq Scan on movie (cost=0.00..11425.53 rows=656453 width=30) (actual time=0.017..50.323 rows=656453 loops=1)
Planning Time: 0.419 ms
JIT:
Functions: 17
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 1.061 ms, Inlining 0.000 ms, Optimization 0.479 ms, Emission 8.076 ms, Total 9.616 ms
Execution Time: 2666.455 ms

Let's add in a LIMIT

In [18]:

%%sql
EXPLAIN ANALYZE
SELECT *
FROM actor, cast_info, movie
WHERE actor.id = cast_info.person_id
  AND movie.id = cast_info.movie_id
LIMIT 10;

Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'

14 rows affected.

Out[18]:

QUERY PLAN
Limit (cost=0.86..9.42 rows=10 width=56) (actual time=0.032..0.082 rows=10 loops=1)
-> Nested Loop (cost=0.86..1893212.49 rows=2211936 width=56) (actual time=0.031..0.080 rows=10 loops=1)
-> Nested Loop (cost=0.43..887774.90 rows=2211936 width=26) (actual time=0.024..0.058 rows=10 loops=1)
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.009..0.010 rows=10 loops=1)
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.004..0.004 rows=1 loops=10)
Cache Key: cast_info.person_id
Cache Mode: logical
Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.004..0.004 rows=1 loops=8)
Index Cond: (id = cast_info.person_id)
-> Index Scan using movie_pkey on movie (cost=0.42..0.45 rows=1 width=30) (actual time=0.002..0.002 rows=1 loops=10)
Index Cond: (id = cast_info.movie_id)
Planning Time: 0.264 ms
Execution Time: 1.109 ms

Note use of Nested Loops + Indexes (on Primary Key attributes) to speed up the query.

There are no indexes on castinfo. What happens when we create one?

In [19]:

%sql CREATE INDEX movieid_castinfoindex ON cast_info(movie_id);

Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'

Out[19]:

In [21]:

%%sql
EXPLAIN ANALYZE
SELECT *
FROM actor, cast_info, movie
WHERE actor.id = cast_info.person_id
  AND movie.id = cast_info.movie_id
LIMIT 10;

Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'

14 rows affected.

Out[21]:

QUERY PLAN
Limit (cost=0.86..6.71 rows=10 width=56) (actual time=0.024..0.087 rows=10 loops=1)
-> Nested Loop (cost=0.86..1292202.15 rows=2211936 width=56) (actual time=0.023..0.085 rows=10 loops=1)
-> Nested Loop (cost=0.43..436334.61 rows=2211936 width=38) (actual time=0.015..0.029 rows=10 loops=1)
-> Seq Scan on movie (cost=0.00..11425.53 rows=656453 width=30) (actual time=0.006..0.007 rows=3 loops=1)
-> Index Scan using movieid_castinfoindex on cast_info (cost=0.43..0.60 rows=5 width=8) (actual time=0.003..0.005 rows=3 loops=3)
Index Cond: (movie_id = movie.id)
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.005..0.005 rows=1 loops=10)
Cache Key: cast_info.person_id
Cache Mode: logical
Hits: 0 Misses: 10 Evictions: 0 Overflows: 0 Memory Usage: 2kB
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.005..0.005 rows=1 loops=10)
Index Cond: (id = cast_info.person_id)
Planning Time: 0.339 ms
Execution Time: 0.346 ms

Notice the join order is now different from before! movie->castinfo->actor as opposed to castinfo->actor->movie

In [23]:

%sql drop index movieid_castinfoindex;

Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'

Out[23]: