This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! So if you run the below lines, please also remember to delete the imdb_perf_lecture
afterwards to save space on your limited postgreSQL server.
If you didn't load it in with a previous lecture, load in the imdb_perf_lecture
database.
!unzip -u ../lec07/data/imdb_perf_lecture.zip -d ../lec07/data/
Archive: ../lec07/data/imdb_perf_lecture.zip
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture'
!psql -h localhost -d imdb_perf_lecture -f ../lec07/data/imdb_perf_lecture.sql
DROP DATABASE CREATE DATABASE SET SET SET SET SET set_config ------------ (1 row) SET SET SET SET SET SET CREATE TABLE ALTER TABLE CREATE TABLE ALTER TABLE CREATE TABLE ALTER TABLE COPY 845888 COPY 2211936 COPY 656453 ALTER TABLE ALTER TABLE ALTER TABLE ALTER TABLE
%reload_ext sql
%config SqlMagic.displaylimit = None
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture
%%sql
explain analyze
select *
from actor, cast_info
where actor.id = cast_info.person_id;
QUERY PLAN |
---|
Hash Join (cost=30867.48..92474.12 rows=2212088 width=44) (actual time=188.708..1324.353 rows=2211936 loops=1) |
Hash Cond: (cast_info.person_id = actor.id) |
-> Seq Scan on cast_info (cost=0.00..31908.88 rows=2212088 width=8) (actual time=0.032..162.140 rows=2211936 loops=1) |
-> Hash (cost=13684.88..13684.88 rows=845888 width=36) (actual time=188.530..188.531 rows=845888 loops=1) |
Buckets: 65536 Batches: 32 Memory Usage: 1810kB |
-> Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=36) (actual time=0.012..71.312 rows=845888 loops=1) |
Planning Time: 0.144 ms |
Execution Time: 1411.625 ms |
Note Hash Join being used
%%sql
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id
limit 10 ;
QUERY PLAN |
---|
Limit (cost=0.43..4.45 rows=10 width=18) (actual time=0.040..0.319 rows=10 loops=1) |
-> Nested Loop (cost=0.43..887774.90 rows=2211936 width=18) (actual time=0.040..0.317 rows=10 loops=1) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.014..0.015 rows=10 loops=1) |
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.029..0.030 rows=1 loops=10) |
Cache Key: cast_info.person_id |
Cache Mode: logical |
Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB |
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.035..0.035 rows=1 loops=8) |
Index Cond: (id = cast_info.person_id) |
Planning Time: 0.140 ms |
Execution Time: 0.557 ms |
LIMIT considerably speeds things up by >1000x!
Now, back to the original query...
%%sql
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id;
QUERY PLAN |
---|
Hash Join (cost=29215.48..89168.21 rows=2211936 width=18) (actual time=228.786..1242.297 rows=2211936 loops=1) |
Hash Cond: (cast_info.person_id = actor.id) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.008..206.976 rows=2211936 loops=1) |
-> Hash (cost=13684.88..13684.88 rows=845888 width=18) (actual time=228.625..228.626 rows=845888 loops=1) |
Buckets: 65536 Batches: 16 Memory Usage: 3191kB |
-> Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=18) (actual time=0.048..84.107 rows=845888 loops=1) |
Planning Time: 0.160 ms |
Execution Time: 1331.759 ms |
Let's add in a WHERE clause.
%%sql
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id and actor.id > 4000000;
QUERY PLAN |
---|
Hash Join (cost=23966.84..81569.56 rows=1163170 width=18) (actual time=158.393..1003.409 rows=634763 loops=1) |
Hash Cond: (cast_info.person_id = actor.id) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.008..205.828 rows=2211936 loops=1) |
-> Hash (cost=15799.60..15799.60 rows=444819 width=18) (actual time=157.922..157.924 rows=444781 loops=1) |
Buckets: 65536 Batches: 8 Memory Usage: 3335kB |
-> Seq Scan on actor (cost=0.00..15799.60 rows=444819 width=18) (actual time=0.209..80.203 rows=444781 loops=1) |
Filter: (id > 4000000) |
Rows Removed by Filter: 401107 |
Planning Time: 0.159 ms |
Execution Time: 1028.796 ms |
Notice projection pushed down at source! Time reduced by a bit as a result. Otherwise cost would be at least as expensive as previous query.
Repeat previous LIMIT query
%%sql
-- repeat previous query
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id
limit 10 ;
QUERY PLAN |
---|
Limit (cost=0.43..4.45 rows=10 width=18) (actual time=0.023..0.058 rows=10 loops=1) |
-> Nested Loop (cost=0.43..887774.90 rows=2211936 width=18) (actual time=0.022..0.056 rows=10 loops=1) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.008..0.009 rows=10 loops=1) |
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.004..0.004 rows=1 loops=10) |
Cache Key: cast_info.person_id |
Cache Mode: logical |
Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB |
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.004..0.004 rows=1 loops=8) |
Index Cond: (id = cast_info.person_id) |
Planning Time: 0.166 ms |
Execution Time: 0.313 ms |
Notice width of Index Scan on actor: 18. This is the size of the projected tuple.
What if we add another actor attribute to SELECT? Would this change the width during Index Scan?
%%sql
explain analyze
select actor.name,movie_id,actor.id
from actor, cast_info
where actor.id = cast_info.person_id
limit 10 ;
QUERY PLAN |
---|
Limit (cost=0.43..4.45 rows=10 width=22) (actual time=0.027..0.066 rows=10 loops=1) |
-> Nested Loop (cost=0.43..887774.90 rows=2211936 width=22) (actual time=0.027..0.064 rows=10 loops=1) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.010..0.011 rows=10 loops=1) |
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.005..0.005 rows=1 loops=10) |
Cache Key: cast_info.person_id |
Cache Mode: logical |
Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB |
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.005..0.005 rows=1 loops=8) |
Index Cond: (id = cast_info.person_id) |
Planning Time: 0.152 ms |
Execution Time: 0.370 ms |
Width stays the same! This is because the newly added actor.id
attribute in the SELECT
is needed as part of the join condition, so it is already included (and not projected out).
But note the final width after join is now 22 instead of 18.
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actor, cast_info, movie
WHERE actor.id = cast_info.person_id
AND movie.id = cast_info.movie_id;
QUERY PLAN |
---|
Hash Join (cost=53334.67..153823.76 rows=2211936 width=56) (actual time=333.075..2550.472 rows=2211936 loops=1) |
Hash Cond: (cast_info.movie_id = movie.id) |
-> Hash Join (cost=29215.48..89168.21 rows=2211936 width=26) (actual time=176.047..1296.051 rows=2211936 loops=1) |
Hash Cond: (cast_info.person_id = actor.id) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.007..146.379 rows=2211936 loops=1) |
-> Hash (cost=13684.88..13684.88 rows=845888 width=18) (actual time=175.614..175.616 rows=845888 loops=1) |
Buckets: 65536 Batches: 16 Memory Usage: 3114kB |
-> Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=18) (actual time=0.022..60.767 rows=845888 loops=1) |
-> Hash (cost=11425.53..11425.53 rows=656453 width=30) (actual time=156.876..156.877 rows=656453 loops=1) |
Buckets: 65536 Batches: 16 Memory Usage: 3006kB |
-> Seq Scan on movie (cost=0.00..11425.53 rows=656453 width=30) (actual time=0.017..50.323 rows=656453 loops=1) |
Planning Time: 0.419 ms |
JIT: |
Functions: 17 |
Options: Inlining false, Optimization false, Expressions true, Deforming true |
Timing: Generation 1.061 ms, Inlining 0.000 ms, Optimization 0.479 ms, Emission 8.076 ms, Total 9.616 ms |
Execution Time: 2666.455 ms |
Let's add in a LIMIT
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actor, cast_info, movie
WHERE actor.id = cast_info.person_id
AND movie.id = cast_info.movie_id
LIMIT 10;
QUERY PLAN |
---|
Limit (cost=0.86..9.42 rows=10 width=56) (actual time=0.032..0.082 rows=10 loops=1) |
-> Nested Loop (cost=0.86..1893212.49 rows=2211936 width=56) (actual time=0.031..0.080 rows=10 loops=1) |
-> Nested Loop (cost=0.43..887774.90 rows=2211936 width=26) (actual time=0.024..0.058 rows=10 loops=1) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.009..0.010 rows=10 loops=1) |
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.004..0.004 rows=1 loops=10) |
Cache Key: cast_info.person_id |
Cache Mode: logical |
Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB |
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.004..0.004 rows=1 loops=8) |
Index Cond: (id = cast_info.person_id) |
-> Index Scan using movie_pkey on movie (cost=0.42..0.45 rows=1 width=30) (actual time=0.002..0.002 rows=1 loops=10) |
Index Cond: (id = cast_info.movie_id) |
Planning Time: 0.264 ms |
Execution Time: 1.109 ms |
Note use of Nested Loops + Indexes (on Primary Key attributes) to speed up the query.
There are no indexes on castinfo. What happens when we create one?
%sql CREATE INDEX movieid_castinfoindex ON cast_info(movie_id);
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actor, cast_info, movie
WHERE actor.id = cast_info.person_id
AND movie.id = cast_info.movie_id
LIMIT 10;
QUERY PLAN |
---|
Limit (cost=0.86..6.71 rows=10 width=56) (actual time=0.024..0.087 rows=10 loops=1) |
-> Nested Loop (cost=0.86..1292202.15 rows=2211936 width=56) (actual time=0.023..0.085 rows=10 loops=1) |
-> Nested Loop (cost=0.43..436334.61 rows=2211936 width=38) (actual time=0.015..0.029 rows=10 loops=1) |
-> Seq Scan on movie (cost=0.00..11425.53 rows=656453 width=30) (actual time=0.006..0.007 rows=3 loops=1) |
-> Index Scan using movieid_castinfoindex on cast_info (cost=0.43..0.60 rows=5 width=8) (actual time=0.003..0.005 rows=3 loops=3) |
Index Cond: (movie_id = movie.id) |
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.005..0.005 rows=1 loops=10) |
Cache Key: cast_info.person_id |
Cache Mode: logical |
Hits: 0 Misses: 10 Evictions: 0 Overflows: 0 Memory Usage: 2kB |
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.005..0.005 rows=1 loops=10) |
Index Cond: (id = cast_info.person_id) |
Planning Time: 0.339 ms |
Execution Time: 0.346 ms |
Notice the join order is now different from before! movie->castinfo->actor as opposed to castinfo->actor->movie
%sql drop index movieid_castinfoindex;