This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! So if you run the below lines, please also remember to delete the imdb_perf_lecture
afterwards to save space on your limited postgreSQL server.
If you didn't load it in with a previous lecture, load in the imdb_perf_lecture
database.
!unzip -u ../lec07/data/imdb_perf_lecture.zip -d ../lec07/data/
Archive: ../lec07/data/imdb_perf_lecture.zip
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture'
!psql -h localhost -d imdb_perf_lecture -f ../lec07/data/imdb_perf_lecture.sql
DROP DATABASE CREATE DATABASE SET SET SET SET SET set_config ------------ (1 row) SET SET SET SET SET SET CREATE TABLE psql:../lec07/data/imdb_perf_lecture.sql:33: ERROR: role "yanlisa" does not exist CREATE TABLE psql:../lec07/data/imdb_perf_lecture.sql:45: ERROR: role "yanlisa" does not exist CREATE TABLE psql:../lec07/data/imdb_perf_lecture.sql:59: ERROR: role "yanlisa" does not exist COPY 845888 COPY 2211936 COPY 656453 ALTER TABLE ALTER TABLE ALTER TABLE ALTER TABLE
%reload_ext sql
There's a new jupysql version available (0.10.1), you're running 0.10.0. To upgrade: pip install jupysql --upgrade
%config SqlMagic.displaylimit = None
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture
%%sql
explain analyze
select *
from actor, cast_info
where actor.id = cast_info.person_id;
QUERY PLAN |
---|
Hash Join (cost=29215.48..89168.21 rows=2211936 width=26) (actual time=1204.592..2899.130 rows=2211936 loops=1) |
Hash Cond: (cast_info.person_id = actor.id) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=32.664..576.889 rows=2211936 loops=1) |
-> Hash (cost=13684.88..13684.88 rows=845888 width=18) (actual time=1171.392..1171.394 rows=845888 loops=1) |
Buckets: 65536 Batches: 16 Memory Usage: 3114kB |
-> Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=18) (actual time=21.178..1044.234 rows=845888 loops=1) |
Planning Time: 219.930 ms |
Execution Time: 2965.872 ms |
note hash join
%%sql
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id
limit 10 ;
QUERY PLAN |
---|
Limit (cost=0.43..4.45 rows=10 width=18) (actual time=0.038..0.096 rows=10 loops=1) |
-> Nested Loop (cost=0.43..888554.34 rows=2211936 width=18) (actual time=0.036..0.093 rows=10 loops=1) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.014..0.016 rows=10 loops=1) |
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.007..0.007 rows=1 loops=10) |
Cache Key: cast_info.person_id |
Cache Mode: logical |
Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB |
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.007..0.007 rows=1 loops=8) |
Index Cond: (id = cast_info.person_id) |
Planning Time: 0.241 ms |
Execution Time: 0.470 ms |
notice reduced time by 6000x!
%%sql
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id;
QUERY PLAN |
---|
Hash Join (cost=29215.48..89168.21 rows=2211936 width=18) (actual time=287.396..1435.852 rows=2211936 loops=1) |
Hash Cond: (cast_info.person_id = actor.id) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.013..223.331 rows=2211936 loops=1) |
-> Hash (cost=13684.88..13684.88 rows=845888 width=18) (actual time=287.183..287.185 rows=845888 loops=1) |
Buckets: 65536 Batches: 16 Memory Usage: 3191kB |
-> Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=18) (actual time=0.067..108.863 rows=845888 loops=1) |
Planning Time: 0.212 ms |
Execution Time: 1502.998 ms |
notice not as substantial a reduction but still 2x.
%%sql
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id and actor.id > 4000000;
QUERY PLAN |
---|
Hash Join (cost=23971.10..81574.83 rows=1163852 width=18) (actual time=173.505..1135.131 rows=634763 loops=1) |
Hash Cond: (cast_info.person_id = actor.id) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.008..210.031 rows=2211936 loops=1) |
-> Hash (cost=15799.60..15799.60 rows=445080 width=18) (actual time=172.984..172.985 rows=444781 loops=1) |
Buckets: 65536 Batches: 8 Memory Usage: 3335kB |
-> Seq Scan on actor (cost=0.00..15799.60 rows=445080 width=18) (actual time=0.293..90.924 rows=444781 loops=1) |
Filter: (id > 4000000) |
Rows Removed by Filter: 401107 |
Planning Time: 0.163 ms |
Execution Time: 1155.587 ms |
notice that projection was pushed down below the join “at source”. If we waited until join was done, would be at least as exp
%%sql
-- repeat previous query
explain analyze
select actor.name,movie_id
from actor, cast_info
where actor.id = cast_info.person_id
limit 10 ;
QUERY PLAN |
---|
Limit (cost=0.43..4.45 rows=10 width=18) (actual time=0.023..0.061 rows=10 loops=1) |
-> Nested Loop (cost=0.43..888554.34 rows=2211936 width=18) (actual time=0.023..0.059 rows=10 loops=1) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.008..0.009 rows=10 loops=1) |
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.005..0.005 rows=1 loops=10) |
Cache Key: cast_info.person_id |
Cache Mode: logical |
Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB |
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.005..0.005 rows=1 loops=8) |
Index Cond: (id = cast_info.person_id) |
Planning Time: 0.234 ms |
Execution Time: 0.367 ms |
%%sql
explain analyze
select actor.name,movie_id,actor.id
from actor, cast_info
where actor.id = cast_info.person_id
limit 10 ;
QUERY PLAN |
---|
Limit (cost=0.43..4.45 rows=10 width=22) (actual time=0.075..0.137 rows=10 loops=1) |
-> Nested Loop (cost=0.43..888554.34 rows=2211936 width=22) (actual time=0.074..0.134 rows=10 loops=1) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.011..0.013 rows=10 loops=1) |
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.011..0.011 rows=1 loops=10) |
Cache Key: cast_info.person_id |
Cache Mode: logical |
Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB |
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.007..0.007 rows=1 loops=8) |
Index Cond: (id = cast_info.person_id) |
Planning Time: 0.265 ms |
Execution Time: 0.445 ms |
^ width of the seq scan is same as in the previous query, while the width after the hash join is smaller in the former – this is because actor.id is already extracted for the predicate
%%sql
explain analyze
select *
from actor, cast_info, movie
where actor.id = cast_info.person_id and movie.id = cast_info.movie_id
limit 10;
QUERY PLAN |
---|
Limit (cost=0.86..9.42 rows=10 width=56) (actual time=71.868..71.960 rows=10 loops=1) |
-> Nested Loop (cost=0.86..1893991.93 rows=2211936 width=56) (actual time=71.867..71.957 rows=10 loops=1) |
-> Nested Loop (cost=0.43..888554.34 rows=2211936 width=26) (actual time=0.021..0.090 rows=10 loops=1) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.007..0.009 rows=10 loops=1) |
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.007..0.007 rows=1 loops=10) |
Cache Key: cast_info.person_id |
Cache Mode: logical |
Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB |
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.006..0.007 rows=1 loops=8) |
Index Cond: (id = cast_info.person_id) |
-> Index Scan using movie_pkey on movie (cost=0.42..0.45 rows=1 width=30) (actual time=7.186..7.186 rows=1 loops=10) |
Index Cond: (id = cast_info.movie_id) |
Planning Time: 0.300 ms |
Execution Time: 72.262 ms |
# ^ note 2 hash joins where actor and cast_info are joined, followed by a join with movie
%%sql
explain analyze
select *
from actor, cast_info, movie
where actor.id = cast_info.person_id and movie.id = cast_info.movie_id and name='Hanks, Tom' ;
QUERY PLAN |
---|
Gather (cost=10632.10..32056.15 rows=3 width=56) (actual time=60.662..62.390 rows=0 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Nested Loop (cost=9632.10..31055.85 rows=1 width=56) (actual time=25.098..25.102 rows=0 loops=3) |
-> Parallel Hash Join (cost=9631.68..31055.40 rows=1 width=26) (actual time=25.097..25.100 rows=0 loops=3) |
Hash Cond: (cast_info.person_id = actor.id) |
-> Parallel Seq Scan on cast_info (cost=0.00..19004.40 rows=921640 width=8) (never executed) |
-> Parallel Hash (cost=9631.67..9631.67 rows=1 width=18) (actual time=24.962..24.963 rows=0 loops=3) |
Buckets: 1024 Batches: 1 Memory Usage: 0kB |
-> Parallel Seq Scan on actor (cost=0.00..9631.67 rows=1 width=18) (actual time=24.891..24.891 rows=0 loops=3) |
Filter: (name = 'Hanks, Tom'::text) |
Rows Removed by Filter: 281963 |
-> Index Scan using movie_pkey on movie (cost=0.42..0.45 rows=1 width=30) (never executed) |
Index Cond: (id = cast_info.movie_id) |
Planning Time: 138.613 ms |
Execution Time: 62.427 ms |
# ^ got rid of limit; same as before, except notice the push down of seq scan
%%sql explain analyze
select *
from actor, cast_info, movie
where actor.id = cast_info.person_id and movie.id = cast_info.movie_id and title='Snakes on a Plane';
# repeat
# note 2 hash joins where actor and cast_info are joined, followed by a join with movie
%%sql
explain analyze
select *
from actor, cast_info, movie
where actor.id = cast_info.person_id and movie.id = cast_info.movie_id
limit 10;
QUERY PLAN |
---|
Limit (cost=0.86..9.42 rows=10 width=56) (actual time=0.041..0.119 rows=10 loops=1) |
-> Nested Loop (cost=0.86..1893991.93 rows=2211936 width=56) (actual time=0.040..0.117 rows=10 loops=1) |
-> Nested Loop (cost=0.43..888554.34 rows=2211936 width=26) (actual time=0.030..0.080 rows=10 loops=1) |
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.011..0.013 rows=10 loops=1) |
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.006..0.006 rows=1 loops=10) |
Cache Key: cast_info.person_id |
Cache Mode: logical |
Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB |
-> Index Scan using actor_pkey on actor (cost=0.42..0.46 rows=1 width=18) (actual time=0.006..0.006 rows=1 loops=8) |
Index Cond: (id = cast_info.person_id) |
-> Index Scan using movie_pkey on movie (cost=0.42..0.45 rows=1 width=30) (actual time=0.003..0.003 rows=1 loops=10) |
Index Cond: (id = cast_info.movie_id) |
Planning Time: 0.418 ms |
Execution Time: 0.424 ms |
%sql create index actoridindex on actor(id);
%%sql
explain analyze
select *
from actor, cast_info, movie
where actor.id = cast_info.person_id and movie.id = cast_info.movie_id
limit 10;
%sql create index movieid_castinfoindex on cast_info(movie_id);
%%sql
explain analyze
select *
from actor, cast_info, movie
where actor.id = cast_info.person_id and movie.id = cast_info.movie_id
limit 10;
%sql drop index actoridindex;
%sql drop index movieid_castinfoindex;
RuntimeError: (psycopg2.errors.UndefinedObject) index "actoridindex" does not exist [SQL: drop index actoridindex;] (Background on this error at: https://sqlalche.me/e/20/f405) If you need help solving this issue, send us a message: https://ploomber.io/community