This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! So if you run the below lines, please also remember to delete the imdb_perf_lecture
afterwards to save space on your limited postgreSQL server.
If you didn't load it in with a previous lecture, load in the imdb_perf_lecture
database.
!unzip -u ../lec07/data/imdb_perf_lecture.zip -d ../lec07/data/
Archive: ../lec07/data/imdb_perf_lecture.zip
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture'
!psql -h localhost -d imdb_perf_lecture -f ../lec07/data/imdb_perf_lecture.sql
DROP DATABASE CREATE DATABASE SET SET SET SET SET set_config ------------ (1 row) SET SET SET SET SET SET CREATE TABLE ALTER TABLE CREATE TABLE ALTER TABLE CREATE TABLE ALTER TABLE COPY 845888 COPY 2211936 COPY 656453 ALTER TABLE ALTER TABLE ALTER TABLE ALTER TABLE
%reload_ext sql
There's a new jupysql version available (0.10.10), you're running 0.10.0. To upgrade: pip install jupysql --upgrade Deploy FastAPI apps for free on Ploomber Cloud! Learn more: https://ploomber.io/s/signup
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture
Example 1
%sql EXPLAIN ANALYZE SELECT * FROM Actor;
QUERY PLAN |
---|
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=18) (actual time=0.038..54.412 rows=845888 loops=1) |
Planning Time: 0.041 ms |
Execution Time: 78.266 ms |
Notice ANALYZE
:
Notice EXPLAIN
:
Example 2: just planning (no execution)
%sql EXPLAIN SELECT * FROM Actor;
QUERY PLAN |
---|
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=18) |
Example 3
%sql EXPLAIN ANALYZE SELECT id FROM Actor;
QUERY PLAN |
---|
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=4) (actual time=0.041..74.812 rows=845888 loops=1) |
Planning Time: 0.040 ms |
Execution Time: 98.669 ms |
Notice:
Example 4:
%sql EXPLAIN ANALYZE SELECT id FROM Actor WHERE id > 4000000;
QUERY PLAN |
---|
Index Only Scan using actor_pkey on actor (cost=0.42..12630.93 rows=443343 width=4) (actual time=0.067..53.516 rows=444781 loops=1) |
Index Cond: (id > 4000000) |
Heap Fetches: 0 |
Planning Time: 0.104 ms |
Execution Time: 66.626 ms |
Notice:
Example 5:
%sql EXPLAIN ANALYZE SELECT id, name FROM Actor WHERE id > 4000000;
QUERY PLAN |
---|
Seq Scan on actor (cost=0.00..15799.60 rows=443343 width=18) (actual time=0.252..62.556 rows=444781 loops=1) |
Filter: (id > 4000000) |
Rows Removed by Filter: 401107 |
Planning Time: 0.064 ms |
Execution Time: 75.123 ms |
Notice:
%%sql
EXPLAIN ANALYZE -- 1
SELECT id
FROM Actor
WHERE id > 4000000 AND name='Tom Hanks';
QUERY PLAN |
---|
Gather (cost=1000.00..11512.90 rows=1 width=4) (actual time=23.507..25.321 rows=0 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Parallel Seq Scan on actor (cost=0.00..10512.80 rows=1 width=4) (actual time=20.963..20.964 rows=0 loops=3) |
Filter: ((id > 4000000) AND (name = 'Tom Hanks'::text)) |
Rows Removed by Filter: 281963 |
Planning Time: 0.093 ms |
Execution Time: 25.337 ms |
Query 1: Compared to previous query where there was no name condition in WHERE clause, this query uses a parallel sequential scan: likely because the predicate is a bit more expensive to check, and so parallelism in processing the tuples is helpful.
Note also just one output tuple predicted, but actually no output tuples produced!
%%sql
EXPLAIN ANALYZE -- 2
SELECT id
FROM Actor
WHERE id < 4000000 AND name='Tom Hanks';
QUERY PLAN |
---|
Gather (cost=1000.00..11512.90 rows=1 width=4) (actual time=0.304..25.344 rows=1 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Parallel Seq Scan on actor (cost=0.00..10512.80 rows=1 width=4) (actual time=13.499..21.214 rows=0 loops=3) |
Filter: ((id < 4000000) AND (name = 'Tom Hanks'::text)) |
Rows Removed by Filter: 281962 |
Planning Time: 0.081 ms |
Execution Time: 25.366 ms |
Query 2: Compared to previous, we flipped the id condition, and now we've found a match for Tom Hanks! Otherwise quite similar. Note that we still need to go through all the tuples.
%%sql
EXPLAIN ANALYZE -- 3
SELECT id
FROM Actor;
QUERY PLAN |
---|
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=4) (actual time=0.043..72.572 rows=845888 loops=1) |
Planning Time: 0.039 ms |
Execution Time: 96.430 ms |
Query 3: Resetting to an old query for a second without conditions...
%%sql
EXPLAIN ANALYZE -- 4
SELECT id
FROM Actor
LIMIT 10;
QUERY PLAN |
---|
Limit (cost=0.00..0.16 rows=10 width=4) (actual time=0.042..0.045 rows=10 loops=1) |
-> Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=4) (actual time=0.041..0.042 rows=10 loops=1) |
Planning Time: 0.073 ms |
Execution Time: 0.059 ms |
Query 4: by adding a LIMIT, see significantly reduced execution time! Also a limit node above the sequential scan node.
Next, we just change a SQLmagic setting to allow for a longer query plan to be displayed...
%config SqlMagic.displaylimit = None
%%sql
EXPLAIN ANALYZE -- 5
SELECT id
FROM Actor
ORDER BY name
LIMIT 10;
QUERY PLAN |
---|
Limit (cost=17366.94..17368.11 rows=10 width=18) (actual time=66.592..68.584 rows=10 loops=1) |
-> Gather Merge (cost=17366.94..99611.72 rows=704906 width=18) (actual time=66.591..68.581 rows=10 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Sort (cost=16366.92..17248.05 rows=352453 width=18) (actual time=64.110..64.112 rows=7 loops=3) |
Sort Key: name |
Sort Method: top-N heapsort Memory: 26kB |
Worker 0: Sort Method: top-N heapsort Memory: 25kB |
Worker 1: Sort Method: top-N heapsort Memory: 25kB |
-> Parallel Seq Scan on actor (cost=0.00..8750.53 rows=352453 width=18) (actual time=0.021..19.438 rows=281963 loops=3) |
Planning Time: 0.081 ms |
Execution Time: 68.617 ms |
Query 5: Whoa! Sorts are expensive! Even though we had a LIMIT, the ORDER BY considerably increased costs.