# Run this cell to set up imports
import numpy as np
import pandas as pd
%reload_ext sql
There's a new jupysql version available (0.10.1), you're running 0.10.0. To upgrade: pip install jupysql --upgrade
%sql postgresql://127.0.0.1:5432/postgres
This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! So if you run the below lines, please also remember to delete the imdb_perf_lecture
afterwards to save space on your limited postgreSQL server.
!unzip -u data/imdb_perf_lecture.zip -d data/
Archive: data/imdb_perf_lecture.zip inflating: data/imdb_perf_lecture.sql
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture'
!psql -h localhost -d imdb_perf_lecture -f data/imdb_perf_lecture.sql
NOTICE: database "imdb_perf_lecture" does not exist, skipping DROP DATABASE CREATE DATABASE SET SET SET SET SET set_config ------------ (1 row) SET SET SET SET SET SET CREATE TABLE psql:data/imdb_perf_lecture.sql:33: ERROR: role "yanlisa" does not exist CREATE TABLE psql:data/imdb_perf_lecture.sql:45: ERROR: role "yanlisa" does not exist CREATE TABLE psql:data/imdb_perf_lecture.sql:59: ERROR: role "yanlisa" does not exist COPY 845888 COPY 2211936 COPY 656453 ALTER TABLE ALTER TABLE ALTER TABLE ALTER TABLE
%reload_ext sql
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture
The meta-command \d <relation>
shows indexes maintained with the <relation>
table.
To run a SQL command, you can also look in the system view pg_indexes
(documentation 54.11):
%%sql
SELECT *
FROM pg_indexes
WHERE schemaname = 'public';
schemaname | tablename | indexname | tablespace | indexdef |
---|---|---|---|---|
public | actor | actor_pkey | None | CREATE UNIQUE INDEX actor_pkey ON public.actor USING btree (id) |
public | movie | movie_pkey | None | CREATE UNIQUE INDEX movie_pkey ON public.movie USING btree (id) |
Read the indexdef
as: the Actor relation has an index named actor_pkey
which is created on the attribute id
. In this case, the attribute id
is also the primary key of the Actor relation, hence why it has an index. More on why primary keys automatically generate indexes in a bit.
EXPLAIN ANALYZE
¶This query seems like it runs pretty quickly:
%%sql
SELECT * FROM Actor WHERE id = 23456;
id | name |
---|---|
23456 | Geraldo Alves |
The PostgreSQL command EXPLAIN ANALYZE
runs the execution plan of a statement and displays actual run time statistics. This is useful to understand what the query is actually doing.
%%sql
EXPLAIN ANALYZE SELECT * FROM Actor WHERE id = 23456;
QUERY PLAN |
---|
Index Scan using actor_pkey on actor (cost=0.42..8.44 rows=1 width=36) (actual time=0.023..0.025 rows=1 loops=1) |
Index Cond: (id = 23456) |
Planning Time: 0.061 ms |
Execution Time: 0.038 ms |
By contrast, the below query on Crew_info
runs quite slowly. Why?
%%sql
EXPLAIN ANALYZE SELECT * FROM Cast_info WHERE person_id = 23456;
QUERY PLAN |
---|
Gather (cost=1000.00..23415.29 rows=11060 width=8) (actual time=128.227..131.074 rows=3 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Parallel Seq Scan on cast_info (cost=0.00..21309.29 rows=4608 width=8) (actual time=49.135..125.255 rows=1 loops=3) |
Filter: (person_id = 23456) |
Rows Removed by Filter: 737311 |
Planning Time: 0.075 ms |
Execution Time: 131.100 ms |
Explanation:
Run \d
on Cast_info
and Actor
. Cast_info
does not have an index on movie_id
!
imdb_perf_lecture=# \d Actor
Table "public.actor"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | not null |
name | text | | |
Indexes:
"actor_pkey" PRIMARY KEY, btree (id)
Referenced by:
TABLE "cast_info" CONSTRAINT "cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actor(id)
imdb_perf_lecture=# \d cast_info
Table "public.cast_info"
Column | Type | Collation | Nullable | Default
-----------+---------+-----------+----------+---------
person_id | integer | | |
movie_id | integer | | |
Foreign-key constraints:
"cast_info_movie_id_fkey" FOREIGN KEY (movie_id) REFERENCES movie(id)
"cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actor(id)
In the Actor table, name
is not a primary key. What kind of scan do you think the following query will produce?
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE name = 'Tom Hanks';
QUERY PLAN |
---|
Gather (cost=1000.00..11054.57 rows=4229 width=36) (actual time=0.363..31.063 rows=1 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Parallel Seq Scan on actor (cost=0.00..9631.67 rows=1762 width=36) (actual time=16.255..25.548 rows=0 loops=3) |
Filter: (name = 'Tom Hanks'::text) |
Rows Removed by Filter: 281962 |
Planning Time: 0.077 ms |
Execution Time: 31.080 ms |
We can manually create an index, even if it's not a primary key. Below, we create a multi-dimensional index just to show you the syntax:
%sql CREATE INDEX nameIdIndex ON Actor(name,id);
This makes our original query much faster:
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE name = 'Tom Hanks';
QUERY PLAN |
---|
Bitmap Heap Scan on actor (cost=113.20..5353.56 rows=4229 width=36) (actual time=0.063..0.064 rows=1 loops=1) |
Recheck Cond: (name = 'Tom Hanks'::text) |
Heap Blocks: exact=1 |
-> Bitmap Index Scan on nameidindex (cost=0.00..112.14 rows=4229 width=0) (actual time=0.049..0.049 rows=1 loops=1) |
Index Cond: (name = 'Tom Hanks'::text) |
Planning Time: 0.220 ms |
Execution Time: 0.085 ms |
Why "Index Only" Scan? Well, SQL correctly identified that there are only two attributes in the Actor table, and both are located in the index. So we just need to search the index; we don't need to additionally fetch any records.
SQL automatically decides whether index scans are worth it. Sometimes, it decides to do a sequential scan instead, or even a bitmap heap scan.
The below exact match lookup produces an Index Scan:
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE id = 23456;
QUERY PLAN |
---|
Index Scan using actor_pkey on actor (cost=0.42..8.44 rows=1 width=36) (actual time=0.014..0.015 rows=1 loops=1) |
Index Cond: (id = 23456) |
Planning Time: 0.057 ms |
Execution Time: 0.028 ms |
This range lookup also produces an Index Scan:
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE 23456 <= id AND id < 23500;
QUERY PLAN |
---|
Bitmap Heap Scan on actor (cost=91.77..5342.70 rows=4229 width=36) (actual time=0.010..0.011 rows=11 loops=1) |
Recheck Cond: ((23456 <= id) AND (id < 23500)) |
Heap Blocks: exact=1 |
-> Bitmap Index Scan on actor_pkey (cost=0.00..90.71 rows=4229 width=0) (actual time=0.007..0.008 rows=11 loops=1) |
Index Cond: ((id >= 23456) AND (id < 23500)) |
Planning Time: 0.090 ms |
Execution Time: 0.028 ms |
However, the below range lookup produces a Sequential scan!
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE id >= 23456;
QUERY PLAN |
---|
Bitmap Heap Scan on actor (cost=5285.64..14036.18 rows=281963 width=36) (actual time=54.562..160.858 rows=838028 loops=1) |
Recheck Cond: (id >= 23456) |
Heap Blocks: exact=5180 |
-> Bitmap Index Scan on actor_pkey (cost=0.00..5215.15 rows=281963 width=0) (actual time=53.896..53.897 rows=838028 loops=1) |
Index Cond: (id >= 23456) |
Planning Time: 0.068 ms |
Execution Time: 184.688 ms |
And this other range lookup produces a Bitmap Heap Scan??
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE id < 23457;
QUERY PLAN |
---|
Bitmap Heap Scan on actor (cost=5285.64..14036.18 rows=281963 width=36) (actual time=0.656..1.815 rows=7861 loops=1) |
Recheck Cond: (id < 23457) |
Heap Blocks: exact=49 |
-> Bitmap Index Scan on actor_pkey (cost=0.00..5215.15 rows=281963 width=0) (actual time=0.635..0.635 rows=7861 loops=1) |
Index Cond: (id < 23457) |
Planning Time: 0.076 ms |
Execution Time: 2.075 ms |
Takeaway:
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE id >= 23456 AND id < 23500;
QUERY PLAN |
---|
Bitmap Heap Scan on actor (cost=91.77..5342.70 rows=4229 width=36) (actual time=0.020..0.021 rows=11 loops=1) |
Recheck Cond: ((id >= 23456) AND (id < 23500)) |
Heap Blocks: exact=1 |
-> Bitmap Index Scan on actor_pkey (cost=0.00..90.71 rows=4229 width=0) (actual time=0.015..0.015 rows=11 loops=1) |
Index Cond: ((id >= 23456) AND (id < 23500)) |
Planning Time: 0.068 ms |
Execution Time: 0.040 ms |
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE id >= 23456 AND id < 23457;
QUERY PLAN |
---|
Bitmap Heap Scan on actor (cost=91.77..5342.70 rows=4229 width=36) (actual time=0.023..0.024 rows=1 loops=1) |
Recheck Cond: ((id >= 23456) AND (id < 23457)) |
Heap Blocks: exact=1 |
-> Bitmap Index Scan on actor_pkey (cost=0.00..90.71 rows=4229 width=0) (actual time=0.017..0.017 rows=1 loops=1) |
Index Cond: ((id >= 23456) AND (id < 23457)) |
Planning Time: 0.099 ms |
Execution Time: 0.050 ms |
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE id >= 23456 OR id < 23457;
QUERY PLAN |
---|
Seq Scan on actor (cost=0.00..17914.32 rows=469938 width=36) (actual time=0.010..79.188 rows=845888 loops=1) |
Filter: ((id >= 23456) OR (id < 23457)) |
Planning Time: 0.062 ms |
Execution Time: 103.337 ms |
We drop the newly created index just to clean things up:
%sql DROP INDEX idNameIndex;
RuntimeError: (psycopg2.errors.UndefinedObject) index "idnameindex" does not exist [SQL: DROP INDEX idNameIndex;] (Background on this error at: https://sqlalche.me/e/20/f405) If you need help solving this issue, send us a message: https://ploomber.io/community
And we close the connection, then drop the database:
%sql --close postgresql://127.0.0.1:5432/imdb_perf_lecture
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'