Lecture 06¶
# Run this cell to set up imports
import numpy as np
import pandas as pd
New IMDB Performance database¶
This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! So if you run the below lines, please also remember to delete the imdb_perf_lecture
afterwards to save space on your limited postgreSQL server.
!unzip -u data/imdb_perf_lecture.zip -d data/
Archive: data/imdb_perf_lecture.zip
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture'
!psql -h localhost -d imdb_perf_lecture -f data/imdb_perf_lecture.sql
DROP DATABASE CREATE DATABASE SET SET SET SET SET set_config ------------ (1 row) SET SET SET SET SET SET CREATE TABLE ALTER TABLE CREATE TABLE ALTER TABLE CREATE TABLE ALTER TABLE COPY 845888 COPY 2211936 COPY 656453 ALTER TABLE ALTER TABLE ALTER TABLE ALTER TABLE
Display indexes¶
%reload_ext sql
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture
The meta-command \d <relation>
shows indexes maintained with the <relation>
table.
To run a SQL command, you can also look in the system view pg_indexes
(documentation 54.11):
%%sql
SELECT *
FROM pg_indexes
WHERE schemaname = 'public';
schemaname | tablename | indexname | tablespace | indexdef |
---|---|---|---|---|
public | actors | actor_pkey | None | CREATE UNIQUE INDEX actor_pkey ON public.actors USING btree (id) |
public | movies | movie_pkey | None | CREATE UNIQUE INDEX movie_pkey ON public.movies USING btree (id) |
Read the indexdef
as: the Actor relation has an index named actor_pkey
which is created on the attribute id
. In this case, the attribute id
is also the primary key of the Actor relation, hence why it has an index. More on why primary keys automatically generate indexes in a bit.
EXPLAIN ANALYZE
¶
This query seems like it runs pretty quickly:
%%sql
SELECT * FROM actors WHERE id = 23456;
id | name |
---|---|
23456 | Geraldo Alves |
The PostgreSQL command EXPLAIN ANALYZE
runs the execution plan of a statement and displays actual run time statistics. This is useful to understand what the query is actually doing.
%%sql
EXPLAIN ANALYZE SELECT * FROM actors WHERE id = 23456;
QUERY PLAN |
---|
Index Scan using actor_pkey on actors (cost=0.42..8.44 rows=1 width=18) (actual time=0.016..0.017 rows=1 loops=1) |
Index Cond: (id = 23456) |
Planning Time: 0.056 ms |
Execution Time: 0.030 ms |
By contrast, the below query on cast_info
runs quite slowly. Why?
%%sql
EXPLAIN ANALYZE SELECT * FROM cast_info WHERE person_id = 23456;
QUERY PLAN |
---|
Gather (cost=1000.00..23415.29 rows=11060 width=8) (actual time=10.284..105.366 rows=3 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Parallel Seq Scan on cast_info (cost=0.00..21309.29 rows=4608 width=8) (actual time=38.441..100.369 rows=1 loops=3) |
Filter: (person_id = 23456) |
Rows Removed by Filter: 737311 |
Planning Time: 0.082 ms |
Execution Time: 105.387 ms |
Explanation:
Run \d
on cast_info
and actors
. cast_info
does not have an index on movie_id
!
$ psql -h localhost -d imdb_perf_lecture
imdb_perf_lecture=# \d actors
Table "public.actors"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | not null |
name | text | | |
Indexes:
"actor_pkey" PRIMARY KEY, btree (id)
Referenced by:
TABLE "cast_info" CONSTRAINT "cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actors(id)
imdb_perf_lecture=# \d cast_info
Table "public.cast_info"
Column | Type | Collation | Nullable | Default
-----------+---------+-----------+----------+---------
person_id | integer | | |
movie_id | integer | | |
Foreign-key constraints:
"cast_info_movie_id_fkey" FOREIGN KEY (movie_id) REFERENCES movies(id)
"cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actors(id)
Creating new Indexes¶
In the actors table, name
is not a primary key. What kind of scan do you think the following query will produce?
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE name = 'Tom Hanks';
QUERY PLAN |
---|
Gather (cost=1000.00..10631.77 rows=1 width=18) (actual time=0.224..53.515 rows=1 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Parallel Seq Scan on actors (cost=0.00..9631.67 rows=1 width=18) (actual time=31.824..48.821 rows=0 loops=3) |
Filter: (name = 'Tom Hanks'::text) |
Rows Removed by Filter: 281962 |
Planning Time: 0.081 ms |
Execution Time: 53.531 ms |
We can manually create an index, even if it's not a primary key. Below, we create a multi-dimensional index just to show you the syntax:
%sql CREATE INDEX nameIdIndex ON actors(name,id);
This makes our original query much faster:
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE name = 'Tom Hanks';
QUERY PLAN |
---|
Index Only Scan using nameidindex on actors (cost=0.42..4.44 rows=1 width=18) (actual time=0.043..0.044 rows=1 loops=1) |
Index Cond: (name = 'Tom Hanks'::text) |
Heap Fetches: 0 |
Planning Time: 0.241 ms |
Execution Time: 0.059 ms |
Why "Index Only" Scan? Well, SQL correctly identified that there are only two attributes in the Actor table, and both are located in the index. So we just need to search the index; we don't need to additionally fetch any records.
Exercise: Types of Scans¶
SQL automatically decides whether index scans are worth it. Sometimes, it decides to do a sequential scan instead, or even a bitmap heap scan.
The below exact match lookup produces an Index Scan:
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE id = 23456;
QUERY PLAN |
---|
Index Scan using actor_pkey on actors (cost=0.42..8.44 rows=1 width=18) (actual time=0.018..0.020 rows=1 loops=1) |
Index Cond: (id = 23456) |
Planning Time: 0.072 ms |
Execution Time: 0.035 ms |
This range lookup also produces an Index Scan:
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE 23456 <= id AND id < 23500;
QUERY PLAN |
---|
Index Scan using actor_pkey on actors (cost=0.42..48.10 rows=14 width=18) (actual time=0.007..0.009 rows=11 loops=1) |
Index Cond: ((id >= 23456) AND (id < 23500)) |
Planning Time: 0.149 ms |
Execution Time: 0.023 ms |
However, the below range lookup produces a Sequential scan!
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE id >= 23456;
QUERY PLAN |
---|
Seq Scan on actors (cost=0.00..15799.60 rows=838554 width=18) (actual time=0.165..71.750 rows=838028 loops=1) |
Filter: (id >= 23456) |
Rows Removed by Filter: 7860 |
Planning Time: 0.091 ms |
Execution Time: 97.744 ms |
And this other range lookup produces a Bitmap Heap Scan??
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE 5 <= id AND id < 23457;
QUERY PLAN |
---|
Bitmap Heap Scan on actors (cost=159.59..5767.30 rows=7333 width=18) (actual time=0.450..1.300 rows=7857 loops=1) |
Recheck Cond: ((5 <= id) AND (id < 23457)) |
Heap Blocks: exact=49 |
-> Bitmap Index Scan on actor_pkey (cost=0.00..157.75 rows=7333 width=0) (actual time=0.431..0.431 rows=7857 loops=1) |
Index Cond: ((id >= 5) AND (id < 23457)) |
Planning Time: 0.116 ms |
Execution Time: 1.531 ms |
- Index scan:
- For each index key match, there is a page fetch.
- If multiple index key matches all correspond to a single page, that single page may get fetched multiple times.matches on our query.
- Sequential scan:
- Once each page is loaded in, all records on that page are scanned in sequence.
- Bitmap heap scan:
- Pre-scans the index to identify the unique pages to visit, then sequentially scans the subset of pages
- More here: stackoverflow
Takeaway:
- There is no guarantee that records in memory are sorted in the same way as records in the index.
- Therefore index lookups are effectively random lookups! Many random lookups are typically more expensive than many sequential lookups!
Other range lookups for your practice:
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE id >= 23456 AND id < 23500;
QUERY PLAN |
---|
Index Scan using actor_pkey on actors (cost=0.42..48.10 rows=14 width=18) (actual time=0.008..0.010 rows=11 loops=1) |
Index Cond: ((id >= 23456) AND (id < 23500)) |
Planning Time: 0.113 ms |
Execution Time: 0.021 ms |
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE id >= 23456 AND id < 23457;
QUERY PLAN |
---|
Index Scan using actor_pkey on actors (cost=0.42..8.45 rows=1 width=18) (actual time=0.011..0.012 rows=1 loops=1) |
Index Cond: ((id >= 23456) AND (id < 23457)) |
Planning Time: 0.126 ms |
Execution Time: 0.035 ms |
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE id >= 23456 OR id < 23457;
QUERY PLAN |
---|
Seq Scan on actors (cost=0.00..17914.32 rows=838618 width=18) (actual time=0.008..73.245 rows=845888 loops=1) |
Filter: ((id >= 23456) OR (id < 23457)) |
Planning Time: 0.111 ms |
Execution Time: 97.056 ms |
Cleanup¶
We drop the newly created index just to clean things up:
%sql DROP INDEX nameIdIndex;
And we close the connection, then drop the database:
%sql --close postgresql://127.0.0.1:5432/imdb_perf_lecture
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
DROP DATABASE