Lecture 07¶

In [1]:
# Run this cell to set up imports
import numpy as np
import pandas as pd
In [2]:
%reload_ext sql
In [3]:
%sql postgresql://127.0.0.1:5432/postgres
In [4]:
!unzip -u data/imdb_perf_lecture.zip -d data/
Archive:  data/imdb_perf_lecture.zip

New IMDB Performance database¶

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! So if you run the below lines, please also remember to delete the imdb_perf_lecture afterwards to save space on your limited postgreSQL server.

In [5]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f data/imdb_perf_lecture.sql
DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 845888
COPY 2211936
COPY 656453
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE

Display indexes¶

In [6]:
%reload_ext sql
In [7]:
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture
Connecting and switching to connection postgresql://127.0.0.1:5432/imdb_perf_lecture
In [8]:
%sqlcmd tables
Out[8]:
Name
movie
cast_info
actor
In [9]:
%sqlcmd columns -t actor
Out[9]:
name type nullable default autoincrement comment
id INTEGER False None False None
name TEXT True None False None

If you're in psql, the meta-command \d <relation> shows indexes maintained with the <relation> table.

In JupySQL, to find all the indexes, we look in the system view pg_indexes (documentation 54.11):

In [10]:
%%sql
SELECT *
FROM pg_indexes
WHERE schemaname = 'public';
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
2 rows affected.
Out[10]:
schemaname tablename indexname tablespace indexdef
public actor actor_pkey None CREATE UNIQUE INDEX actor_pkey ON public.actor USING btree (id)
public movie movie_pkey None CREATE UNIQUE INDEX movie_pkey ON public.movie USING btree (id)

Read the indexdef as: the Actor relation has an index named actor_pkey which is created on the attribute id. In this case, the attribute id is also the primary key of the Actor relation, hence why it has an index. More on why primary keys automatically generate indexes in a bit.

EXPLAIN ANALYZE¶

This query seems like it runs pretty quickly:

In [11]:
%%sql
SELECT * FROM Actor WHERE id = 23456;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
1 rows affected.
Out[11]:
id name
23456 Geraldo Alves

The PostgreSQL command EXPLAIN ANALYZE runs the execution plan of a statement and displays actual run time statistics. This is useful to understand what the query is actually doing.

In [12]:
%%sql
EXPLAIN ANALYZE SELECT * FROM Actor WHERE id = 23456;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
4 rows affected.
Out[12]:
QUERY PLAN
Index Scan using actor_pkey on actor (cost=0.42..8.44 rows=1 width=36) (actual time=0.013..0.014 rows=1 loops=1)
  Index Cond: (id = 23456)
Planning Time: 0.050 ms
Execution Time: 0.026 ms

By contrast, the below query on Crew_info runs quite slowly. Why?

In [13]:
%%sql
EXPLAIN ANALYZE SELECT * FROM Cast_info WHERE person_id = 23456;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
8 rows affected.
Out[13]:
QUERY PLAN
Gather (cost=1000.00..22310.10 rows=16 width=8) (actual time=102.617..104.600 rows=3 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -> Parallel Seq Scan on cast_info (cost=0.00..21308.50 rows=7 width=8) (actual time=68.875..100.138 rows=1 loops=3)
        Filter: (person_id = 23456)
        Rows Removed by Filter: 737311
Planning Time: 0.110 ms
Execution Time: 104.618 ms

Explanation:

Run \d on Cast_info and Actor. Cast_info does not have an index on movie_id!

imdb_perf_lecture=# \d Actor
               Table "public.actor"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 id     | integer |           | not null | 
 name   | text    |           |          | 
Indexes:
    "actor_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "cast_info" CONSTRAINT "cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actor(id)


imdb_perf_lecture=# \d cast_info
               Table "public.cast_info"
  Column   |  Type   | Collation | Nullable | Default 
-----------+---------+-----------+----------+---------
 person_id | integer |           |          | 
 movie_id  | integer |           |          | 
Foreign-key constraints:
    "cast_info_movie_id_fkey" FOREIGN KEY (movie_id) REFERENCES movie(id)
    "cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actor(id)

Creating new Indexes¶

In the Actor table, name is not a primary key. What kind of scan do you think the following query will produce?

In [14]:
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE name = 'Tom Hanks';
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
8 rows affected.
Out[14]:
QUERY PLAN
Gather (cost=1000.00..10631.77 rows=1 width=18) (actual time=0.221..25.475 rows=1 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -> Parallel Seq Scan on actor (cost=0.00..9631.67 rows=1 width=18) (actual time=13.453..21.182 rows=0 loops=3)
        Filter: (name = 'Tom Hanks'::text)
        Rows Removed by Filter: 281962
Planning Time: 0.143 ms
Execution Time: 25.492 ms

We can manually create an index, even if it's not a primary key. Below, we create a multi-dimensional index just to show you the syntax:

In [15]:
%sql CREATE INDEX nameIdIndex ON Actor(name,id);
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
Out[15]:

This makes our original query much faster:

In [16]:
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE name = 'Tom Hanks';
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
5 rows affected.
Out[16]:
QUERY PLAN
Index Only Scan using nameidindex on actor (cost=0.42..4.44 rows=1 width=18) (actual time=0.044..0.045 rows=1 loops=1)
  Index Cond: (name = 'Tom Hanks'::text)
  Heap Fetches: 0
Planning Time: 0.217 ms
Execution Time: 0.059 ms

Why "Index Only" Scan? Well, SQL correctly identified that there are only two attributes in the Actor table, and both are located in the index. So we just need to search the index; we don't need to additionally fetch any records.

Exercise: Types of Scans¶

SQL automatically decides whether index scans are worth it. Sometimes, it decides to do a sequential scan instead, or even a bitmap heap scan.


The below exact match lookup produces an Index Scan:

In [17]:
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE id = 23456;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
4 rows affected.
Out[17]:
QUERY PLAN
Index Scan using actor_pkey on actor (cost=0.42..8.44 rows=1 width=18) (actual time=0.015..0.016 rows=1 loops=1)
  Index Cond: (id = 23456)
Planning Time: 0.074 ms
Execution Time: 0.029 ms

This range lookup also produces an Index Scan:

In [18]:
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE 23456 <= id AND id < 23500;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
4 rows affected.
Out[18]:
QUERY PLAN
Index Scan using actor_pkey on actor (cost=0.42..36.00 rows=10 width=18) (actual time=0.009..0.011 rows=11 loops=1)
  Index Cond: ((id >= 23456) AND (id < 23500))
Planning Time: 0.101 ms
Execution Time: 0.022 ms

However, the below range lookup produces a Sequential scan!

In [19]:
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE id >= 23456;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
5 rows affected.
Out[19]:
QUERY PLAN
Seq Scan on actor (cost=0.00..15799.60 rows=837134 width=18) (actual time=0.191..71.244 rows=838028 loops=1)
  Filter: (id >= 23456)
  Rows Removed by Filter: 7860
Planning Time: 0.059 ms
Execution Time: 94.883 ms

And this other range lookup produces a Bitmap Heap Scan??

In [20]:
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE 5 <= id AND id < 23457;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
7 rows affected.
Out[20]:
QUERY PLAN
Bitmap Heap Scan on actor (cost=190.14..5731.79 rows=8753 width=18) (actual time=0.445..1.287 rows=7857 loops=1)
  Recheck Cond: ((5 <= id) AND (id < 23457))
  Heap Blocks: exact=49
  -> Bitmap Index Scan on actor_pkey (cost=0.00..187.96 rows=8753 width=0) (actual time=0.425..0.426 rows=7857 loops=1)
        Index Cond: ((id >= 5) AND (id < 23457))
Planning Time: 0.166 ms
Execution Time: 1.524 ms
  • Index scan:
    • For each index key match, there is a page fetch.
    • If multiple index key matches all correspond to a single page, that single page may get fetched multiple times based on the matches for the same query.
  • Sequential scan:
    • Once each page is loaded in, all records on that page are scanned in sequence.
  • Bitmap heap scan:
    • Pre-scans the index to identify the unique pages to visit, then sequentially scans the subset of pages
    • More here: stackoverflow





Takeaway:

  • There is no guarantee that records on disk are sorted in the same way as records in the index.
  • Therefore index lookups are effectively random lookups! Many random lookups are typically more expensive than many sequential lookups!


Other range lookups for your practice:
In [ ]:
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE id >= 23456 AND id < 23500;
In [ ]:
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE id >= 23456 AND id < 23457;
In [ ]:
%sql EXPLAIN ANALYZE SELECT * FROM Actor WHERE id >= 23456 OR id < 23457;

Cleanup¶

We drop the newly created index just to clean things up:

In [21]:
%sql DROP INDEX nameIdIndex;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
Out[21]:

And we close the connection, then drop the database:

In [22]:
%sql --close postgresql://127.0.0.1:5432/imdb_perf_lecture
In [23]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
DROP DATABASE