Lecture 08: Query Optimization I¶

New IMDB Performance database¶

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! So if you run the below lines, please also remember to delete the imdb_perf_lecture afterwards to save space on your limited postgreSQL server.

If you didn't load it in with a previous lecture, load in the imdb_perf_lecture database.

In [1]:
!unzip -u ../lec07/data/imdb_perf_lecture.zip -d ../lec07/data/
Archive:  ../lec07/data/imdb_perf_lecture.zip
In [2]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f ../lec07/data/imdb_perf_lecture.sql
DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 845888
COPY 2211936
COPY 656453
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
In [3]:
%reload_ext sql
There's a new jupysql version available (0.10.10), you're running 0.10.0. To upgrade: pip install jupysql --upgrade
Deploy FastAPI apps for free on Ploomber Cloud! Learn more: https://ploomber.io/s/signup
In [4]:
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture

Demo¶


Example 1

In [7]:
%sql EXPLAIN ANALYZE SELECT * FROM Actor;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
3 rows affected.
Out[7]:
QUERY PLAN
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=18) (actual time=0.038..54.412 rows=845888 loops=1)
Planning Time: 0.041 ms
Execution Time: 78.266 ms

Notice ANALYZE:

  • start time and end time for operator, rows processed
  • loops = number of times the operator is executed

Notice EXPLAIN:

  • width = size (in bytes) of output tuples from that operator

Example 2: just planning (no execution)

In [8]:
%sql EXPLAIN SELECT * FROM Actor;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
1 rows affected.
Out[8]:
QUERY PLAN
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=18)

Example 3

In [7]:
%sql EXPLAIN ANALYZE SELECT id FROM Actor;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
3 rows affected.
Out[7]:
QUERY PLAN
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=4) (actual time=0.041..74.812 rows=845888 loops=1)
Planning Time: 0.040 ms
Execution Time: 98.669 ms

Notice:

  • width from 18 down to 4!
  • still 845k output tuples

Example 4:

In [9]:
%sql EXPLAIN ANALYZE SELECT id FROM Actor WHERE id > 4000000;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
5 rows affected.
Out[9]:
QUERY PLAN
Index Only Scan using actor_pkey on actor (cost=0.42..12630.93 rows=443343 width=4) (actual time=0.067..53.516 rows=444781 loops=1)
  Index Cond: (id > 4000000)
  Heap Fetches: 0
Planning Time: 0.104 ms
Execution Time: 66.626 ms

Notice:

  • output tuples now reduced to 444k
  • planning has imperfect estimate

Example 5:

In [10]:
%sql EXPLAIN ANALYZE SELECT id, name FROM Actor WHERE id > 4000000;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
5 rows affected.
Out[10]:
QUERY PLAN
Seq Scan on actor (cost=0.00..15799.60 rows=443343 width=18) (actual time=0.252..62.556 rows=444781 loops=1)
  Filter: (id > 4000000)
  Rows Removed by Filter: 401107
Planning Time: 0.064 ms
Execution Time: 75.123 ms

Notice:

  • width back up a bit to 18 (because of name)
  • index-only scan moving to sequential scan

Moar queries if time permits!¶

In [11]:
%%sql
EXPLAIN ANALYZE -- 1
SELECT id
FROM Actor
WHERE id > 4000000 AND name='Tom Hanks';
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
8 rows affected.
Out[11]:
QUERY PLAN
Gather (cost=1000.00..11512.90 rows=1 width=4) (actual time=23.507..25.321 rows=0 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -> Parallel Seq Scan on actor (cost=0.00..10512.80 rows=1 width=4) (actual time=20.963..20.964 rows=0 loops=3)
        Filter: ((id > 4000000) AND (name = 'Tom Hanks'::text))
        Rows Removed by Filter: 281963
Planning Time: 0.093 ms
Execution Time: 25.337 ms

Query 1: Compared to previous query where there was no name condition in WHERE clause, this query uses a parallel sequential scan: likely because the predicate is a bit more expensive to check, and so parallelism in processing the tuples is helpful.

Note also just one output tuple predicted, but actually no output tuples produced!

In [12]:
%%sql
EXPLAIN ANALYZE -- 2
SELECT id
FROM Actor
WHERE id < 4000000 AND name='Tom Hanks';
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
8 rows affected.
Out[12]:
QUERY PLAN
Gather (cost=1000.00..11512.90 rows=1 width=4) (actual time=0.304..25.344 rows=1 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -> Parallel Seq Scan on actor (cost=0.00..10512.80 rows=1 width=4) (actual time=13.499..21.214 rows=0 loops=3)
        Filter: ((id < 4000000) AND (name = 'Tom Hanks'::text))
        Rows Removed by Filter: 281962
Planning Time: 0.081 ms
Execution Time: 25.366 ms

Query 2: Compared to previous, we flipped the id condition, and now we've found a match for Tom Hanks! Otherwise quite similar. Note that we still need to go through all the tuples.

In [12]:
%%sql
EXPLAIN ANALYZE -- 3
SELECT id
FROM Actor;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
3 rows affected.
Out[12]:
QUERY PLAN
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=4) (actual time=0.043..72.572 rows=845888 loops=1)
Planning Time: 0.039 ms
Execution Time: 96.430 ms

Query 3: Resetting to an old query for a second without conditions...

In [13]:
%%sql
EXPLAIN ANALYZE -- 4
SELECT id
FROM Actor
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
4 rows affected.
Out[13]:
QUERY PLAN
Limit (cost=0.00..0.16 rows=10 width=4) (actual time=0.042..0.045 rows=10 loops=1)
  -> Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=4) (actual time=0.041..0.042 rows=10 loops=1)
Planning Time: 0.073 ms
Execution Time: 0.059 ms

Query 4: by adding a LIMIT, see significantly reduced execution time! Also a limit node above the sequential scan node.

Next, we just change a SQLmagic setting to allow for a longer query plan to be displayed...

In [14]:
%config SqlMagic.displaylimit = None
displaylimit: Value None will be treated as 0 (no limit)
In [15]:
%%sql
EXPLAIN ANALYZE -- 5
SELECT id
FROM Actor
ORDER BY name
LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/imdb_perf_lecture'
12 rows affected.
Out[15]:
QUERY PLAN
Limit (cost=17366.94..17368.11 rows=10 width=18) (actual time=66.592..68.584 rows=10 loops=1)
  -> Gather Merge (cost=17366.94..99611.72 rows=704906 width=18) (actual time=66.591..68.581 rows=10 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        -> Sort (cost=16366.92..17248.05 rows=352453 width=18) (actual time=64.110..64.112 rows=7 loops=3)
              Sort Key: name
              Sort Method: top-N heapsort Memory: 26kB
              Worker 0: Sort Method: top-N heapsort Memory: 25kB
              Worker 1: Sort Method: top-N heapsort Memory: 25kB
              -> Parallel Seq Scan on actor (cost=0.00..8750.53 rows=352453 width=18) (actual time=0.021..19.438 rows=281963 loops=3)
Planning Time: 0.081 ms
Execution Time: 68.617 ms

Query 5: Whoa! Sorts are expensive! Even though we had a LIMIT, the ORDER BY considerably increased costs.