Lecture 02: SQL Review, Extra¶
This extra notebook contains a lot of extra SQL practice. Many variations of queries shown below are either in the slides or in course notes.
First load in the data into the database¶
!unzip -u data/imdb_lecture.zip -d data/
Archive: data/imdb_lecture.zip
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_lecture'
!psql -h localhost -d imdb_lecture -f data/imdb_lecture.sql
NOTICE: database "imdb_lecture" does not exist, skipping DROP DATABASE CREATE DATABASE SET SET SET SET SET set_config ------------ (1 row) SET SET SET SET SET SET CREATE TABLE ALTER TABLE CREATE TABLE ALTER TABLE CREATE TABLE ALTER TABLE CREATE TABLE ALTER TABLE CREATE TABLE ALTER TABLE CREATE TABLE ALTER TABLE COPY 8405 COPY 4043 COPY 4923 COPY 1223 COPY 820 COPY 2420
jupysql
setup¶
%reload_ext sql
%sql postgresql://jovyan@127.0.0.1:5432/imdb_lecture
CAST¶
What's wrong with the following query?
%%sql
SELECT primary_title, type,
premiered AS release_year,
runtime_minutes,
runtime_minutes/60 AS
runtime_hours
FROM titles
WHERE premiered >= 2020 AND
premiered <= 2023;
primary_title | type | release_year | runtime_minutes | runtime_hours |
---|---|---|---|---|
Blood of Zeus | tvSeries | 2020 | 30 | 0 |
Gods & Heroes | tvSeries | 2020 | 30 | 0 |
Shaq Life | tvSeries | 2020 | None | None |
What's After | tvSeries | 2020 | None | None |
Utmark | tvSeries | 2020 | None | None |
La Femme Anjola | movie | 2021 | 140 | 2 |
Mr. Corman | tvSeries | 2021 | 285 | 4 |
Player Vs Player with Trevor Noah | tvSeries | 2021 | None | None |
Run for Young | tvSeries | 2020 | None | None |
Poker Nights | tvSeries | 2021 | 6 | 0 |
CASE statements¶
What's wrong with the following query?
%%sql
SELECT
person_id, name,
died, born,
died - born AS age
FROM people;
person_id | name | died | born | age |
---|---|---|---|---|
nm0384214 | Dwayne Hill | None | None | None |
nm0362443 | Dave Hardman | None | 1960 | None |
nm1560888 | Rich Pryce-Jones | None | None | None |
nm0006669 | William Sadler | None | 1950 | None |
nm1373094 | Giada De Laurentiis | None | 1970 | None |
nm7316782 | Janine Hartmann | None | None | None |
nm8671663 | Tereza Taliánová | None | 2005 | None |
nm10480297 | Chris Heywood | None | None | None |
nm10803545 | Chengao Zhou | None | None | None |
nm9849414 | Mark Langley | None | None | None |
Null values and boolean expressions¶
Compare/contrast the following three queries:
We are going to be using the jupysql
library to connect our notebook to a PostgreSQL database server on your jupyterhub account. The next cell should do the trick; you should not see any error messages after it completes.
%%sql
SELECT born
FROM people;
born |
---|
None |
1960 |
None |
1950 |
1970 |
None |
2005 |
None |
None |
None |
%%sql
SELECT born
FROM people
WHERE born < 2023 OR
born IS NULL;
born |
---|
None |
1960 |
None |
1950 |
1970 |
None |
2005 |
None |
None |
None |
%%sql
SELECT born
FROM people
WHERE born < 2023;
born |
---|
1960 |
1950 |
1970 |
2005 |
1980 |
1926 |
1975 |
1919 |
1940 |
1929 |
String matching and COUNT(*)¶
%%sql
SELECT *
FROM people
WHERE name LIKE 'Chris%';
person_id | name | born | died |
---|---|---|---|
nm10480297 | Chris Heywood | None | None |
nm9115948 | Chris Bratt | None | None |
nm6699360 | Chris Evans | None | None |
nm12363226 | Chris Daniels | None | None |
nm2653742 | Chris Longo | None | None |
nm5237685 | Chris Evans | None | None |
nm3074588 | Christian Bland | None | None |
nm5646425 | Chris Evans | None | None |
nm1470079 | Chris Boiling | None | None |
nm11632011 | Chris Angold | None | None |
Multiple relations, Aliasing¶
%%sql
SELECT *
FROM
akas, titles
WHERE
titles.title_id =
akas.title_id;
title_id | title | region | language | types | attributes | is_original_title | title_id_1 | type | primary_title | original_title | is_adult | premiered | ended | runtime_minutes | genres |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tt0909144 | 46 | US | None | None | None | 0 | tt0909144 | tvEpisode | 46 | 46 | 0 | 1971 | None | None | Comedy,Family |
tt0909144 | 46 | US | None | None | None | 0 | tt0909144 | tvEpisode | 46 | 46 | 0 | 1971 | None | None | Comedy,Family |
tt3719148 | Gustavo Lopez/Abel Pintos | AR | None | None | None | 0 | tt3719148 | tvEpisode | Gustavo Lopez/Abel Pintos | Gustavo Lopez/Abel Pintos | 0 | 2012 | None | None | Comedy,Talk-Show |
tt3719148 | Gustavo Lopez/Abel Pintos | AR | None | None | None | 0 | tt3719148 | tvEpisode | Gustavo Lopez/Abel Pintos | Gustavo Lopez/Abel Pintos | 0 | 2012 | None | None | Comedy,Talk-Show |
tt9047618 | The Women in the Sand | GB | None | imdbDisplay | None | 0 | tt9047618 | movie | The Women in the Sand | The Women in the Sand | 0 | 2017 | None | 73 | Documentary |
tt9047618 | The Women in the Sand | GB | None | imdbDisplay | None | 0 | tt9047618 | movie | The Women in the Sand | The Women in the Sand | 0 | 2017 | None | 73 | Documentary |
tt1259521 | La cabaña del terror | AR | None | imdbDisplay | None | 0 | tt1259521 | movie | The Cabin in the Woods | The Cabin in the Woods | 0 | 2011 | None | 95 | Horror |
tt1259521 | La cabaña del terror | AR | None | imdbDisplay | None | 0 | tt1259521 | movie | The Cabin in the Woods | The Cabin in the Woods | 0 | 2011 | None | 95 | Horror |
tt5557622 | एपिसोड #1.15 | IN | hi | None | None | 0 | tt5557622 | tvEpisode | Episode #1.15 | Episode #1.15 | 0 | 2014 | None | 59 | Drama |
tt5557622 | एपिसोड #1.15 | IN | hi | None | None | 0 | tt5557622 | tvEpisode | Episode #1.15 | Episode #1.15 | 0 | 2014 | None | 59 | Drama |
%%sql
SELECT *
FROM
akas AS A,
titles T
WHERE
A.title_id = T.title_id;
title_id | title | region | language | types | attributes | is_original_title | title_id_1 | type | primary_title | original_title | is_adult | premiered | ended | runtime_minutes | genres |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tt0909144 | 46 | US | None | None | None | 0 | tt0909144 | tvEpisode | 46 | 46 | 0 | 1971 | None | None | Comedy,Family |
tt0909144 | 46 | US | None | None | None | 0 | tt0909144 | tvEpisode | 46 | 46 | 0 | 1971 | None | None | Comedy,Family |
tt3719148 | Gustavo Lopez/Abel Pintos | AR | None | None | None | 0 | tt3719148 | tvEpisode | Gustavo Lopez/Abel Pintos | Gustavo Lopez/Abel Pintos | 0 | 2012 | None | None | Comedy,Talk-Show |
tt3719148 | Gustavo Lopez/Abel Pintos | AR | None | None | None | 0 | tt3719148 | tvEpisode | Gustavo Lopez/Abel Pintos | Gustavo Lopez/Abel Pintos | 0 | 2012 | None | None | Comedy,Talk-Show |
tt9047618 | The Women in the Sand | GB | None | imdbDisplay | None | 0 | tt9047618 | movie | The Women in the Sand | The Women in the Sand | 0 | 2017 | None | 73 | Documentary |
tt9047618 | The Women in the Sand | GB | None | imdbDisplay | None | 0 | tt9047618 | movie | The Women in the Sand | The Women in the Sand | 0 | 2017 | None | 73 | Documentary |
tt1259521 | La cabaña del terror | AR | None | imdbDisplay | None | 0 | tt1259521 | movie | The Cabin in the Woods | The Cabin in the Woods | 0 | 2011 | None | 95 | Horror |
tt1259521 | La cabaña del terror | AR | None | imdbDisplay | None | 0 | tt1259521 | movie | The Cabin in the Woods | The Cabin in the Woods | 0 | 2011 | None | 95 | Horror |
tt5557622 | एपिसोड #1.15 | IN | hi | None | None | 0 | tt5557622 | tvEpisode | Episode #1.15 | Episode #1.15 | 0 | 2014 | None | 59 | Drama |
tt5557622 | एपिसोड #1.15 | IN | hi | None | None | 0 | tt5557622 | tvEpisode | Episode #1.15 | Episode #1.15 | 0 | 2014 | None | 59 | Drama |
%%sql
SELECT *
FROM akas A
INNER JOIN titles T
ON A.title_id = T.title_id
title_id | title | region | language | types | attributes | is_original_title | title_id_1 | type | primary_title | original_title | is_adult | premiered | ended | runtime_minutes | genres |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tt0909144 | 46 | US | None | None | None | 0 | tt0909144 | tvEpisode | 46 | 46 | 0 | 1971 | None | None | Comedy,Family |
tt0909144 | 46 | US | None | None | None | 0 | tt0909144 | tvEpisode | 46 | 46 | 0 | 1971 | None | None | Comedy,Family |
tt3719148 | Gustavo Lopez/Abel Pintos | AR | None | None | None | 0 | tt3719148 | tvEpisode | Gustavo Lopez/Abel Pintos | Gustavo Lopez/Abel Pintos | 0 | 2012 | None | None | Comedy,Talk-Show |
tt3719148 | Gustavo Lopez/Abel Pintos | AR | None | None | None | 0 | tt3719148 | tvEpisode | Gustavo Lopez/Abel Pintos | Gustavo Lopez/Abel Pintos | 0 | 2012 | None | None | Comedy,Talk-Show |
tt9047618 | The Women in the Sand | GB | None | imdbDisplay | None | 0 | tt9047618 | movie | The Women in the Sand | The Women in the Sand | 0 | 2017 | None | 73 | Documentary |
tt9047618 | The Women in the Sand | GB | None | imdbDisplay | None | 0 | tt9047618 | movie | The Women in the Sand | The Women in the Sand | 0 | 2017 | None | 73 | Documentary |
tt1259521 | La cabaña del terror | AR | None | imdbDisplay | None | 0 | tt1259521 | movie | The Cabin in the Woods | The Cabin in the Woods | 0 | 2011 | None | 95 | Horror |
tt1259521 | La cabaña del terror | AR | None | imdbDisplay | None | 0 | tt1259521 | movie | The Cabin in the Woods | The Cabin in the Woods | 0 | 2011 | None | 95 | Horror |
tt5557622 | एपिसोड #1.15 | IN | hi | None | None | 0 | tt5557622 | tvEpisode | Episode #1.15 | Episode #1.15 | 0 | 2014 | None | 59 | Drama |
tt5557622 | एपिसोड #1.15 | IN | hi | None | None | 0 | tt5557622 | tvEpisode | Episode #1.15 | Episode #1.15 | 0 | 2014 | None | 59 | Drama |
IMDB exercise 1¶
What does each record represent in the below result? Why?
%%sql
SELECT *
FROM titles
INNER JOIN crew
ON crew.title_id =
titles.title_id
INNER JOIN people
ON people.person_id =
crew.person_id;
title_id | type | primary_title | original_title | is_adult | premiered | ended | runtime_minutes | genres | title_id_1 | person_id | category | job | person_id_1 | name | born | died |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | None | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | None | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | 70 | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | 70 | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | None | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | None | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | 70 | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | 70 | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0009202 | movie | The House of Glass | The House of Glass | 0 | 1918 | None | 50 | Drama | tt0009202 | nm0154352 | director | None | nm0154352 | Emile Chautard | 1864 | 1934 |
tt0009202 | movie | The House of Glass | The House of Glass | 0 | 1918 | None | 50 | Drama | tt0009202 | nm0154352 | director | None | nm0154352 | Emile Chautard | 1864 | 1934 |
How do we modify the above query so that it gets the titles and IDs of Michelle Yeoh movies?
Let's cache your query string using some fancy jupysql
formatting:
# write your query below
# while it's bad style, for this to work,
# don't end with a semicolon.
ex1_query = """
-- fill in your query here --
"""
%%sql
{{ex1_query}};
IMDB exercise 2¶
How do we write a query that gets the names of Michelle Yeoh movies that have a rating of at least 8.0?
First, let's create a view called yeoh_movies
. More in a bit.
%%sql
CREATE VIEW yeoh_movies AS (
{{ex1_query}}
);
%sql SELECT * FROM yeoh_movies;
%%sql
SELECT primary_title
FROM ratings
INNER JOIN yeoh_movies
ON ratings.title_id = yeoh_movies.title_id
WHERE rating >= 7.0;
Quick Peek: The Natural Join¶
%%sql
SELECT *
FROM ratings
INNER JOIN yeoh_movies
ON ratings.title_id =
yeoh_movies.title_id
WHERE rating >= 7.0;
title_id | rating | votes | primary_title | title_id_1 |
---|---|---|---|---|
tt0190332 | 7.9 | 268227 | Crouching Tiger, Hidden Dragon | tt0190332 |
tt0397535 | 7.4 | 132457 | Memoirs of a Geisha | tt0397535 |
tt0190332 | 7.8 | 241690 | Crouching Tiger, Hidden Dragon | tt0190332 |
tt0190332 | 7.9 | 268227 | Crouching Tiger, Hidden Dragon | tt0190332 |
tt0397535 | 7.4 | 132457 | Memoirs of a Geisha | tt0397535 |
tt0190332 | 7.8 | 241690 | Crouching Tiger, Hidden Dragon | tt0190332 |
%%sql
SELECT *
FROM ratings
NATURAL JOIN yeoh_movies
WHERE rating >= 7.0;
%reload_ext sql
%%sql
SELECT DISTINCT titles.primary_title, titles.title_id
FROM titles
INNER JOIN crew
ON crew.title_id = titles.title_id
INNER JOIN people
ON people.person_id = crew.person_id
WHERE people.name = 'Morgan Freeman' AND titles.type = 'movie';
%sql postgresql://jovyan@127.0.0.1:5432/imdb_lecture
Tricky Queries¶
What do these queries do?
%%sql
SELECT REGEXP_REPLACE(name, '(.*) (.*)', '\1')
as firstname,
COUNT(*) as countname
FROM people
GROUP BY firstname
ORDER BY countname DESC;
%%sql
SELECT type,
AVG (CASE WHEN premiered < 2000 THEN runtime_minutes
ELSE NULL
END) AS pre2k_avg, AVG (CASE WHEN premiered >= 2000 THEN runtime_minutes
ELSE NULL
END) AS post2k_avg
FROM titles
GROUP BY type;