Lecture 02: SQL Review, Extra¶

This extra notebook contains a lot of extra SQL practice. Many variations of queries shown below are either in the slides or in course notes.

First load in the data into the database¶

In [2]:
!unzip -u data/imdb_lecture.zip -d data/
Archive:  data/imdb_lecture.zip
In [3]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_lecture' 
!psql -h localhost -d imdb_lecture -f data/imdb_lecture.sql
NOTICE:  database "imdb_lecture" does not exist, skipping
DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 8405
COPY 4043
COPY 4923
COPY 1223
COPY 820
COPY 2420

jupysql setup¶

In [4]:
%reload_ext sql
In [6]:
%sql postgresql://jovyan@127.0.0.1:5432/imdb_lecture

CAST¶

What's wrong with the following query?

In [6]:
%%sql
SELECT primary_title, type,
     premiered AS release_year,
     runtime_minutes,
     runtime_minutes/60 AS 
         runtime_hours
FROM titles
WHERE premiered >= 2020 AND
      premiered <= 2023;
Running query in 'postgresql://jovyan@127.0.0.1:5432/imdb_lecture'
230 rows affected.
Out[6]:
primary_title type release_year runtime_minutes runtime_hours
Blood of Zeus tvSeries 2020 30 0
Gods & Heroes tvSeries 2020 30 0
Shaq Life tvSeries 2020 None None
What's After tvSeries 2020 None None
Utmark tvSeries 2020 None None
La Femme Anjola movie 2021 140 2
Mr. Corman tvSeries 2021 285 4
Player Vs Player with Trevor Noah tvSeries 2021 None None
Run for Young tvSeries 2020 None None
Poker Nights tvSeries 2021 6 0
Truncated to displaylimit of 10.

CASE statements¶

What's wrong with the following query?

In [7]:
%%sql
SELECT
    person_id, name,
    died, born,
    died - born AS age                                                                     
FROM people;
Running query in 'postgresql://jovyan@127.0.0.1:5432/imdb_lecture'
1223 rows affected.
Out[7]:
person_id name died born age
nm0384214 Dwayne Hill None None None
nm0362443 Dave Hardman None 1960 None
nm1560888 Rich Pryce-Jones None None None
nm0006669 William Sadler None 1950 None
nm1373094 Giada De Laurentiis None 1970 None
nm7316782 Janine Hartmann None None None
nm8671663 Tereza Taliánová None 2005 None
nm10480297 Chris Heywood None None None
nm10803545 Chengao Zhou None None None
nm9849414 Mark Langley None None None
Truncated to displaylimit of 10.

Null values and boolean expressions¶

Compare/contrast the following three queries:

We are going to be using the jupysql library to connect our notebook to a PostgreSQL database server on your jupyterhub account. The next cell should do the trick; you should not see any error messages after it completes.

In [8]:
%%sql
SELECT born
FROM people;
Running query in 'postgresql://jovyan@127.0.0.1:5432/imdb_lecture'
2446 rows affected.
Out[8]:
born
None
1960
None
1950
1970
None
2005
None
None
None
Truncated to displaylimit of 10.
In [9]:
%%sql
SELECT born
FROM people
WHERE born < 2023 OR
    born IS NULL;
Running query in 'postgresql://jovyan@127.0.0.1:5432/imdb_lecture'
2446 rows affected.
Out[9]:
born
None
1960
None
1950
1970
None
2005
None
None
None
Truncated to displaylimit of 10.
In [10]:
%%sql
SELECT born
FROM people
WHERE born < 2023;
Running query in 'postgresql://jovyan@127.0.0.1:5432/imdb_lecture'
1154 rows affected.
Out[10]:
born
1960
1950
1970
2005
1980
1926
1975
1919
1940
1929
Truncated to displaylimit of 10.

String matching and COUNT(*)¶

In [11]:
%%sql
SELECT *
FROM people
WHERE name LIKE 'Chris%';
Running query in 'postgresql://jovyan@127.0.0.1:5432/imdb_lecture'
168 rows affected.
Out[11]:
person_id name born died
nm10480297 Chris Heywood None None
nm9115948 Chris Bratt None None
nm6699360 Chris Evans None None
nm12363226 Chris Daniels None None
nm2653742 Chris Longo None None
nm5237685 Chris Evans None None
nm3074588 Christian Bland None None
nm5646425 Chris Evans None None
nm1470079 Chris Boiling None None
nm11632011 Chris Angold None None
Truncated to displaylimit of 10.

Multiple relations, Aliasing¶

In [12]:
%%sql
SELECT *
FROM
    akas, titles
WHERE
    titles.title_id = 
        akas.title_id;
Running query in 'postgresql://jovyan@127.0.0.1:5432/imdb_lecture'
35268 rows affected.
Out[12]:
title_id title region language types attributes is_original_title title_id_1 type primary_title original_title is_adult premiered ended runtime_minutes genres
tt0909144 46 US None None None 0 tt0909144 tvEpisode 46 46 0 1971 None None Comedy,Family
tt0909144 46 US None None None 0 tt0909144 tvEpisode 46 46 0 1971 None None Comedy,Family
tt3719148 Gustavo Lopez/Abel Pintos AR None None None 0 tt3719148 tvEpisode Gustavo Lopez/Abel Pintos Gustavo Lopez/Abel Pintos 0 2012 None None Comedy,Talk-Show
tt3719148 Gustavo Lopez/Abel Pintos AR None None None 0 tt3719148 tvEpisode Gustavo Lopez/Abel Pintos Gustavo Lopez/Abel Pintos 0 2012 None None Comedy,Talk-Show
tt9047618 The Women in the Sand GB None imdbDisplay None 0 tt9047618 movie The Women in the Sand The Women in the Sand 0 2017 None 73 Documentary
tt9047618 The Women in the Sand GB None imdbDisplay None 0 tt9047618 movie The Women in the Sand The Women in the Sand 0 2017 None 73 Documentary
tt1259521 La cabaña del terror AR None imdbDisplay None 0 tt1259521 movie The Cabin in the Woods The Cabin in the Woods 0 2011 None 95 Horror
tt1259521 La cabaña del terror AR None imdbDisplay None 0 tt1259521 movie The Cabin in the Woods The Cabin in the Woods 0 2011 None 95 Horror
tt5557622 एपिसोड #1.15 IN hi None None 0 tt5557622 tvEpisode Episode #1.15 Episode #1.15 0 2014 None 59 Drama
tt5557622 एपिसोड #1.15 IN hi None None 0 tt5557622 tvEpisode Episode #1.15 Episode #1.15 0 2014 None 59 Drama
Truncated to displaylimit of 10.
In [13]:
%%sql
SELECT *
FROM
    akas AS A,
    titles T
WHERE
    A.title_id = T.title_id;
Running query in 'postgresql://jovyan@127.0.0.1:5432/imdb_lecture'
35268 rows affected.
Out[13]:
title_id title region language types attributes is_original_title title_id_1 type primary_title original_title is_adult premiered ended runtime_minutes genres
tt0909144 46 US None None None 0 tt0909144 tvEpisode 46 46 0 1971 None None Comedy,Family
tt0909144 46 US None None None 0 tt0909144 tvEpisode 46 46 0 1971 None None Comedy,Family
tt3719148 Gustavo Lopez/Abel Pintos AR None None None 0 tt3719148 tvEpisode Gustavo Lopez/Abel Pintos Gustavo Lopez/Abel Pintos 0 2012 None None Comedy,Talk-Show
tt3719148 Gustavo Lopez/Abel Pintos AR None None None 0 tt3719148 tvEpisode Gustavo Lopez/Abel Pintos Gustavo Lopez/Abel Pintos 0 2012 None None Comedy,Talk-Show
tt9047618 The Women in the Sand GB None imdbDisplay None 0 tt9047618 movie The Women in the Sand The Women in the Sand 0 2017 None 73 Documentary
tt9047618 The Women in the Sand GB None imdbDisplay None 0 tt9047618 movie The Women in the Sand The Women in the Sand 0 2017 None 73 Documentary
tt1259521 La cabaña del terror AR None imdbDisplay None 0 tt1259521 movie The Cabin in the Woods The Cabin in the Woods 0 2011 None 95 Horror
tt1259521 La cabaña del terror AR None imdbDisplay None 0 tt1259521 movie The Cabin in the Woods The Cabin in the Woods 0 2011 None 95 Horror
tt5557622 एपिसोड #1.15 IN hi None None 0 tt5557622 tvEpisode Episode #1.15 Episode #1.15 0 2014 None 59 Drama
tt5557622 एपिसोड #1.15 IN hi None None 0 tt5557622 tvEpisode Episode #1.15 Episode #1.15 0 2014 None 59 Drama
Truncated to displaylimit of 10.
In [14]:
%%sql
SELECT *
FROM akas A
  INNER JOIN titles T
    ON A.title_id = T.title_id
Running query in 'postgresql://jovyan@127.0.0.1:5432/imdb_lecture'
35268 rows affected.
Out[14]:
title_id title region language types attributes is_original_title title_id_1 type primary_title original_title is_adult premiered ended runtime_minutes genres
tt0909144 46 US None None None 0 tt0909144 tvEpisode 46 46 0 1971 None None Comedy,Family
tt0909144 46 US None None None 0 tt0909144 tvEpisode 46 46 0 1971 None None Comedy,Family
tt3719148 Gustavo Lopez/Abel Pintos AR None None None 0 tt3719148 tvEpisode Gustavo Lopez/Abel Pintos Gustavo Lopez/Abel Pintos 0 2012 None None Comedy,Talk-Show
tt3719148 Gustavo Lopez/Abel Pintos AR None None None 0 tt3719148 tvEpisode Gustavo Lopez/Abel Pintos Gustavo Lopez/Abel Pintos 0 2012 None None Comedy,Talk-Show
tt9047618 The Women in the Sand GB None imdbDisplay None 0 tt9047618 movie The Women in the Sand The Women in the Sand 0 2017 None 73 Documentary
tt9047618 The Women in the Sand GB None imdbDisplay None 0 tt9047618 movie The Women in the Sand The Women in the Sand 0 2017 None 73 Documentary
tt1259521 La cabaña del terror AR None imdbDisplay None 0 tt1259521 movie The Cabin in the Woods The Cabin in the Woods 0 2011 None 95 Horror
tt1259521 La cabaña del terror AR None imdbDisplay None 0 tt1259521 movie The Cabin in the Woods The Cabin in the Woods 0 2011 None 95 Horror
tt5557622 एपिसोड #1.15 IN hi None None 0 tt5557622 tvEpisode Episode #1.15 Episode #1.15 0 2014 None 59 Drama
tt5557622 एपिसोड #1.15 IN hi None None 0 tt5557622 tvEpisode Episode #1.15 Episode #1.15 0 2014 None 59 Drama
Truncated to displaylimit of 10.

IMDB exercise 1¶

What does each record represent in the below result? Why?

In [15]:
%%sql
SELECT *
FROM titles

  INNER JOIN crew 
    ON crew.title_id = 
       titles.title_id


  INNER JOIN people
	 ON people.person_id = 
       crew.person_id;
Running query in 'postgresql://jovyan@127.0.0.1:5432/imdb_lecture'
25368 rows affected.
Out[15]:
title_id type primary_title original_title is_adult premiered ended runtime_minutes genres title_id_1 person_id category job person_id_1 name born died
tt0008572 movie The Silent Master The Silent Master 0 1917 None None Crime,Drama tt0008572 nm0913094 actor None nm0913094 Robert Warwick 1878 1964
tt0008572 movie The Silent Master The Silent Master 0 1917 None None Crime,Drama tt0008572 nm0913094 actor None nm0913094 Robert Warwick 1878 1964
tt0008572 movie The Silent Master The Silent Master 0 1917 None 70 Crime,Drama tt0008572 nm0913094 actor None nm0913094 Robert Warwick 1878 1964
tt0008572 movie The Silent Master The Silent Master 0 1917 None 70 Crime,Drama tt0008572 nm0913094 actor None nm0913094 Robert Warwick 1878 1964
tt0008572 movie The Silent Master The Silent Master 0 1917 None None Crime,Drama tt0008572 nm0913094 actor None nm0913094 Robert Warwick 1878 1964
tt0008572 movie The Silent Master The Silent Master 0 1917 None None Crime,Drama tt0008572 nm0913094 actor None nm0913094 Robert Warwick 1878 1964
tt0008572 movie The Silent Master The Silent Master 0 1917 None 70 Crime,Drama tt0008572 nm0913094 actor None nm0913094 Robert Warwick 1878 1964
tt0008572 movie The Silent Master The Silent Master 0 1917 None 70 Crime,Drama tt0008572 nm0913094 actor None nm0913094 Robert Warwick 1878 1964
tt0009202 movie The House of Glass The House of Glass 0 1918 None 50 Drama tt0009202 nm0154352 director None nm0154352 Emile Chautard 1864 1934
tt0009202 movie The House of Glass The House of Glass 0 1918 None 50 Drama tt0009202 nm0154352 director None nm0154352 Emile Chautard 1864 1934
Truncated to displaylimit of 10.

How do we modify the above query so that it gets the titles and IDs of Michelle Yeoh movies?

Let's cache your query string using some fancy jupysql formatting:

In [16]:
# write your query below
# while it's bad style, for this to work,
# don't end with a semicolon.
ex1_query = """
-- fill in your query here --
"""
In [ ]:
%%sql
{{ex1_query}};

IMDB exercise 2¶

How do we write a query that gets the names of Michelle Yeoh movies that have a rating of at least 8.0?

First, let's create a view called yeoh_movies. More in a bit.

In [ ]:
%%sql
CREATE VIEW yeoh_movies AS (
  {{ex1_query}}
);
In [ ]:
%sql SELECT * FROM yeoh_movies;
In [ ]:
%%sql
SELECT primary_title
FROM ratings
INNER JOIN yeoh_movies
  ON ratings.title_id = yeoh_movies.title_id
WHERE rating >= 7.0;

Quick Peek: The Natural Join¶

In [19]:
%%sql
SELECT *
FROM ratings
INNER JOIN yeoh_movies
  ON ratings.title_id = 
   yeoh_movies.title_id
WHERE rating >= 7.0;
Running query in 'postgresql://jovyan@127.0.0.1:5432/imdb_lecture'
6 rows affected.
Out[19]:
title_id rating votes primary_title title_id_1
tt0190332 7.9 268227 Crouching Tiger, Hidden Dragon tt0190332
tt0397535 7.4 132457 Memoirs of a Geisha tt0397535
tt0190332 7.8 241690 Crouching Tiger, Hidden Dragon tt0190332
tt0190332 7.9 268227 Crouching Tiger, Hidden Dragon tt0190332
tt0397535 7.4 132457 Memoirs of a Geisha tt0397535
tt0190332 7.8 241690 Crouching Tiger, Hidden Dragon tt0190332
In [ ]:
%%sql
SELECT *
FROM ratings
NATURAL JOIN yeoh_movies
WHERE rating >= 7.0;
In [ ]:
%reload_ext sql
In [ ]:
%%sql
SELECT DISTINCT titles.primary_title, titles.title_id
FROM titles
    INNER JOIN crew
        ON crew.title_id = titles.title_id
    INNER JOIN people
        ON people.person_id = crew.person_id
WHERE people.name = 'Morgan Freeman' AND titles.type = 'movie';
In [ ]:
%sql postgresql://jovyan@127.0.0.1:5432/imdb_lecture

Tricky Queries¶

What do these queries do?

In [ ]:
%%sql
SELECT REGEXP_REPLACE(name, '(.*) (.*)', '\1') 
         as firstname,
       COUNT(*) as countname
FROM people
GROUP BY firstname
ORDER BY countname DESC;
In [ ]:
%%sql
SELECT type, 
AVG (CASE WHEN premiered < 2000 THEN runtime_minutes              
     ELSE NULL
     END) AS pre2k_avg, AVG (CASE WHEN premiered >= 2000 THEN runtime_minutes             
     ELSE NULL
     END) AS post2k_avg 
FROM titles
GROUP BY type;