# Run this cell to set up imports
import numpy as np
import pandas as pd
!unzip -u data/imdb_lecture.zip -d data/
Archive: data/imdb_lecture.zip
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_lecture'
!psql -h localhost -d imdb_lecture -f data/imdb_lecture.sql
ERROR: database "imdb_lecture" is being accessed by other users DETAIL: There is 1 other session using the database. ERROR: database "imdb_lecture" already exists SET SET SET SET SET set_config ------------ (1 row) SET SET SET SET SET SET psql:data/imdb_lecture.sql:35: ERROR: relation "akas" already exists ALTER TABLE psql:data/imdb_lecture.sql:49: ERROR: relation "crew" already exists ALTER TABLE psql:data/imdb_lecture.sql:63: ERROR: relation "episodes" already exists ALTER TABLE psql:data/imdb_lecture.sql:77: ERROR: relation "people" already exists ALTER TABLE psql:data/imdb_lecture.sql:90: ERROR: relation "ratings" already exists ALTER TABLE psql:data/imdb_lecture.sql:109: ERROR: relation "titles" already exists ALTER TABLE COPY 8405 COPY 4043 COPY 4923 COPY 1223 COPY 820 COPY 2420
psql
in Terminal¶psql
is a command-line PostgreSQL interactive client.
I find it useful to keep the Terminal up while I'm working on notebooks for the following:
psql
commands to query information (generally metadata) about the databsepsql
shows me a few rows at a time, and I can quit whenever. Avoids Jupyter notebooks running out of space if the query result relation is huge.To launch psql
and connect to a specific database, say, the imdb_lecture
database we just created on localhost
, open up a Terminal and type in:
psql postgresql://127.0.0.1:5432/imdb_lecture
Note the postgres server is on localhost (i.e., IP address 127.0.0.0.1
) and network port 5432
.
Troubleshooting:
psql
client!\d
, make sure you have created/loaded in the database with the !psql
commands in the previous section.Quick reference:
\l
list all databases available on this server\d
list all relations in this database\d tablename
list schema of tablename relation\q
quit psql\?
help<ctrl>-c
cancel<ctrl>-a
, <ctrl>-e
jump to the front and back of a line, respectively<ctrl>-<left>
, <ctrl>-<right
> jump one word previous and forward, respectively<space>
to advance a page, q
to quit and exit outjupysql
in Jupyter Notebook¶We are going to be using the jupysql
library to connect our notebook to a PostgreSQL database server on your jupyterhub account. The next cell should do the trick; you should not see any error messages after it completes.
%reload_ext sql
Note we did not do import jupysql
(this will throw an error). You should always load jupysql
as the sql
cell magic, as shown above.
jupysql
helps us create a client connection directly from our Notebook. However, just like before, we first need to connect to our database before we start issuing any queries:
%sql postgresql://jovyan@127.0.0.1:5432/imdb_lecture
See the slides for most of the queries, which we executed in the interactive psql
client.
We included just one query here; note we've truncated the result by randomly selecting 10 rows. This is generally good for debugging in Jupyter Notebooks (that being said, jupysql
is smart enough to truncate most results).
Be sure to remove those debugging lines before submitting any final queries for projects!
What's wrong with the following query?
%%sql
SELECT primary_title, type,
premiered AS release_year,
runtime_minutes,
runtime_minutes/60 AS
runtime_hours
FROM titles
WHERE premiered >= 2020 AND
premiered <= 2023;
primary_title | type | release_year | runtime_minutes | runtime_hours |
---|---|---|---|---|
Blood of Zeus | tvSeries | 2020 | 30 | 0 |
Gods & Heroes | tvSeries | 2020 | 30 | 0 |
Shaq Life | tvSeries | 2020 | None | None |
What's After | tvSeries | 2020 | None | None |
Utmark | tvSeries | 2020 | None | None |
La Femme Anjola | movie | 2021 | 140 | 2 |
Mr. Corman | tvSeries | 2021 | 285 | 4 |
Player Vs Player with Trevor Noah | tvSeries | 2021 | None | None |
Run for Young | tvSeries | 2020 | None | None |
Poker Nights | tvSeries | 2021 | 6 | 0 |
What's wrong with the following query?
%%sql
SELECT
person_id, name,
died, born,
died - born AS age
FROM people;
person_id | name | died | born | age |
---|---|---|---|---|
nm0384214 | Dwayne Hill | None | None | None |
nm0362443 | Dave Hardman | None | 1960 | None |
nm1560888 | Rich Pryce-Jones | None | None | None |
nm0006669 | William Sadler | None | 1950 | None |
nm1373094 | Giada De Laurentiis | None | 1970 | None |
nm7316782 | Janine Hartmann | None | None | None |
nm8671663 | Tereza Taliánová | None | 2005 | None |
nm10480297 | Chris Heywood | None | None | None |
nm10803545 | Chengao Zhou | None | None | None |
nm9849414 | Mark Langley | None | None | None |
Compare/contrast the following three queries:
We are going to be using the jupysql
library to connect our notebook to a PostgreSQL database server on your jupyterhub account. The next cell should do the trick; you should not see any error messages after it completes.
%%sql
SELECT born
FROM people;
born |
---|
None |
1960 |
None |
1950 |
1970 |
None |
2005 |
None |
None |
None |
%%sql
SELECT born
FROM people
WHERE born < 2023 OR
born IS NULL;
born |
---|
None |
1960 |
None |
1950 |
1970 |
None |
2005 |
None |
None |
None |
%%sql
SELECT born
FROM people
WHERE born < 2023;
born |
---|
1960 |
1950 |
1970 |
2005 |
1980 |
1926 |
1975 |
1919 |
1940 |
1929 |
%%sql
SELECT *
FROM people
WHERE name LIKE 'Chris%';
person_id | name | born | died |
---|---|---|---|
nm10480297 | Chris Heywood | None | None |
nm9115948 | Chris Bratt | None | None |
nm6699360 | Chris Evans | None | None |
nm12363226 | Chris Daniels | None | None |
nm2653742 | Chris Longo | None | None |
nm5237685 | Chris Evans | None | None |
nm3074588 | Christian Bland | None | None |
nm5646425 | Chris Evans | None | None |
nm1470079 | Chris Boiling | None | None |
nm11632011 | Chris Angold | None | None |
%%sql
SELECT *
FROM
akas, titles
WHERE
titles.title_id =
akas.title_id;
title_id | title | region | language | types | attributes | is_original_title | title_id_1 | type | primary_title | original_title | is_adult | premiered | ended | runtime_minutes | genres |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tt0909144 | 46 | US | None | None | None | 0 | tt0909144 | tvEpisode | 46 | 46 | 0 | 1971 | None | None | Comedy,Family |
tt0909144 | 46 | US | None | None | None | 0 | tt0909144 | tvEpisode | 46 | 46 | 0 | 1971 | None | None | Comedy,Family |
tt3719148 | Gustavo Lopez/Abel Pintos | AR | None | None | None | 0 | tt3719148 | tvEpisode | Gustavo Lopez/Abel Pintos | Gustavo Lopez/Abel Pintos | 0 | 2012 | None | None | Comedy,Talk-Show |
tt3719148 | Gustavo Lopez/Abel Pintos | AR | None | None | None | 0 | tt3719148 | tvEpisode | Gustavo Lopez/Abel Pintos | Gustavo Lopez/Abel Pintos | 0 | 2012 | None | None | Comedy,Talk-Show |
tt9047618 | The Women in the Sand | GB | None | imdbDisplay | None | 0 | tt9047618 | movie | The Women in the Sand | The Women in the Sand | 0 | 2017 | None | 73 | Documentary |
tt9047618 | The Women in the Sand | GB | None | imdbDisplay | None | 0 | tt9047618 | movie | The Women in the Sand | The Women in the Sand | 0 | 2017 | None | 73 | Documentary |
tt1259521 | La cabaña del terror | AR | None | imdbDisplay | None | 0 | tt1259521 | movie | The Cabin in the Woods | The Cabin in the Woods | 0 | 2011 | None | 95 | Horror |
tt1259521 | La cabaña del terror | AR | None | imdbDisplay | None | 0 | tt1259521 | movie | The Cabin in the Woods | The Cabin in the Woods | 0 | 2011 | None | 95 | Horror |
tt5557622 | एपिसोड #1.15 | IN | hi | None | None | 0 | tt5557622 | tvEpisode | Episode #1.15 | Episode #1.15 | 0 | 2014 | None | 59 | Drama |
tt5557622 | एपिसोड #1.15 | IN | hi | None | None | 0 | tt5557622 | tvEpisode | Episode #1.15 | Episode #1.15 | 0 | 2014 | None | 59 | Drama |
%%sql
SELECT *
FROM
akas AS A,
titles T
WHERE
A.title_id = T.title_id;
title_id | title | region | language | types | attributes | is_original_title | title_id_1 | type | primary_title | original_title | is_adult | premiered | ended | runtime_minutes | genres |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tt0909144 | 46 | US | None | None | None | 0 | tt0909144 | tvEpisode | 46 | 46 | 0 | 1971 | None | None | Comedy,Family |
tt0909144 | 46 | US | None | None | None | 0 | tt0909144 | tvEpisode | 46 | 46 | 0 | 1971 | None | None | Comedy,Family |
tt3719148 | Gustavo Lopez/Abel Pintos | AR | None | None | None | 0 | tt3719148 | tvEpisode | Gustavo Lopez/Abel Pintos | Gustavo Lopez/Abel Pintos | 0 | 2012 | None | None | Comedy,Talk-Show |
tt3719148 | Gustavo Lopez/Abel Pintos | AR | None | None | None | 0 | tt3719148 | tvEpisode | Gustavo Lopez/Abel Pintos | Gustavo Lopez/Abel Pintos | 0 | 2012 | None | None | Comedy,Talk-Show |
tt9047618 | The Women in the Sand | GB | None | imdbDisplay | None | 0 | tt9047618 | movie | The Women in the Sand | The Women in the Sand | 0 | 2017 | None | 73 | Documentary |
tt9047618 | The Women in the Sand | GB | None | imdbDisplay | None | 0 | tt9047618 | movie | The Women in the Sand | The Women in the Sand | 0 | 2017 | None | 73 | Documentary |
tt1259521 | La cabaña del terror | AR | None | imdbDisplay | None | 0 | tt1259521 | movie | The Cabin in the Woods | The Cabin in the Woods | 0 | 2011 | None | 95 | Horror |
tt1259521 | La cabaña del terror | AR | None | imdbDisplay | None | 0 | tt1259521 | movie | The Cabin in the Woods | The Cabin in the Woods | 0 | 2011 | None | 95 | Horror |
tt5557622 | एपिसोड #1.15 | IN | hi | None | None | 0 | tt5557622 | tvEpisode | Episode #1.15 | Episode #1.15 | 0 | 2014 | None | 59 | Drama |
tt5557622 | एपिसोड #1.15 | IN | hi | None | None | 0 | tt5557622 | tvEpisode | Episode #1.15 | Episode #1.15 | 0 | 2014 | None | 59 | Drama |
%%sql
SELECT *
FROM akas A
INNER JOIN titles T
ON A.title_id = T.title_id
title_id | title | region | language | types | attributes | is_original_title | title_id_1 | type | primary_title | original_title | is_adult | premiered | ended | runtime_minutes | genres |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tt0909144 | 46 | US | None | None | None | 0 | tt0909144 | tvEpisode | 46 | 46 | 0 | 1971 | None | None | Comedy,Family |
tt0909144 | 46 | US | None | None | None | 0 | tt0909144 | tvEpisode | 46 | 46 | 0 | 1971 | None | None | Comedy,Family |
tt3719148 | Gustavo Lopez/Abel Pintos | AR | None | None | None | 0 | tt3719148 | tvEpisode | Gustavo Lopez/Abel Pintos | Gustavo Lopez/Abel Pintos | 0 | 2012 | None | None | Comedy,Talk-Show |
tt3719148 | Gustavo Lopez/Abel Pintos | AR | None | None | None | 0 | tt3719148 | tvEpisode | Gustavo Lopez/Abel Pintos | Gustavo Lopez/Abel Pintos | 0 | 2012 | None | None | Comedy,Talk-Show |
tt9047618 | The Women in the Sand | GB | None | imdbDisplay | None | 0 | tt9047618 | movie | The Women in the Sand | The Women in the Sand | 0 | 2017 | None | 73 | Documentary |
tt9047618 | The Women in the Sand | GB | None | imdbDisplay | None | 0 | tt9047618 | movie | The Women in the Sand | The Women in the Sand | 0 | 2017 | None | 73 | Documentary |
tt1259521 | La cabaña del terror | AR | None | imdbDisplay | None | 0 | tt1259521 | movie | The Cabin in the Woods | The Cabin in the Woods | 0 | 2011 | None | 95 | Horror |
tt1259521 | La cabaña del terror | AR | None | imdbDisplay | None | 0 | tt1259521 | movie | The Cabin in the Woods | The Cabin in the Woods | 0 | 2011 | None | 95 | Horror |
tt5557622 | एपिसोड #1.15 | IN | hi | None | None | 0 | tt5557622 | tvEpisode | Episode #1.15 | Episode #1.15 | 0 | 2014 | None | 59 | Drama |
tt5557622 | एपिसोड #1.15 | IN | hi | None | None | 0 | tt5557622 | tvEpisode | Episode #1.15 | Episode #1.15 | 0 | 2014 | None | 59 | Drama |
What does each record represent in the below result? Why?
%%sql
SELECT *
FROM titles
INNER JOIN crew
ON crew.title_id =
titles.title_id
INNER JOIN people
ON people.person_id =
crew.person_id;
title_id | type | primary_title | original_title | is_adult | premiered | ended | runtime_minutes | genres | title_id_1 | person_id | category | job | person_id_1 | name | born | died |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | None | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | None | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | 70 | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | 70 | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | None | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | None | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | 70 | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0008572 | movie | The Silent Master | The Silent Master | 0 | 1917 | None | 70 | Crime,Drama | tt0008572 | nm0913094 | actor | None | nm0913094 | Robert Warwick | 1878 | 1964 |
tt0009202 | movie | The House of Glass | The House of Glass | 0 | 1918 | None | 50 | Drama | tt0009202 | nm0154352 | director | None | nm0154352 | Emile Chautard | 1864 | 1934 |
tt0009202 | movie | The House of Glass | The House of Glass | 0 | 1918 | None | 50 | Drama | tt0009202 | nm0154352 | director | None | nm0154352 | Emile Chautard | 1864 | 1934 |
How do we modify the above query so that it gets the titles and IDs of Michelle Yeoh movies?
Let's cache your query string using some fancy jupysql
formatting:
# write your query below
# while it's bad style, for this to work,
# don't end with a semicolon.
ex1_query = """
-- fill in your query here --
"""
%%sql
{{ex1_query}};
How do we write a query that gets the names of Michelle Yeoh movies that have a rating of at least 8.0?
First, let's create a view called yeoh_movies
. More in a bit.
%%sql
CREATE VIEW yeoh_movies AS (
{{ex1_query}}
);
%sql SELECT * FROM yeoh_movies;
%%sql
SELECT primary_title
FROM ratings
INNER JOIN yeoh_movies
ON ratings.title_id = yeoh_movies.title_id
WHERE rating >= 7.0;
%%sql
SELECT *
FROM ratings
INNER JOIN yeoh_movies
ON ratings.title_id =
yeoh_movies.title_id
WHERE rating >= 7.0;
title_id | rating | votes | primary_title | title_id_1 |
---|---|---|---|---|
tt0190332 | 7.9 | 268227 | Crouching Tiger, Hidden Dragon | tt0190332 |
tt0397535 | 7.4 | 132457 | Memoirs of a Geisha | tt0397535 |
tt0190332 | 7.8 | 241690 | Crouching Tiger, Hidden Dragon | tt0190332 |
tt0190332 | 7.9 | 268227 | Crouching Tiger, Hidden Dragon | tt0190332 |
tt0397535 | 7.4 | 132457 | Memoirs of a Geisha | tt0397535 |
tt0190332 | 7.8 | 241690 | Crouching Tiger, Hidden Dragon | tt0190332 |
%%sql
SELECT *
FROM ratings
NATURAL JOIN yeoh_movies
WHERE rating >= 7.0;
%reload_ext sql
%%sql
SELECT DISTINCT titles.primary_title, titles.title_id
FROM titles
INNER JOIN crew
ON crew.title_id = titles.title_id
INNER JOIN people
ON people.person_id = crew.person_id
WHERE people.name = 'Morgan Freeman' AND titles.type = 'movie';
%sql postgresql://jovyan@127.0.0.1:5432/imdb_lecture
!pg_dump --encoding utf8 imdb_lecture -f imdb_lecture_final.sql