Lecture 04: Subqueries, Aggregation¶

In [1]:
# Run this cell to set up imports
import numpy as np
import pandas as pd

Today's database is a modified version of the Oakland police stops database as made public by the Stanford Open Policing Project: https://openpolicing.stanford.edu/data/ Read more about the database here.

If you are looking for how we adjusted the data, contact us and we can share our source notebook.

In [2]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS stops_lecture'
!psql -h localhost -c 'CREATE DATABASE stops_lecture' 
!psql -h localhost -d stops_lecture -f data/stops_lecture.sql
DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 84779
COPY 211
CREATE INDEX
CREATE INDEX

jupysql¶

We are going to be using the jupysql library to connect our notebook to a PostgreSQL database server on your jupyterhub account. The next cell should do the trick; you should not see any error messages after it completes.

In [3]:
%load_ext sql
In [4]:
%sql postgresql://127.0.0.1:5432/stops_lecture
In [5]:
%sql SELECT * FROM Zips LIMIT 10;
Running query in 'postgresql://127.0.0.1:5432/stops_lecture'
10 rows affected.
Out[5]:
zip_id location zipcode
0 None 94546
1 Chabot Park 94546
2 Panoramic Hill 94563
3 None 94577
4 North Stonehurst 94577
5 Arrowhead Marsh 94601
6 Bartlett 94601
7 Coliseum Industrial 94601
8 Fairfax 94601
9 Fairfax Business-Wentworth-Holland 94601
Truncated to displaylimit of 10.

With cell magic (%%sql), we can also assign the result to a variable. Try adding result << to the top line of the below cell, i.e., replace the first line with

%%sql result <<

In [ ]:
%%sql
SELECT *
FROM Stops
ORDER BY RANDOM()
LIMIT 10;
In [ ]:
result

Exercise #1¶

In [ ]:
%%sql
SELECT *
FROM Stops
WHERE id = 123;
In [ ]:
%%sql
SELECT S1.id, S1.race,
       S1.location
FROM Stops as S1,
     Stops as S2
WHERE S1.location = S2.location 
      AND S2.id = 123;

Approach #2 with CTE:

In [ ]:
%%sql
WITH Location123 AS (
    SELECT location
    FROM Stops
    WHERE id = 123
)
SELECT S.id, S.race,
       S.location
FROM Stops as S,
    Location123
WHERE S.location = Location123.location;

Approach #3 with scalar subquery:

In [ ]:
%%sql
SELECT S1.id, S1.race, S1.location
FROM Stops S1
WHERE S1.location = (
         SELECT S2.location
         FROM Stops
         S2 WHERE S2.id = 123);

EXISTS Subquery (and NOT EXISTS)¶

In [ ]:
%%sql
SELECT DISTINCT Stops.location FROM Stops
WHERE NOT EXISTS (
  SELECT * FROM Zips
  WHERE Zips.location = 
          Stops.location);

Exercise #2¶

In [ ]:
%%sql
WITH StopZips AS (
    SELECT * FROM Stops NATURAL JOIN Zips
)
SELECT *
FROM StopZips SZ1
WHERE NOT EXISTS (
    ...
);

Note -- there are no stops that re the only one in their zipcode:

In [ ]:
%%sql
SELECT COUNT(*) as count
FROM Stops S1
  NATURAL JOIN Zips Z1
GROUP BY Z1.zipcode
ORDER BY count ASC;

Exercise #3¶

In [ ]:
%%sql
SELECT S1.id
FROM Stops S1 
WHERE S1.age >= 
	(...);

Exercise #4¶

In [ ]:
%%sql
SELECT race, 
AVG (...) AS west_oakland_avg, 
AVG (...) AS rockridge_avg 
FROM ...
...;