Final Project Tips and Resources

⚠️ The world of data engineering is vast. We don’t have experience with all, or even most of these tools. Part of the goals of this are to help you learn how to learn new tools. This is not an exhaustive guide; but might serve as a useful set of pointers to other tools.

Table of contents

  1. Installing Software Locally
    1. Using Homebrew
      1. Postgres & Homebrew
  2. Non-Relational DBs
  3. ER Diagrams
    1. Tools for Creating ER Diagrams
  4. GitHub & JupyterHub
  5. Using SQL & Python

Installing Software Locally

We recommend everyone install Postgres locally, following their documentation.

Using Homebrew

Homebrew (brew) is a package manager for macOS (and Linux) that simplifies the installation of software. You might find brew useful since it can install many different pieces of software.

Checkout https://brew.sh

However, if you’re comfortable with Conda, you can also use that. Otherwise, we tend to find brew to be easier on macOS.

Postgres & Homebrew

To install PostgreSQL using Homebrew, run the following command in your terminal:

brew install postgresql

After installation, you can start the PostgreSQL service with:

brew services start postgresql

And stop it with:

brew services stop postgresql

Non-Relational DBs

  • Cassandra: A highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers. Documentation
  • Redis: An in-memory key-value store known for its speed and versatility. Documentation
  • CouchDB: A database that uses JSON for documents, JavaScript for MapReduce queries, and regular HTTP for an API. Documentation
  • Neo4j: A graph database that uses graph structures for semantic queries with nodes, edges, and properties. Documentation
  • DynamoDB: A fully managed proprietary NoSQL database service provided by AWS. Documentation

ER Diagrams

Tools for Creating ER Diagrams

  • Lucidchart: A web-based diagramming tool that allows you to create ER diagrams and other types of diagrams. Website
  • Freeform App: A (collaborative) iPad and Mac app which allows you to draw diagrams and take notes. It’s not specific to ER diagrams, but there’s a good chance you already have it. Website
  • dbdiagram.io: An online tool for creating database diagrams by writing code. Website

GitHub & JupyterHub

Because your repository is private, we recommend setting up SSH if you are working on JupyterHub.

Using SQL & Python

Using psycopg3 to Connect to PostgreSQL, Create a Table, and Insert Data

To interact with PostgreSQL using Python, you can use the psycopg3 library. Below is a short example demonstrating how to connect to a database, create a table, and insert some data.

First, install the psycopg3 library if you haven’t already: For more information on psycopg3, you can refer to the following resources:

pip install psycopg[binary]

Then, use the following Python code:

import psycopg

# Connect to your PostgreSQL database
# Make sure it is already running on your computer.
# (This is done by default on JupyterHub)
conn = psycopg.connect("localhost")

# Create a cursor object
cur = conn.cursor()

# Create a table
cur.execute("""
  CREATE TABLE IF NOT EXISTS users (
    id SERIAL PRIMARY KEY,
    name TEXT,
    age INT
  )
""")

# Insert some data into the table
cur.execute("""
  INSERT INTO users (name, age) VALUES
  ('Alice', 30),
  ('Bob', 25),
  ('Charlie', 35)
""")

# Insert data using safe query interpolation
cur.execute(
  "INSERT INTO users (name, age) VALUES (%s, %s)",
  ('David', 28)
)

# Insert data using a prepared statement
# See https://www.postgresql.org/docs/8.1/sql-syntax.html#AEN1368
prepared_statement = "INSERT INTO users (name, age) VALUES ($1, $2)"
cur.execute(prepared_statement, ('Eve', 22))

# Insert data from a list
user_list = [
  ('Frank', 40),
  ('Grace', 29),
  ('Hannah', 33)
]

# Use executemany to insert multiple rows
# This properly escapes quotes, etc.
cur.executemany(
  "INSERT INTO users (name, age) VALUES (%s, %s)",
  user_list
)

# Commit the transaction
conn.commit()

# Close the cursor and connection
cur.close()
conn.close()