[Extra] Sketches¶

In [1]:
## replace the database connection with a database of your own!
%reload_ext sql
%sql postgresql://localhost:5432/baseball
In [2]:
%sql CREATE EXTENSION IF NOT EXISTS plpython3u; -- import extension
Running query in 'postgresql://localhost:5432/baseball'
Out[2]:

Two Very Useful Sketches¶

  • CountMin sketches for SELECT COUNT(*) ... WHERE col = k
    • Also WHERE col BETWEEN k AND l
  • HyperLogLog sketches for SELECT COUNT(DISTINCT col)

Intuition for CountMin Sketch¶

SELECT COUNT(*) ... WHERE col = x

Incredibly simple idea!

Intuition:

  • Pick a hash function $h$ that maps the data type of col to integers
  • Create an array $A$ of $b$ counters
  • For each value $v$ in col, increment the counter at $A[h(v)]$
  • To compute the count of rows WHERE col = x, return $A[h(x)]$
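The single-array idea above fits in a few lines of Python. This is an illustration only: Python's built-in `hash` stands in for the hash function $h$, and the names (`insert`, `estimate`) are made up for the sketch.

```python
b = 1024  # number of counters in the array A

def h(v):
    # Python's built-in hash stands in for the hash function h
    return hash(v) % b

A = [0] * b

def insert(v):
    A[h(v)] += 1

def estimate(x):
    # COUNT(*) WHERE col = x; may overestimate due to collisions, never underestimates
    return A[h(x)]

for v in ["a", "b", "a", "c", "a"]:
    insert(v)

print(estimate("a"))  # at least 3; exactly 3 unless "a" collides with "b" or "c"
```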

How bad is our estimate?

  • Could be too high, due to hash "collisions" (never too low!)
  • If we collide with key $y \ne x$, we return the sum of their frequencies
    • $A[h(x)] = f_x + \sum_{y \in S}f_y \hspace{2em}$ where $S = \{y \ne x : h(y) = h(x)\}$
    • We expect $x$ to collide with $1/b$ of the values in our data
    • So we expect $A[h(x)] = f_x + \frac{1}{b}\sum_{y \ne x}f_y \le f_x + \frac{n}{b}$
      • after all, the sum of all frequencies is $n$
    • If we want to bound our expected overestimate by $\epsilon n$, just choose $b = 1/\epsilon$.
    • Note this has no dependence on $n$, the size of our dataset: scales great!

Why not stop here?

  • We bounded the expected error by $\epsilon n$!
  • But the variance in the error can be high
    • some buckets will be "unlucky"!
    • We want the probability that the error exceeds $\epsilon n$ to be at most $\delta$.
  • So let's use the idea of independent samples to help.

CountMin Sketch¶

  • Repeat the idea above with $l$ pairwise-independent hash functions.
    • Easy to parameterize these from a "family"
    • The CountMin Sketch is an array of $l$ rows and $b$ columns
      • each row "belongs" to one of the hash functions
  • Insertion: for each value $v$ in the input and each hash function $h_i$, increment $A_i[h_i(v)]$
  • To compute the count of key $x$, look up all $l$ values $A_i[h_i(x)]$
    • Each is an overestimate...
    • So return the minimum: $\min_{i=1}^{l} A_i[h_i(x)]$!
  • With some more work, you can show that we want $l = \log_2 \frac{1}{\delta}$
    • This bounds the probability of exceeding $\epsilon n$ at $\delta$
    • Google for the math or take Prof. Nelson's class!
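A minimal CountMin sketch along these lines might look as follows. This is a sketch under assumptions: real implementations draw the $l$ hash functions from a pairwise-independent family, whereas here we improvise by salting Python's built-in `hash` with per-row random values; the class and method names are invented for illustration.

```python
import random

class CountMinSketch:
    """Minimal CountMin sketch: l rows of b counters, one hash per row."""

    def __init__(self, l=8, b=1024, seed=42):
        rng = random.Random(seed)
        self.b = b
        # one random salt per row; hash((salt, v)) stands in for h_i(v)
        self.salts = [rng.getrandbits(32) for _ in range(l)]
        self.A = [[0] * b for _ in range(l)]

    def _h(self, i, v):
        return hash((self.salts[i], v)) % self.b

    def insert(self, v):
        # increment one counter in every row
        for i, row in enumerate(self.A):
            row[self._h(i, v)] += 1

    def count(self, x):
        # each row overestimates, so take the minimum across rows
        return min(row[self._h(i, x)] for i, row in enumerate(self.A))

cm = CountMinSketch()
for v in [7] * 100 + list(range(1000)):
    cm.insert(v)
print(cm.count(7))  # at least 101 (the true frequency), and close to it
```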

CountMin Sketches in Practice¶

  • Easy to write

    • You can find lots of reference implementations online in Java, C++, Python, etc.
    • PostgreSQL has a package called Apache MADlib that provides CountMin sketches and lots of Stat/ML routines in SQL
  • Scales to arbitrarily large data sets!

  • In practice, all the hashing at construction time can be slow

    • Needs to be paid off by many queries
    • Parallelizes trivially though!
  • PostgreSQL/MADlib example below

    • MADlib CMsketch is set to $l = 8, b = 1024$. With 64-bit integers this is just 64 KB!
      • $\epsilon = 2/b \approx .002$
      • $\delta = (1/2)^l \approx .004$ (i.e., $99.6\%$ probability of being within $\epsilon n$!)
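The arithmetic behind those numbers checks out in a couple of lines (assuming, as above, 8-byte counters):

```python
l, b = 8, 1024               # MADlib's default sketch dimensions, per the bullet above
eps = 2 / b                  # additive error bound, as a fraction of n
delta = 0.5 ** l             # probability of exceeding that bound
memory_kb = l * b * 8 / 1024 # l * b counters of 8 bytes each
print(eps, delta, memory_kb) # 0.001953125 0.00390625 64.0
```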
In [3]:
## MADlib is only compatible with PostgreSQL 12
## My PostgreSQL 12 installation didn't have Python3
## So let's switch connections now
%reload_ext sql
%sql postgresql://localhost:5433/baseball
Connecting and switching to connection postgresql://localhost:5433/baseball
In [ ]:
%%sql
WITH data AS (SELECT floor(random()*10)::integer AS class, 
                     floor(exp(log(random()*100)))::integer AS a1 FROM generate_series(1,100000)),
     sketch AS (SELECT madlib.cmsketch(class) AS class_cm, madlib.cmsketch(a1) AS a1_cm FROM data)
SELECT 'sketch' as method, 
       madlib.cmsketch_count(class_cm, 7) as class_7, madlib.cmsketch_count(class_cm, 9) as class_9, 
       madlib.cmsketch_count(a1_cm, 3) as a1_3, madlib.cmsketch_count(a1_cm, 7) as a1_7
  FROM sketch
UNION ALL
SELECT 'actual', 
       sum(CASE WHEN class = 7 THEN 1 ELSE 0 END), sum(CASE WHEN class = 9 THEN 1 ELSE 0 END),
       sum(CASE WHEN a1 = 3 THEN 1 ELSE 0 END), sum(CASE WHEN a1 = 7 THEN 1 ELSE 0 END)
  FROM data;
In [ ]:
%%sql
WITH sketch AS (SELECT madlib.cmsketch(hr) AS hr_cm FROM batting)
SELECT 'sketch' as method, 
       madlib.cmsketch_count(hr_cm, 40)
  FROM sketch
UNION ALL
SELECT 'actual', 
       COUNT(*)
  FROM batting
 WHERE hr = 40;

HyperLogLog Sketch¶

SELECT COUNT(DISTINCT col) FROM table

Problem:

  • Imagine you've seen 1 billion distinct values so far
    • Originally studied to classify network packet streams at line rate
    • How many unique source/destination pairs have we seen?
  • To see if a new row has a new value, we need to remember the previous 1 billion!
  • Can we do this in a small amount of space?

HyperLogLog is one of many solutions to this problem.

  • Jelani Nelson is a co-author on the first asymptotically space- and time-optimal algorithm for this problem.
  • Follow-up work has since completely resolved the asymptotic space complexity of this problem!

HyperLogLog Intuition¶

  • For each value $v$ we see, compute a hash $h(v)$
    • Generates a number chosen uniformly at random between 0 and $\infty$
    • How many leading 0's do we expect in its binary representation?
In [ ]:
# https://www.geeksforgeeks.org/number-of-leading-zeros-in-binary-representation-of-a-given-number/
def countZeros(x):
    # Count leading zeros in the 32-bit binary representation of x.
    total_bits = 32
    x &= (1 << total_bits) - 1   # keep only the low 32 bits
    if x == 0:
        return total_bits        # avoid an infinite loop on zero
    # Keep shifting x by one until the leftmost bit becomes 1.
    res = 0
    while ((x & (1 << (total_bits - 1))) == 0):
        x = (x << 1)
        res += 1
    return res


from random import randint
import pandas as pd
# randint stands in for a uniformly distributed 32-bit hash value
s = pd.Series([countZeros(randint(0, 2**32 - 1)) for i in range(1, 1000)])
s.plot.hist(grid=True, bins=20, rwidth=0.9, color='#607c8e')

Run the cell above a few times and you'll see:

  • About 1/2 the values have no leading zeros. Makes sense!
    • These are random bit strings, so odds that first bit is 1 is $50\%$
  • Decays by a factor of 2 for each bar to the right. Makes sense!
    • Odds that the first 2 bits are 10 is $25\%$. Etc.
  • Continuing to divide by 2, we expect the rightmost non-zero bar to be at $\rho = \log_2(n) - 1$
    • Where $n$ is the total number of values
    • But note: adding duplicate values raises all bars but doesn't add any new bars to the right!
    • So $2^{\rho + 1}$ is a good estimator of COUNT(DISTINCT)!!
  • But ... lots of variance across trials.
  • HyperLogLog breaks the input into subsets and uses the harmonic mean of the resulting estimates
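The bucketing idea can be sketched in a few lines of Python. This is a simplified illustration, not a faithful HyperLogLog: it keeps the bucket-splitting and harmonic-mean combination but omits HLL's small- and large-range bias corrections, and uses md5 as a stand-in hash; the function names are invented.

```python
import hashlib

def h64(v):
    # 64-bit hash via md5 (just a convenient stand-in for h)
    return int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "big")

def hll_estimate(values, p=10):
    # p bits of the hash pick one of m = 2^p buckets; each bucket keeps the
    # max "leading zeros + 1" (rho) seen among its values.
    m = 1 << p
    M = [0] * m
    for v in values:
        x = h64(v)
        j = x >> (64 - p)                     # bucket index: top p bits
        w = x & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        rho = (64 - p) - w.bit_length() + 1   # leading zeros in w, plus 1
        M[j] = max(M[j], rho)
    # combine per-bucket estimates with a (scaled) harmonic mean
    alpha = 0.7213 / (1 + 1.079 / m)          # standard HLL constant
    return alpha * m * m / sum(2.0 ** -r for r in M)

est = hll_estimate(range(10000))
print(est)  # typically within a few percent of 10000
```

Note that feeding in duplicates leaves the estimate unchanged, since each bucket only tracks a maximum.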

HyperLogLog in Practice¶

  • With $m$ counters ("registers"), gives a relative error of about $1.04/\sqrt{m}$
    • E.g. $2\%$ accuracy with about 1.5 KB of memory
  • You can find lots of implementations online
  • Apache MADlib has an older sketch from the same lead author (P. Flajolet) called an FM sketch
In [ ]:
%%sql
WITH data AS (SELECT floor(random()*10)::integer AS class, 
                     floor(exp(log(random()*100)))::integer AS a1 FROM generate_series(1,100000)),
     approx AS (SELECT madlib.fmsketch_dcount(class) AS class_fm, madlib.fmsketch_dcount(a1) AS a1_fm FROM data)
SELECT 'sketch' as method, 
       *
  FROM approx
UNION ALL
SELECT 'actual', 
       COUNT(DISTINCT class), COUNT(DISTINCT a1)
  FROM data;

More on Sketching¶

Sketches are like materialized views that can approximate the answer to a class of queries.

  • Like Materialized views, they take time to build, and need to be kept "fresh"
    • But they're typically tiny and insensitive to input size, which is very cool
    • Can pass them around for all kinds of tricks: e.g. ship to apps in browser or phone, etc.
  • Sketches typically work as streaming algorithms, which is nice
    • Most support incremental additions
    • Some support deletions
  • Many can be computed in parallel
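The parallelism point follows from mergeability: sketches built with the same hash functions over two partitions of the data can be combined into a sketch of the union. For the single-row counter array from earlier (a hypothetical minimal setup, using Python's `hash` as the hash function), merging is just elementwise addition:

```python
b = 256

def h(v):
    # Python's built-in hash as a stand-in; must be the SAME for all partitions
    return hash(v) % b

def build(values):
    A = [0] * b
    for v in values:
        A[h(v)] += 1
    return A

def merge(A1, A2):
    # elementwise addition merges two sketches built with the same h
    return [a + c for a, c in zip(A1, A2)]

part1 = build(["x", "y", "x"])           # sketch of partition 1
part2 = build(["x", "z"])                # sketch of partition 2
merged = build(["x", "y", "x", "x", "z"])  # sketch of all the data at once
assert merge(part1, part2) == merged
```

CountMin rows merge the same way, which is why building sketches over partitions in parallel and summing them gives exactly the sketch of the whole dataset.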

More on Sketching, cont.¶

  • There are more sketch types and variants to handle more classes of queries
    • "Heavy Hitter" queries (return the top k most popular values in the stream)
    • Existence (membership) queries (the earliest sketch: Bloom Filters)
    • Count-Range queries
    • Histograms
    • Approximate data cubes
    • Etc.

Sketches are mostly used in high-volume streaming settings

  • The approximation/performance tradeoff has to be acceptable
  • You need to run LOTS of queries on the sketch to amortize the cost of hashing
  • Not typically supported in database systems even today

Additional Resources¶

  • Prof. Nelson's graduate course at Berkeley
  • Book: Cormode/Garofalakis/Haas/Jermaine Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches