[Extra] Sketches¶

In [1]:
## replace the database connection with a database of your own!
%reload_ext sql
%sql postgresql://localhost:5432/baseball
In [2]:
%sql CREATE EXTENSION IF NOT EXISTS plpython3u; -- import extension
Running query in 'postgresql://localhost:5432/baseball'
Out[2]:

Two Very Useful Sketches¶

  • CountMin sketches for SELECT COUNT(*) ... WHERE col = k
    • Also WHERE col BETWEEN k AND l
  • HyperLogLog sketches for SELECT COUNT(DISTINCT col)

Intuition for CountMin Sketch¶

SELECT COUNT(*) ... WHERE col = x

Incredibly simple idea!

Intuition:

  • Pick a hash function $h$ that maps the data type of col to integers
  • Create an array $A$ of $b$ counters
  • For each value $v$ in col, increment the counter at $A[h(v)]$
  • To compute the count of rows WHERE col = x, return $A[h(x)]$
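The single-array idea above fits in a few lines of Python. This is an illustration only: Python's built-in `hash` stands in for the hash function $h$, and the names (`insert`, `estimate`) are made up for the sketch.

```python
b = 1024  # number of counters in the array A

def h(v):
    # Python's built-in hash stands in for the hash function h
    return hash(v) % b

A = [0] * b

def insert(v):
    A[h(v)] += 1

def estimate(x):
    # COUNT(*) WHERE col = x; may overestimate due to collisions, never underestimates
    return A[h(x)]

for v in ["a", "b", "a", "c", "a"]:
    insert(v)

print(estimate("a"))  # at least 3; exactly 3 unless "a" collides with "b" or "c"
```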

How bad is our estimate?

  • Could be too high, due to hash "collisions" (never too low!)
  • If we collide with key $y \ne x$, we return the sum of their frequencies
    • $A[h(x)] = f_x + \sum_{y \in S}f_y \hspace{2em}$ where $S = \{y \ne x : h(y) = h(x)\}$
    • We expect $x$ to collide with $1/b$ of the values in our data
    • So we expect $A[h(x)] = f_x + \frac{1}{b}\sum_{y \ne x}f_y \le f_x + \frac{n}{b}$
      • after all, the sum of all frequencies is $n$
    • If we want to bound our expected overestimate by $\epsilon n$, just choose $b = 1/\epsilon$.
    • Note this has no dependence on $n$, the size of our dataset: scales great!

Why not stop here?

  • We bounded the expected error by $\epsilon n$!
  • But the variance in the error can be high
    • some buckets will be "unlucky"!
    • We want the probability that the error exceeds $\epsilon n$ to be at most $\delta$.
  • So let's use the idea of independent samples to help.

CountMin Sketch¶

  • Repeat the idea above with $l$ pairwise-independent hash functions.
    • Easy to parameterize these from a "family"
    • The CountMin Sketch is an array of $l$ rows and $b$ columns
      • each row "belongs" to one of the hash functions
  • Insertion: for each value $v$ in the input and each hash function $h_i$, increment $A_i[h_i(v)]$
  • To compute the count of key $x$, look up all $l$ values $A_i[h_i(x)]$
    • Each is an overestimate...
    • So return the minimum: $\min_{i=1}^{l} A_i[h_i(x)]$!
  • With some more work, you can show that we want $l = \log_2 \frac{1}{\delta}$
    • This bounds the probability of exceeding $\epsilon n$ at $\delta$
    • Google for the math or take Prof. Nelson's class!
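A minimal CountMin sketch along these lines might look as follows. This is a sketch under assumptions: real implementations draw the $l$ hash functions from a pairwise-independent family, whereas here we improvise by salting Python's built-in `hash` with per-row random values; the class and method names are invented for illustration.

```python
import random

class CountMinSketch:
    """Minimal CountMin sketch: l rows of b counters, one hash per row."""

    def __init__(self, l=8, b=1024, seed=42):
        rng = random.Random(seed)
        self.b = b
        # one random salt per row; hash((salt, v)) stands in for h_i(v)
        self.salts = [rng.getrandbits(32) for _ in range(l)]
        self.A = [[0] * b for _ in range(l)]

    def _h(self, i, v):
        return hash((self.salts[i], v)) % self.b

    def insert(self, v):
        # increment one counter in every row
        for i, row in enumerate(self.A):
            row[self._h(i, v)] += 1

    def count(self, x):
        # each row overestimates, so take the minimum across rows
        return min(row[self._h(i, x)] for i, row in enumerate(self.A))

cm = CountMinSketch()
for v in [7] * 100 + list(range(1000)):
    cm.insert(v)
print(cm.count(7))  # at least 101 (the true frequency), and close to it
```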

CountMin Sketches in Practice¶

  • Easy to write

    • You can find lots of reference implementations online in Java, C++, Python, etc.
    • PostgreSQL has a package called Apache MADlib that provides CountMin sketches and lots of Stat/ML routines in SQL
  • Scales to arbitrarily large data sets!

  • In practice, all the hashing at construction time can be slow

    • Needs to be paid off by many queries
    • Parallelizes trivially though!
  • PostgreSQL/MADlib example below

    • MADlib CMsketch is set to $l = 8, b = 1024$. With 64-bit integers this is just 64 KB!
      • $\epsilon = 2/b \approx .002$
      • $\delta = (1/2)^l \approx .004$ (i.e., $99.6\%$ probability of being within $\epsilon n$!)
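The arithmetic behind those numbers checks out in a couple of lines (assuming, as above, 8-byte counters):

```python
l, b = 8, 1024               # MADlib's default sketch dimensions, per the bullet above
eps = 2 / b                  # additive error bound, as a fraction of n
delta = 0.5 ** l             # probability of exceeding that bound
memory_kb = l * b * 8 / 1024 # l * b counters of 8 bytes each
print(eps, delta, memory_kb) # 0.001953125 0.00390625 64.0
```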
In [3]:
## MADlib is only compatible with PostgreSQL 12
## My PostgreSQL 12 installation didn't have Python3
## So let's switch connections now
%reload_ext sql
%sql postgresql://localhost:5433/baseball
Connecting and switching to connection postgresql://localhost:5433/baseball
In [ ]:
%%sql
WITH data AS (SELECT floor(random()*10)::integer AS class, 
                     floor(exp(log(random()*100)))::integer AS a1 FROM generate_series(1,100000)),
     sketch AS (SELECT madlib.cmsketch(class) AS class_cm, madlib.cmsketch(a1) AS a1_cm FROM data)
SELECT 'sketch' as method, 
       madlib.cmsketch_count(class_cm, 7) as class_7, madlib.cmsketch_count(class_cm, 9) as class_9, 
       madlib.cmsketch_count(a1_cm, 3) as a1_3, madlib.cmsketch_count(a1_cm, 7) as a1_7
  FROM sketch
UNION ALL
SELECT 'actual', 
       sum(CASE WHEN class = 7 THEN 1 ELSE 0 END), sum(CASE WHEN class = 9 THEN 1 ELSE 0 END),
       sum(CASE WHEN a1 = 3 THEN 1 ELSE 0 END), sum(CASE WHEN a1 = 7 THEN 1 ELSE 0 END)
  FROM data;
In [ ]:
%%sql
WITH sketch AS (SELECT madlib.cmsketch(hr) AS hr_cm FROM batting)
SELECT 'sketch' as method, 
       madlib.cmsketch_count(hr_cm, 40)
  FROM sketch
UNION ALL
SELECT 'actual', 
       COUNT(*)
  FROM batting
 WHERE hr = 40;

HyperLogLog Sketch¶

SELECT COUNT(DISTINCT col) FROM table

Problem:

  • Imagine you've seen 1 billion distinct values so far
    • Originally studied to classify network packet streams at line rate
    • How many unique source/destination pairs have we seen?
  • To see if a new row has a new value, we need to remember the previous 1 billion!
  • Can we do this in a small amount of space?

HyperLogLog is one of many solutions to this problem.

  • Jelani Nelson is a co-author on the first asymptotically space- and time-optimal algorithm for this problem.
  • Follow-up work has since completely resolved the asymptotic space complexity of this problem!

HyperLogLog Intuition¶

  • For each value $v$ we see, compute a hash $h(v)$
    • Generates a number chosen uniformly at random between 0 and $\infty$
    • How many leading 0's do we expect in its binary representation?
In [ ]:
# https://www.geeksforgeeks.org/number-of-leading-zeros-in-binary-representation-of-a-given-number/
def countZeros(x):
    # Count leading zeros in the 32-bit binary representation of x.
    total_bits = 32
    x &= (1 << total_bits) - 1   # keep only the low 32 bits
    if x == 0:
        return total_bits        # avoid an infinite loop on zero
    # Keep shifting x by one until the leftmost bit becomes 1.
    res = 0
    while ((x & (1 << (total_bits - 1))) == 0):
        x = (x << 1)
        res += 1
    return res


from random import randint
import pandas as pd
# randint stands in for a uniformly distributed 32-bit hash value
s = pd.Series([countZeros(randint(0, 2**32 - 1)) for i in range(1, 1000)])
s.plot.hist(grid=True, bins=20, rwidth=0.9, color='#607c8e')

Run the cell above a few times and you'll see:

  • About 1/2 the values have no leading zeros. Makes sense!
    • These are random bit strings, so odds that first bit is 1 is $50\%$
  • Decays by a factor of 2 for each bar to the right. Makes sense!
    • Odds that the first 2 bits are 10 is $25\%$. Etc.
  • Continuing to divide by 2, we expect the rightmost non-zero bar to be at $\rho = \log_2(n) - 1$
    • Where $n$ is the total number of values
    • But note: adding duplicate values raises all bars but doesn't add any new bars to the right!
    • So $2^{\rho + 1}$ is a good estimator of COUNT(DISTINCT)!!
  • But ... lots of variance across trials.
  • HyperLogLog breaks the input into subsets and uses the harmonic mean of the resulting estimates
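The bucketing idea can be sketched in a few lines of Python. This is a simplified illustration, not a faithful HyperLogLog: it keeps the bucket-splitting and harmonic-mean combination but omits HLL's small- and large-range bias corrections, and uses md5 as a stand-in hash; the function names are invented.

```python
import hashlib

def h64(v):
    # 64-bit hash via md5 (just a convenient stand-in for h)
    return int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "big")

def hll_estimate(values, p=10):
    # p bits of the hash pick one of m = 2^p buckets; each bucket keeps the
    # max "leading zeros + 1" (rho) seen among its values.
    m = 1 << p
    M = [0] * m
    for v in values:
        x = h64(v)
        j = x >> (64 - p)                     # bucket index: top p bits
        w = x & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        rho = (64 - p) - w.bit_length() + 1   # leading zeros in w, plus 1
        M[j] = max(M[j], rho)
    # combine per-bucket estimates with a (scaled) harmonic mean
    alpha = 0.7213 / (1 + 1.079 / m)          # standard HLL constant
    return alpha * m * m / sum(2.0 ** -r for r in M)

est = hll_estimate(range(10000))
print(est)  # typically within a few percent of 10000
```

Note that feeding in duplicates leaves the estimate unchanged, since each bucket only tracks a maximum.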

HyperLogLog in Practice¶

  • With $m$ counters ("registers"), gives a relative error of about $1.04/\sqrt{m}$
    • E.g. $2\%$ accuracy with about 1.5 KB of memory
  • You can find lots of implementations online
  • Apache MADlib has an older sketch from the same lead author (P. Flajolet) called an FM sketch
In [ ]:
%%sql
WITH data AS (SELECT floor(random()*10)::integer AS class, 
                     floor(exp(log(random()*100)))::integer AS a1 FROM generate_series(1,100000)),
     approx AS (SELECT madlib.fmsketch_dcount(class) AS class_fm, madlib.fmsketch_dcount(a1) AS a1_fm FROM data)
SELECT 'sketch' as method, 
       *
  FROM approx
UNION ALL
SELECT 'actual', 
       COUNT(DISTINCT class), COUNT(DISTINCT a1)
  FROM data;

More on Sketching¶

Sketches are like materialized views that can approximate the answer to a class of queries.

  • Like Materialized views, they take time to build, and need to be kept "fresh"
    • But they're typically tiny and insensitive to input size, which is very cool
    • Can pass them around for all kinds of tricks: e.g. ship to apps in browser or phone, etc.
  • Sketches typically work as streaming algorithms, which is nice
    • Most support incremental additions
    • Some support deletions
  • Many can be computed in parallel
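The parallelism point follows from mergeability: sketches built with the same hash functions over two partitions of the data can be combined into a sketch of the union. For the single-row counter array from earlier (a hypothetical minimal setup, using Python's `hash` as the hash function), merging is just elementwise addition:

```python
b = 256

def h(v):
    # Python's built-in hash as a stand-in; must be the SAME for all partitions
    return hash(v) % b

def build(values):
    A = [0] * b
    for v in values:
        A[h(v)] += 1
    return A

def merge(A1, A2):
    # elementwise addition merges two sketches built with the same h
    return [a + c for a, c in zip(A1, A2)]

part1 = build(["x", "y", "x"])           # sketch of partition 1
part2 = build(["x", "z"])                # sketch of partition 2
merged = build(["x", "y", "x", "x", "z"])  # sketch of all the data at once
assert merge(part1, part2) == merged
```

CountMin rows merge the same way, which is why building sketches over partitions in parallel and summing them gives exactly the sketch of the whole dataset.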

More on Sketching, cont.¶

  • There are more sketch types and variants to handle more classes of queries
    • "Heavy Hitter" queries (return the top k most popular values in the stream)
    • Existence (membership) queries (the earliest sketch: Bloom Filters)
    • Count-Range queries
    • Histograms
    • Approximate data cubes
    • Etc.

Sketches are mostly used in high-volume streaming settings

  • The approximation/performance tradeoff has to be acceptable
  • You need to run LOTS of queries on the sketch to amortize the cost of hashing
  • Not typically supported in database systems even today

Additional Resources¶

  • Prof. Nelson's graduate course at Berkeley
  • Book: Cormode/Garofalakis/Haas/Jermaine Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches