7.1 Database Sampling

Last Updated: December 5th, 2025

Why Use Approximation?

Pros

Cons

What Can We Do Instead?

When to Use Database Pipeline Sampling?

Database Sampling Drawbacks

Reservoir Sampling

Reservoir sampling is a technique for obtaining a fixed-size simple random sample (SRS) of k records from a dataset. Each record has an equal probability (k/n) of being included in the sample, and the algorithm runs in linear time with a single pass, even when the total number of records (n) is not known in advance.
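To make the single-pass idea concrete, here is a minimal Python sketch of one classic variant (often called Algorithm R). It is an illustration under assumptions, not the implementation used in these notes: the function name `reservoir_sample` and the `rng` parameter are hypothetical, and `records` stands in for any stream of rows.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    records: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a simple random sample of up to k records in one pass."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, record in enumerate(records, start=1):
        if i <= k:
            # The first k records fill the reservoir directly.
            reservoir.append(record)
        else:
            # Draw a slot uniformly from [0, i). The new record enters
            # the reservoir with probability k/i, evicting a random slot.
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = record
    return reservoir


# Usage: sample 5 values from a stream whose length is never inspected.
sample = reservoir_sample(iter(range(1_000)), k=5, rng=random.Random(42))
```

Because the i-th record is admitted with probability k/i and each later record evicts it with a matching probability, every record ends up in the final sample with probability k/n. If the stream has fewer than k records, the sketch simply returns all of them.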

How It Works

1. Build a Reservoir

2. Scan the Table

For each record rᵢ:

Example Scenario

Stratified Sampling

We can also conduct stratified sampling, which samples based on an attribute of the data. The goal is a k-sized sample for each group defined by the GROUP BY columns. These groups are called subpopulations (or strata).

Why stratified sampling? In a simple random sample over skewed data, rare groups are often excluded entirely by chance. Stratified sampling guarantees that every group is represented, regardless of how small it is in the full population.
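How likely is it that a simple random sample misses a rare group? The sketch below computes that probability exactly for sampling without replacement. The helper name `prob_group_missed` and the numbers in the usage line are hypothetical, chosen only for illustration.

```python
import math


def prob_group_missed(n: int, m: int, k: int) -> float:
    """Probability that a size-k SRS (without replacement) from n records
    contains none of the m records belonging to a given group."""
    if k > n - m:
        # More draws than non-group records: the group cannot be missed.
        return 0.0
    # Each of the k draws must land on one of the remaining non-group records.
    return math.prod((n - m - i) / (n - i) for i in range(k))


# Hypothetical numbers: 1,000,000 rows, a group of 10 rows, a sample of 1,000.
p = prob_group_missed(n=1_000_000, m=10, k=1_000)
print(f"P(rare group absent from sample) = {p:.4f}")
```

With these numbers the rare group is absent from roughly 99% of samples, which is the failure mode stratified sampling is meant to prevent.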

PostgreSQL’s TABLESAMPLE BERNOULLI does not support stratification: sampling happens during the initial table scan, where each row is kept or dropped independently of its attribute values, so it cannot guarantee k rows per group. However, our reservoir implementation works with GROUP BY!
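One way to picture reservoir sampling combined with grouping is to keep a separate reservoir per group key, still in a single pass. The Python sketch below is an assumption about how such a scheme could look, not the implementation referred to above. The names `stratified_reservoir_sample` and `group_key` are hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def stratified_reservoir_sample(
    records: Iterable[T],
    k: int,
    group_key: Callable[[T], Hashable],
    rng: Optional[random.Random] = None,
) -> Dict[Hashable, List[T]]:
    """Keep an independent size-k reservoir for every group in one pass."""
    rng = rng or random.Random()
    reservoirs: Dict[Hashable, List[T]] = defaultdict(list)
    seen: Dict[Hashable, int] = defaultdict(int)  # records seen per group
    for record in records:
        group = group_key(record)
        seen[group] += 1
        reservoir = reservoirs[group]
        if len(reservoir) < k:
            reservoir.append(record)
        else:
            # Same replacement rule as plain reservoir sampling, but the
            # counter is local to this group.
            j = rng.randrange(seen[group])
            if j < k:
                reservoir[j] = record
    return dict(reservoirs)


# Usage: a skewed table with one large group and one tiny group.
rows = [("common", i) for i in range(1_000)] + [("rare", i) for i in range(3)]
result = stratified_reservoir_sample(rows, k=5, group_key=lambda r: r[0])
```

In this usage the `"rare"` group has only three rows, so all three are kept, while `"common"` is reduced to a uniform sample of five. Memory grows with the number of groups times k, which is worth keeping in mind for high-cardinality grouping columns.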

Sampling Pitfalls

Never Join Samples:

Bias Accumulation:

To avoid bias and inaccuracies: