Reservoir Sampling & Why I <3 Online Algorithms

:: Intro to the Reservoir⌗
It was a good few years ago that I first discovered Reservoir Sampling. It’s one of those techniques that struck a personal chord and earned an immediate “whoa, that’s cool”. Looking back, I think it was the combination of it being one of the first online algorithms (more on this in a bit) I really “got”, and one that’s elegant in both its simplicity and its applicability.
In statistics, sampling is a pretty fundamental operation: you either can’t or do not want to deal with collecting every datapoint that represents a population in its entirety, so instead a subset of the population is studied – the sample – and findings on the sample are then used to generalize to the population. A lot of stats techniques dive deep into this, but a top-level idea is that larger sample sizes give a better representation of the population (with diminishing returns). In a data analysis rather than survey sampling context, this might be equivalent to choosing a subset of rows from a data table to study.
The simplest form of sampling is a random sample: either the set of records is shuffled into a random order and the top n records are selected for the sample, or each record is iterated over and kept with some fixed probability p, resulting in an expected (but not guaranteed) sample size of p times the total number of records.
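The first approach (randomize the order, then take the top n) might be sketched like this; shuffle_sample is just a name I’m using for illustration:

import random

def shuffle_sample(input_data: list, n: int) -> list:
    '''
    input_data: the values to sample from
    n: the exact number of items to return
    '''
    shuffled = list(input_data)  # copy so the caller's list isn't reordered
    random.shuffle(shuffled)     # randomize the order in place
    return shuffled[:n]          # take the top n records as the sample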
A probability-based sampling algorithm might look something like:
import random

def naive_sample(input_data: list, p: float) -> list:
    '''
    input_data: the values to iterate over and sample from
    p: the probability of selecting any given item
    '''
    chosen_sample = []
    for i in input_data:
        # draw a random number in [0, 1); keep the item if it lands below p
        if random.random() < p:
            chosen_sample.append(i)
    return chosen_sample
Note: There are many faster, better ways to sample in Python, including the random library’s own random.sample, or using NumPy’s random functions to generate indices to sample and then slicing them out. My purpose in writing this and the other examples is to compare basic algorithms as they can be implemented in a reader’s language of choice.
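For instance, a rough sketch of those two alternatives (not benchmarked here):

import random
import numpy as np

data = list(range(1_000_000))
n = 100

# the standard library draws n distinct items without replacement
stdlib_sample = random.sample(data, n)

# or generate n random indices with NumPy and slice them out
arr = np.asarray(data)
idx = np.random.choice(len(arr), size=n, replace=False)
numpy_sample = arr[idx]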
This is great if you have ready access to all your data, or are working with a finite number of records, but what if you don’t know how many entries you have to work with? What if n is a very large, but unknown, number? What if n is huge and you want to get a reasonable estimate of things as you process them? What if n is in fact infinite, with more data always expected to stream in and be analyzed on the fly!?
Well, that’s where online algorithms shine and Reservoir Sampling comes to the rescue! Per Wikipedia:
In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start.
Sounds like just the thing we want! Let’s begin with an intro to the Reservoir Sampling algorithm, then I’ll code up both an offline and an online implementation. (Putting it to work estimating the mean and standard deviation of a population is saved for a follow-up post.)
:: Reservoir Sampling⌗
I’ll begin with a technically thorough explanation of the sampling algorithm from some course notes by Dr. Phillips at the University of Utah, then follow with a couple of my own additions and a pass at an implementation of what’s described.
The first step of any reservoir algorithm is to put the first n records of the file into a reservoir. The rest of the records are processed sequentially; records can be selected for the reservoir only as they are processed. An algorithm is a reservoir algorithm if it maintains the invariant that after each record is processed a true random sample of size n can be extracted from the current state of the reservoir. This algorithm runs in O(N) because the entire file must be scanned.
This begins with a seeding of the reservoir: the first n records encountered automatically fill it because there are empty slots available, and the size of the reservoir is the size of the sample. This first step might seem strange or wrong at first. After all, the dataset hasn’t been randomized, so why take the first n records as the sample? Well, by the time the n-th record has been observed, that’s all the information the algorithm has at that moment, so the best it can do is estimate based on everything seen so far.
We assume we have access to a random number generator that runs in constant time. At the i-th element of S, the algorithm generates a random integer j between 1 and i. If j is at most n, the j-th element of the reservoir array is replaced with the i-th element of S. In effect, for each i, the i-th element of S enters the reservoir with probability n/i. For it to still be there at the end, it must also survive every later step; the probability of surviving steps i+1 through N works out to i/N. The total probability of entering and never being replaced is therefore (n/i) * (i/N) = n/N. So when the algorithm finishes executing, every element has had an equal probability of ending up in the reservoir.
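Spelling that argument out in math (my own restatement, with n the reservoir size, N the total number of records, and i the 1-indexed position of a record):

$$
P(\text{record } i \text{ ends up in the final reservoir})
= \frac{n}{i} \prod_{k=i+1}^{N} \left(1 - \frac{1}{k}\right)
= \frac{n}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{N-1}{N}
= \frac{n}{i} \cdot \frac{i}{N}
= \frac{n}{N}
$$

The 1 - 1/k factor is the chance that record i survives step k: the new record enters the reservoir with probability n/k and lands on record i’s particular slot with probability 1/n, so record i is evicted with probability 1/k.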
That passage shows the algorithm in action, with the brilliance of the approach captured by its final claim: every element ends up in the reservoir with equal probability. Let’s take a look at an implementation I drafted based on this description, since a prose description of an algorithm can be hard to pin down on its own.
import random

def reservoir_sample_offline(input_data: list, reservoir_size: int) -> list:
    '''
    input_data: the values to iterate over and sample from
    reservoir_size: the number of items to return in the sample (n)
    '''
    # initialize the reservoir with the proper size
    reservoir = [0 for i in range(0, reservoir_size)]
    for i in range(0, len(input_data)):
        # if the reservoir isn't full yet, fill the next slot
        if i < reservoir_size:
            reservoir[i] = input_data[i]
        else:
            # randomly select an integer between 0 and the number of vals seen (inclusive)
            j = random.randint(0, i)
            # if the randomly chosen value "hit" inside the reservoir
            if j < reservoir_size:
                # replace that entry with the new value
                reservoir[j] = input_data[i]
    return reservoir
So why’s that last statement brilliant? It’s because the probability of any given new observation landing in the reservoir keeps shrinking as more observations are seen. Well, wouldn’t this bias things so that earlier observations have a higher chance of entering the reservoir and sticking around?
Not quite! Although the chance of a later observation entering the reservoir is lower, the earlier observations have had more opportunities to be replaced by later ones. The magic is that these two effects exactly cancel out, so that by the time you reach the end of the data, every item ends up in the sample with the same probability n/N (where N is the total number of items seen), and we did it in O(N) time. Brilliant!
(Okay, so maybe O(N) isn’t that great… but it’s not shabby either. Maybe I’ll write another post on online algorithm efficiency.)
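If you’d rather see that uniformity than take the algebra on faith, here’s a quick, unscientific simulation (my own addition) that reuses reservoir_sample_offline from above and counts how often each item lands in the sample:

import collections

N = 20            # total number of items in the "population"
n = 5             # reservoir (sample) size
trials = 100_000  # number of repeated sampling runs

counts = collections.Counter()
for _ in range(trials):
    counts.update(reservoir_sample_offline(list(range(N)), n))

# every item should show up in roughly n/N = 25% of the trials
for item in range(N):
    print(item, round(counts[item] / trials, 3))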
:: What about an online version?⌗
The implementation above still takes a complete list as input, so it isn’t truly online yet. Happily, the fix is simple: wrap the same logic in a class that keeps the reservoir and a running count as state, and accepts one observation at a time:
import random

class ReservoirSampler:
    def __init__(self, reservoir_size: int):
        self.reservoir_size = reservoir_size
        self.reservoir = [0 for i in range(0, reservoir_size)]
        self.i = 0

    def get_sample(self) -> list:
        return self.reservoir

    def new_observation(self, obs) -> None:
        # if the reservoir isn't full yet, fill the next slot
        if self.i < self.reservoir_size:
            self.reservoir[self.i] = obs
        else:
            # randomly select an integer between 0 and the number of vals seen (inclusive)
            j = random.randint(0, self.i)
            # if the randomly chosen value "hit" inside the reservoir
            if j < self.reservoir_size:
                # replace that entry with the new observation
                self.reservoir[j] = obs
        # increment the number of observed data points
        self.i += 1
And that’s all there is to it! We’d use this class like:
reservoir_size = 100
reservoir = ReservoirSampler(reservoir_size)
n_obs = 1000
for i in range(n_obs):
    # feed each new observation into the sampler as it arrives
    reservoir.new_observation(i)
Now, if instead of iterating over a list, a process fed new values to our Python program, or a generator yielded an unknown number of observations, our ReservoirSampler could handle them with ease, giving us an up-to-date and accurate sample at any point along the way.
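For instance, here’s a sketch of feeding the sampler from a generator of unknown length (stream_of_readings is a made-up stand-in for whatever actually produces your data):

import random

def stream_of_readings():
    # stand-in for an endless or unknown-length data source
    while True:
        yield random.gauss(50, 10)

sampler = ReservoirSampler(reservoir_size=100)
for count, reading in enumerate(stream_of_readings()):
    sampler.new_observation(reading)
    if count >= 10_000:  # cut the "infinite" stream off somewhere for the demo
        break

current_sample = sampler.get_sample()
print(len(current_sample), sum(current_sample) / len(current_sample))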
:: Next Steps: Trying it out & testing the sampling⌗
I’ve covered a bit about Reservoir Sampling and the joy of online algorithms. In a follow-up post, I plan to put this algorithm to the test on simple statistics like estimating mean and variance, and show how closely it does (or doesn’t) converge on the true population mean/variance over multiple simulated trials.