naming is hard

Generate random bit string with k ones, succinct!

While solving 12th day challenge of recent Advent of Code I ran into following subtask required for the solution:

You need to uniformly generate random array of bits B = [b0, b1, …, bn-1], such that there is exactly k ones

For example, for n = 4, k = 2 there are 6 valid arrays configurations:

  1. B = 1100
  2. B = 1010
  3. B = 1001
  4. B = 0110
  5. B = 0101
  6. B = 0011

It’s not a direct subtask and couple reductions required before getting into this problem statement — but this is not so important (and anyway I chose very weird approach to use randomized algorithm with O(1) additional space just for fun).

So, how can we uniformly generate random bit string of length n with exactly k ones fast using only constant amount of memory?

Simple approach

The simplest option is to just take valid array configuration and apply random fair shuffle algorithm to it.

fn generate_non_succinct(rng: &mut SmallRng, n: usize, k: usize) -> Vec<i32> {
    let mut array: Vec<i32> = repeat(1).take(k).chain(repeat(0).take(n - k)).collect();
    array.shuffle(rng);
    return array;
}

This is perfect approach which should be used in any real-life problem as it simple, concise, robust and performant enough. But unfortunately, this solution requires O(n) additional memory for generating routine – which is not what we wanted to accomplish.

Fast solution

Actually, fast succinct solution is pretty easy and straightforward – we can just maintain amount of generated ones s on the prefix of length i and put next one with probability (k-s)/(n-i). The code for this procedure is very simple (and also cool, thanks to the scan stateful method in std::iter):

fn generate_succinct<'a>(rng: &'a mut SmallRng, n: usize, k: usize) -> impl Iterator<Item=i32> + 'a {
    return (0..n).scan(0, move |s, i| {
        let outcome = if rng.gen_range(0..n-i) < k - *s { 1 } else { 0 };
        *s += outcome;
        Some(outcome as i32)
    });
}

It’s not so straightforward to prove that every sequence has same probability equals to 1/C(n, k) where C(n,k) is n!/k!/(n-k)!.

First, we need to show that generate_succinct function can generate every possible array with k ones and no other output can be generated with this function. Indeed, we can’t generate sequences with > k ones as we will have 0% probability of generating 1 when we reach exactly k ones in a prefix (k - *s == 0). Also, we can’t generate sequences with < k ones as at some point we will inevitably have 100% probability of generating 1 (n - i == k - *s).

Last move – we need to prove that every possible outcome will have same probability. We are making exactly n choices with probability of (k-s)/(n-i) each. If we multiply all denominators independently we will immediately get n!. Considering nominator of all positive choices (generating 1) independently we will get k!. And finally – nominators for all negative choices (generating 0) will get us (n-k)!.

Weird solution

In the AoC solution I implemented another approach for generating sequence succinctly. Due to the task specific I was allowed to generated bad sequences given that they can be easily filtered out without any additional memory. Considering this, I chose to generate random binary sequence with skewed one probability of k/n. This way we will get correct sequence with probability C(n,k)*(k/n)k*((n-k)/n)n-k. If we are interested in asymptotic approximation we can use Stirling formula and get following probability: √n / √(2π k(n-k)). We should be careful with applying this formula to edge cases with very small / very large k values as approximation for binomial coefficient will work only if k = Ω(1) and n - k = Ω(1). Although from empiric results it seems like this approximate gives pretty good results:

>>> import math
>>> probs = [
    (c(n, k) * k**k * (n - k)**(n - k) / (n**n), math.sqrt(n / (2 * math.pi * k * (n - k))), n, k) 
    for n in range(1, 1024) 
    for k in range(1, n)
]
>>> max([(approx / actual, n, k) for (actual, approx, n, k) in probs])
(1.1283791670955126, 2, 1)
>>> min([(approx / actual, n, k) for (actual, approx, n, k) in probs])
(1.0002444094121852, 1023, 511)

We can see that for all possible parameters with n<1024 probability approximation leads to not more than ~13% greater values. So, we can use this to estimate asymptotic of attempts required for good sequence generation. Given that good sequence generated with probability p it is well known fact (see geometric distribution) that average amount of attempts will be equal to 1/p which is √(2π k(n-k)) / √n = O(√k √(n-k) / √n) which is O(√n) in worst case when k = n/2.