While solving 12th day challenge of recent Advent of Code I ran into following subtask required for the solution:
You need to uniformly generate random array of bits B = [b0, b1, …, bn-1], such that there is exactly k ones
For example, for n = 4, k = 2 there are 6 valid arrays configurations:
- B = 1100
- B = 1010
- B = 1001
- B = 0110
- B = 0101
- B = 0011
It’s not a direct subtask and couple reductions required before getting into this problem statement — but this is not so important (and anyway I chose very weird approach to use randomized algorithm with O(1) additional space just for fun).
So, how can we uniformly generate random bit string of length n with exactly k ones fast using only constant amount of memory?
The simplest option is to just take valid array configuration and apply random fair shuffle algorithm to it.
fn generate_non_succinct(rng: &mut SmallRng, n: usize, k: usize) -> Vec<i32> {let mut array: Vec<i32> = repeat(1).take(k).chain(repeat(0).take(n - k)).collect();array.shuffle(rng);return array;}
This is perfect approach which should be used in any real-life problem as it simple, concise, robust and performant enough. But unfortunately, this solution requires O(n) additional memory for generating routine – which is not what we wanted to accomplish.
Actually, fast succinct solution is pretty easy and straightforward – we can just maintain amount of generated ones s on the prefix of length i and put next one with probability (k-s)/(n-i). The code for this procedure is very simple (and also cool, thanks to the scan stateful method in std::iter):
fn generate_succinct<'a>(rng: &'a mut SmallRng, n: usize, k: usize) -> impl Iterator<Item=i32> + 'a {return (0..n).scan(0, move |s, i| {let outcome = if rng.gen_range(0..n-i) < k - *s { 1 } else { 0 };*s += outcome;Some(outcome as i32)});}
It’s not so straightforward to prove that every sequence has same probability equals to 1/C(n, k) where C(n,k) is n!/k!/(n-k)!.
First, we need to show that generate_succinct function can generate every possible array with k ones and no other output can be generated with this function. Indeed, we can’t generate sequences with > k ones as we will have 0% probability of generating 1 when we reach exactly k ones in a prefix (k - *s == 0). Also, we can’t generate sequences with < k ones as at some point we will inevitably have 100% probability of generating 1 (n - i == k - *s).
Last move – we need to prove that every possible outcome will have same probability. We are making exactly n choices with probability of (k-s)/(n-i) each. If we multiply all denominators independently we will immediately get n!. Considering nominator of all positive choices (generating 1) independently we will get k!. And finally – nominators for all negative choices (generating 0) will get us (n-k)!.
In the AoC solution I implemented another approach for generating sequence succinctly. Due to the task specific I was allowed to generated bad sequences given that they can be easily filtered out without any additional memory. Considering this, I chose to generate random binary sequence with skewed one probability of k/n. This way we will get correct sequence with probability C(n,k)*(k/n)k*((n-k)/n)n-k. If we are interested in asymptotic approximation we can use Stirling formula and get following probability: √n / √(2π k(n-k)). We should be careful with applying this formula to edge cases with very small / very large k values as approximation for binomial coefficient will work only if k = Ω(1) and n - k = Ω(1). Although from empiric results it seems like this approximate gives pretty good results:
>>> import math>>> probs = [(c(n, k) * k**k * (n - k)**(n - k) / (n**n), math.sqrt(n / (2 * math.pi * k * (n - k))), n, k)for n in range(1, 1024)for k in range(1, n)]>>> max([(approx / actual, n, k) for (actual, approx, n, k) in probs])(1.1283791670955126, 2, 1)>>> min([(approx / actual, n, k) for (actual, approx, n, k) in probs])(1.0002444094121852, 1023, 511)
We can see that for all possible parameters with n<1024 probability approximation leads to not more than ~13% greater values. So, we can use this to estimate asymptotic of attempts required for good sequence generation. Given that good sequence generated with probability p it is well known fact (see geometric distribution) that average amount of attempts will be equal to 1/p which is √(2π k(n-k)) / √n = O(√k √(n-k) / √n) which is O(√n) in worst case when k = n/2.