, ,

While exploring ways to estimate multiset union and intersection cardinalities, I discovered an interesting result mentioned in a paper by Shukla [1], who in turn cites Feller [2]: If r elements are chosen uniformly and at random from a set of n elements, the expected number of distinct elements obtained is

n - n (1 - 1/n)^r.

Given an element a from the set, the probability that there are none of the elements after r draws is a is (1 - 1/n)^r. Thus the probability that there is at least one element is a (or equivalently that there is at least one unique item) is 1 - (1 - 1/n)^r, and hence the expected number of uniques is n - n (1 - 1/n)^r. Alternatively, as Feller explains, the probability of a single unique at the ith step is ((1 - 1/n)/n)^{i-1}, hence the expected number of uniques after r draws is

 \sum_{i=1}^{r} ((1 - 1/n)/n)^{i-1}

A geometric series that converges to n - n (1 - 1/n)^r.

This is an instance of the Birthday problemStackexchange has a couple of other more complicated but interesting solutions.

[1] A. Shukla et al, Storage estimation for multidimensional aggregates in the presence of hierarchies [paper]
[2] W. Feller, An introduction to probability theory and its applications, vol. 1, 1957