While exploring ways to estimate multiset union and intersection cardinalities, I discovered an interesting result mentioned in a paper by Shukla [1], who in turn cites Feller [2]: If r elements are chosen uniformly and at random from a set of n elements, the expected number of distinct elements obtained is

.

Given an element *a* from the set, the probability that there are none of the elements after r draws is *a* is . Thus the probability that there is at least one element is *a* (or equivalently that there is at least one unique item) is , and hence the expected number of uniques is . Alternatively, as Feller explains, the probability of a single unique at the ith step is , hence the expected number of uniques after r draws is

A geometric series that converges to .

This is an instance of the Birthday problem. Stackexchange has a couple of other more complicated but interesting solutions.

[1] A. Shukla et al, Storage estimation for multidimensional aggregates in the presence of hierarchies [paper]

[2] W. Feller, An introduction to probability theory and its applications, vol. 1, 1957

### Like this:

Like Loading...

*Related*