A smart and talented friend, who’s a fellow data scientist by profession, came across a probability puzzle that’s supposedly given in some job interviews for data science roles. The problem asks the following:

Assume the distribution of children per family is such that the probability of having 0 children is 0.30, 1 child is 0.25, 2 children is 0.20, 3 children is 0.15, 4 children is 0.10, and 5 or more children is 0. Consider a random girl in the population of children. What is the probability that she has a sister?

Checking her work, my friend disagreed with the answers she found online, and wanted to know how I’d solve the problem. Since my solution seems to be different than others I’ve found so far, I thought I’d share it with the whole Internet, which will certainly tell me whether I’m wrong.

Simulating the problem

I’ll start with a simulation—I’m a bit more confident in my programming than my probability theory (I’ve been doing it longer), and it makes some of the tricky parts of the problem explicit.

First, we’ll simulate a random sample of families, since the problem gives us probabilities in terms of children per family. After generating a pool of random family sizes, we can use the binomial distribution to generate a random number of girls for each family (naively assuming that the probability of a child being a girl is 0.5).

set.seed(232636039)
num_families <- 100000

family_size <- seq(0, 5)
p_family_size <- c(0.3, 0.25, 0.20, 0.15, 0.1, 0.0)

girls_per_family <- sample(family_size, num_families, replace = TRUE,
                           prob = p_family_size) |>
  (\(n) rbinom(num_families, size = n, prob = 0.5))()

If a family has more than two girls, these girls have sisters. The probability that a randomly selected girl has a sister is approximated by the number of girls with sisters divided by the total number of girls in our simulation.

# If there is more than one girl in a family, these girls have sisters
girls_with_sisters <- sum(girls_per_family[girls_per_family > 1])
girls_total <- sum(girls_per_family)

(p_est <- girls_with_sisters / girls_total)
## [1] 0.5878486

The code above estimates that the probability that a girl (sampled at random from the population of children) has a sister is approximately 0.59.

Plotting the simulation

Just for fun, let’s plot the estimate as it develops over the number of simulated families. This also helps us get a sense of how trustworthy the estimate is—has it converged to a stable value, or is it still swinging around? For this, I’ll finally use parts of the Tidyverse.

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
sister_dat <- tibble(girls_per_family) |>
  mutate(families = seq(n()),
         has_sister = girls_per_family > 1,
         total_girls = cumsum(girls_per_family),
         girls_with_sisters = cumsum(girls_per_family * has_sister),
         p_est = girls_with_sisters / total_girls)
sister_dat |>
  # We'll skip any beginning rows that were generated before we sampled girls,
  # since 0/0 is undefined.
  filter(total_girls > 0) |>
  ggplot(aes(families, p_est)) +
  geom_step() +
  geom_hline(yintercept = 71/120, linetype = "dashed") +
  scale_x_log10() +
  theme_minimal() +
  labs(x = "Number of random families", y = "Observed proportion of girls with sisters")

Analytic solution

If you’re wondering how I chose a value for the dashed line, here’s an analytic solution to the puzzle.

We’ll let S represent the event that a girl has a sister, Fₙ represent the event that a family has n children, and Cₙ represent the event that a girl comes from a family of n children (i.e., the girl has n - 1 siblings). By the law of total probability:

$$ \mathbb{P}(S) = \sum_{n} \mathbb{P}(S|C_n) \mathbb{P}(C_n) $$

The conditional probability that a girl from a family with n children has a sister is the compliment of the probability that all of her siblings are brothers. Therefore:

$$ \mathbb{P}(S|C_n) = 1 - \frac{1}{2^{n-1}} $$

For the next component, it’s important to call out the fact that the problem text gives us P(Fₙ) (the probability that a family has n children), which is not the same as P(Cₙ) (the probability that a child comes from a family with n children). I suspect that this trips some people up, and explains some of the answers that are different than mine.

To illustrate the difference, let’s imagine that the probabilities were constructed from a population of 100 families. Twenty five of these families have one child, while only ten have four children. However, this arrangement results in 25 children with no siblings, while 40 children would have 3 each. In other words, a randomly selected family is more likely to have one child than four children, but a randomly selected child is more likely to come from a family with four children than a family with one child. Formalizing this reasoning a bit, we find:

$$ \mathbb{P}(C_n) = \frac{n \mathbb{P}(F_n)}{\sum_{m} m \mathbb{P}(F_m)} $$

With our terms defined, we’ll use R once more to calculate P(S):

n <- family_size
fn <- p_family_size
s_given_cn <- 1 - 1/(2^(n-1))
cn <- n * fn / sum(n * fn)
(s <- sum(s_given_cn * cn))
## [1] 0.5916667

This value is equal to 71/120, and is quite close to what we found through simulation—but only reliably so after generating around 10,000 families.

Concluding note

I’m ambivalent about the practice of asking job applicants to solve probability puzzles during interviews, but it’s true that sometimes our intuitions can lead us astray (e.g, confusing P(Fₙ) and P(Cₙ)), and this does come up in real-life data analyses. When I find myself dealing with such situations, I’m flexible in the approach I take: sometimes I try to find an exact solution using probability theory first, but often I program a quick simulation instead. Most of the time, I wind up doing both, because these approaches complement one another: simulations are great because they force you to be explicit about the assumptions that you’re making about the problem, while analytic solutions give exact answers and are insensitive to randomness.