In the topic Probabilities as frequencies, we discussed that events describe the outcome of an experiment in a yes/no format. This is not always convenient: we often want to know the quantitative characteristics of the outcome of a random experiment, for example, the weight of a watermelon randomly taken from a shelf or the time it takes us to get to the airport. In probability theory, the characteristics of outcomes that are expressed by numbers are called random variables. In this topic, we will learn about discrete random variables and how to describe them using the NumPy and Matplotlib libraries.
Random variables
If an event is a function that takes an outcome as input and outputs a boolean value, then a random variable is a function that takes an outcome as input and returns a number. For instance, in an experiment where you flip a coin until you get two heads, you can compute how many times the coin was flipped and how many tails showed up:
import random

def coin_flip():
    return random.choice(['H', 'T'])

def flip_until_second_head():
    total_heads = 0
    outcome = []
    while total_heads < 2:
        outcome.append(coin_flip())
        if outcome[-1] == 'H':
            total_heads += 1
    return outcome

def number_of_flips(outcome):
    return len(outcome)

def number_of_tails(outcome):
    return sum(flip == 'T' for flip in outcome)

outcome = flip_until_second_head()
print('Outcome:', outcome)
print('Number of flips:', number_of_flips(outcome))
print('Number of tails:', number_of_tails(outcome))
Outcome: ['H', 'T', 'T', 'T', 'H']
Number of flips: 5
Number of tails: 3
Using random variables defined for the same experiment, you can create new random variables. For example, to calculate the number of heads, you can just subtract number_of_tails from number_of_flips.
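As a sketch of this idea, a derived random variable is simply another function of the same outcome, built from the ones we already have (the helper `number_of_heads` is our own addition, not part of the original experiment):

```python
import random

def coin_flip():
    return random.choice(['H', 'T'])

def flip_until_second_head():
    total_heads = 0
    outcome = []
    while total_heads < 2:
        outcome.append(coin_flip())
        if outcome[-1] == 'H':
            total_heads += 1
    return outcome

def number_of_flips(outcome):
    return len(outcome)

def number_of_tails(outcome):
    return sum(flip == 'T' for flip in outcome)

def number_of_heads(outcome):
    # A new random variable defined through two existing ones.
    return number_of_flips(outcome) - number_of_tails(outcome)

# The experiment stops at the second head, so this always prints 2.
print(number_of_heads(flip_until_second_head()))
```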
Just like events, it's useful to study random variables in the context of not just one trial, but many trials. The value of the random variable in one trial is called a realization of this variable. Let's generate 30 realizations of the number_of_flips variable:
NUMBER_OF_TRIALS = 30

for i in range(NUMBER_OF_TRIALS):
    print(number_of_flips(flip_until_second_head()), end=' ')
3 4 2 4 3 4 2 5 4 3 4 2 6 3 3 3 3 6 3 5 2 5 8 9 4 4 4 3 3 3
How do you analyze such a bunch of numbers? For an event, you calculated the fraction of outcomes in which it occurred. That was enough, as an event can only have two "values": it occurred or it didn't occur. A random variable can take many different values, so it seems reasonable to calculate such fractions for each value. This understanding brings us to the concept of a discrete random variable.
Discrete Random Variables
A random variable is called discrete if it takes a finite or countable set of values, each with a positive probability. You can represent a discrete variable with a table like this:

Value        x₁   x₂   ...
Probability  p₁   p₂   ...

where x₁, x₂, ... are its possible values (the list may be infinite), and p₁, p₂, ... are their probabilities. A function that accepts a value and returns its probability is known as the probability mass function of the random variable.
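As a small illustrative example (a fair six-sided die, not part of the coin experiment above), a probability mass function can be written directly as a Python function:

```python
def die_pmf(value):
    # Probability mass function of a fair six-sided die:
    # each value from 1 to 6 has probability 1/6, everything else 0.
    if value in (1, 2, 3, 4, 5, 6):
        return 1 / 6
    return 0

print(die_pmf(3))  # 0.16666666666666666
print(die_pmf(7))  # 0
```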
The probability mass function uniquely describes a discrete random variable — you cannot distinguish between two random variables with the same mass function if you only analyze their realizations.
Like probabilities, the precise mass function can't be determined from finite data. However, you can estimate it by counting the unique values and their fractions. This estimate is known as an empirical probability mass function. In Python, you typically do this with the numpy.unique(a, return_counts=True) function, where a is a list of realizations. This function generates two numpy arrays: one containing unique values and the other containing their counts. Dividing the counts by their total gives us the corresponding fractions. Let's try to estimate the mass function for number_of_flips using 10,000 trials:
import numpy as np

NUMBER_OF_TRIALS = 10000

realizations = []
for i in range(NUMBER_OF_TRIALS):
    realizations.append(number_of_flips(flip_until_second_head()))

unique_values, counts = np.unique(realizations, return_counts=True)
fractions = counts / counts.sum()

for value, fraction in zip(unique_values, fractions):
    print(f'{value}: {fraction}')
2: 0.2487
3: 0.2447
4: 0.189
5: 0.1294
6: 0.0789
7: 0.0455
8: 0.0271
9: 0.0151
10: 0.0109
11: 0.0059
12: 0.0023
13: 0.0014
14: 0.0008
15: 0.0001
16: 0.0001
17: 0.0001
With more trials, the calculated fractions get closer to the probabilities, and our empirical mass function becomes more like the true mass function.
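For this particular experiment, the exact probabilities can be worked out: the experiment takes n flips when the n-th flip is the second head and exactly one of the first n − 1 flips is a head, which gives P(n) = (n − 1) / 2ⁿ. A quick sketch compares these values with the estimates above:

```python
def exact_pmf(n):
    # P(number of flips = n): the n-th flip is the second head,
    # and exactly one of the first n - 1 flips is a head.
    return (n - 1) / 2 ** n if n >= 2 else 0

# 0.25, 0.25, 0.1875, ... — close to the estimated fractions above.
for n in range(2, 8):
    print(n, exact_pmf(n))
```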
If the probability of a value is negligible, it may never appear even in a large number of trials. This implies that you can't determine which values the random variable takes or doesn't take from a finite set of trials.
Visualizing random variables
The empirical probability mass function conveys information about realizations in a compact way. In the example above, instead of 10,000 realizations, we can store just 16 pairs of numbers and use them to assess probabilities of complex events associated with the random variable's behavior. For that, we only need to choose the values of the random variable where the event occurs and then sum up the corresponding fractions. For instance, let's estimate the chance that the number_of_flips will be even:
def is_even(value):
    return value % 2 == 0

estimate = 0
for value, fraction in zip(unique_values, fractions):
    if is_even(value):
        estimate += fraction

print(estimate)
0.5578
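With NumPy, the same sum can be written without a loop by masking the fractions array. Here is a sketch using a small made-up empirical mass function (the arrays below are illustrative, not the ones computed above):

```python
import numpy as np

# A small made-up empirical mass function for illustration.
unique_values = np.array([2, 3, 4, 5, 6])
fractions = np.array([0.25, 0.25, 0.19, 0.13, 0.18])

# The boolean mask selects the even values; summing their
# fractions estimates the probability of the "even" event.
estimate = fractions[unique_values % 2 == 0].sum()
print(estimate)
```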
A bar chart can help to understand a random variable's behavior better:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 4))
plt.bar(unique_values, fractions)
plt.show()
It's clear that the most likely values range from 2 to 6, after which the probabilities decrease rapidly. The bars around 16 are barely visible because those probabilities are so small!
As useful as a bar chart is, it has some limitations. Let's say we are conducting an experiment. On any given day, site A is visited by random.randrange(1000, 10001) users, while site B gets random.randrange(3000, 12001) users. We are interested in the difference between the second and first numbers. Let's perform 10,000 trials and draw the corresponding bar chart:
def visits():
    return random.randrange(1000, 10001), random.randrange(3000, 12001)

def difference(outcome):
    first, second = outcome
    return second - first

NUMBER_OF_TRIALS = 10000

realizations = []
for i in range(NUMBER_OF_TRIALS):
    realizations.append(difference(visits()))

unique_values, counts = np.unique(realizations, return_counts=True)
fractions = counts / counts.sum()

plt.figure(figsize=(10, 4))
plt.bar(unique_values, fractions)
plt.show()
Uh... The result doesn't seem very informative.
When the number of possible values is large, and their probabilities are small, almost all the realizations are unique and occur only once. That's why many bars are as high as 1 / NUMBER_OF_TRIALS, and the bar chart ends up looking cluttered.
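A quick check makes this concrete: counting the distinct realizations of the same experiment shows that they are nearly as numerous as the trials themselves.

```python
import random
import numpy as np

def visits():
    return random.randrange(1000, 10001), random.randrange(3000, 12001)

def difference(outcome):
    first, second = outcome
    return second - first

NUMBER_OF_TRIALS = 10000
realizations = [difference(visits()) for _ in range(NUMBER_OF_TRIALS)]

unique_values = np.unique(realizations)
# The difference can take any of 18001 values (from -7000 to 11000),
# far more than 10000 trials, so most realizations occur only once.
print(len(unique_values), 'unique values out of', NUMBER_OF_TRIALS, 'trials')
```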
To handle such random variables effectively, you can use a histogram. However, that is already too much information for one topic, so it will have to wait until the next topic.
Conclusion
Too much information? Here is a short recap:

- Unlike yes/no events, random variables describe the outcome of an experiment with numbers.
- Python, together with NumPy and Matplotlib, helps describe and visualize random variables.
- A discrete random variable takes each of its values with a positive probability; these probabilities are given by its probability mass function.
- An empirical mass function estimates these probabilities from the fractions of values observed in trials.
- Bar charts visualize a discrete random variable's empirical mass function.
- When a variable has many possible values with small probabilities, bar charts become cluttered, and techniques like histograms are needed.