Not everything in our world is discrete; whether that's lucky or unlucky is for you to decide. What's certain is this situation calls for another kind of random variable: continuous random variables. But before we dive into that, let's get familiar with histograms, as we promised in the previous topic. So let's get started!
Histogram
To describe sets of numbers in which all or nearly all values are different, it's convenient to use a histogram. For this, you need to:
Divide the interval that contains all the realizations into several intervals, which are called bins.
Calculate the fraction of realizations in each bin.
For each bin, draw a bar, with this bin as the base, and the height equal to the fraction of realizations divided by the width of the bin.
The division in the third step is needed so that the area of each bar approximately equals the probability that the random variable takes a value from that bin. Since the total probability equals 1, the total area of the histogram also equals 1.
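The three steps above can be sketched directly with NumPy (a minimal illustration on a made-up sample; np.histogram does the same bookkeeping that plt.hist performs internally):

```python
import numpy as np

# A made-up sample; in this topic, `realizations` comes from repeating an experiment.
rng = np.random.default_rng(0)
realizations = rng.normal(size=10_000)

k = 20                                      # number of bins
lo, hi = realizations.min(), realizations.max()
edges = np.linspace(lo, hi, k + 1)          # step 1: split the range into k bins
counts, _ = np.histogram(realizations, bins=edges)
fractions = counts / len(realizations)      # step 2: fraction of realizations per bin
width = (hi - lo) / k
heights = fractions / width                 # step 3: height = fraction / bin width

# The area of each bar equals its fraction, so the total area is 1.
print(heights.sum() * width)
```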
In Python, histograms can be drawn very quickly using the plt.hist(realizations, bins=k, density=True) function, where bins is the number of bins, and density=True ensures that the area of each bar equals the fraction of realizations in the corresponding bin. Let's draw a histogram for 10,000 realizations of difference using 100 bins:
plt.figure(figsize=(10, 4))
plt.hist(realizations, bins=100, density=True)
plt.show()

It looks much better than the brush-like image from the previous section! Using a histogram, you can estimate the probability that a random variable will fall within a specific interval: evaluate the area of the histogram bounded by this interval. The higher the histogram over an interval, the greater the probability of falling into it. In our example, the chance that the difference falls into one interval can be much smaller than the chance of falling into another interval of the same width, because the histogram is much lower over the first one.
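The "area over an interval" estimate can be sketched as follows. This is a hypothetical stand-in sample (difference is defined in the previous topic), and interval_probability is a helper name introduced here for illustration:

```python
import numpy as np

# Stand-in sample; in the text, `realizations` comes from the `difference` experiment.
rng = np.random.default_rng(1)
realizations = rng.normal(size=100_000)

heights, edges = np.histogram(realizations, bins=100, density=True)

def interval_probability(a, b):
    """Approximate P(a <= X <= b) by the area of the histogram over [a, b]."""
    # Clip each bar to the interval and sum height * overlapping width.
    left = np.clip(edges[:-1], a, b)
    right = np.clip(edges[1:], a, b)
    return np.sum(heights * (right - left))

estimate = interval_probability(-1, 1)
exact_fraction = np.mean((realizations >= -1) & (realizations <= 1))
print(estimate, exact_fraction)  # the two values are close
```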
Several bars of the histogram above are very different from their neighbors. A possible reason for this may be that the number of realizations is not large enough compared to the number of bins. Let's increase the number of realizations to 1,000,000!
NUMBER_OF_TRIALS = 1_000_000
realizations = []
for i in range(NUMBER_OF_TRIALS):
    realizations.append(difference(visits()))
plt.figure(figsize=(10, 4))
plt.hist(realizations, bins=100, density=True)
plt.show()

As we can see, the histogram has become smoother, but it still consists of small steps. Can we make the width of the bins so small and the number of trials so high that the steps become indistinguishable? For the random variable difference, we can't: this variable takes integer values, so as soon as the bins become narrower than 1, some bins will contain no realizations, and we will again see the brush shape.
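The returning brush shape is easy to reproduce with any integer-valued variable; here is a sketch using the sum of two dice as a stand-in for difference:

```python
import random
import numpy as np

# Stand-in integer-valued variable: the sum of two dice takes only the values 2..12.
random.seed(0)
realizations = [random.randint(1, 6) + random.randint(1, 6) for _ in range(100_000)]

# With 50 bins over a range of width 10, each bin is 0.2 wide — narrower than 1,
# so most bins cannot contain any integer value at all.
counts, edges = np.histogram(realizations, bins=50)
print(np.sum(counts == 0))  # many empty bins: the brush shape returns
```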
However, there are random variables for which the steps can be made indistinguishable by reducing the width of the bins to zero.
Continuous random variables
A random variable is called continuous if it takes each of its values with zero probability, yet falls within any arbitrarily small neighborhood of that value with positive probability. The smaller the neighborhood, the smaller the probability. Continuous random variables are a very cool mathematical model, but they are quite complex, so we won't define them formally. Typically, this model is used if, even in a very large set of realizations, all or almost all values are different.
As an example, consider an experiment where we generate a number from 0 to 1 as follows: we generate 15 digits from 0 to 9 and interpret them as the decimal places of our number:
def number_from_0_to_1():
    digits = random.choices('0123456789', k=15)
    result = '0.' + ''.join(digits)
    return float(result)
print(number_from_0_to_1())

0.958417331073877

Although the number of possible values is finite and the probability of each value is positive, these probabilities are so small that even among 1,000,000 realizations, all values will almost certainly be different:
realizations = [number_from_0_to_1() for i in range(1_000_000)]
unique_values = np.unique(realizations)
print(len(unique_values))

1000000

Let's draw a histogram for our 1,000,000 realizations using 100 bins:
plt.figure(figsize=(10, 4))
plt.hist(realizations, bins=100, density=True)
plt.show()The histogram has roughly the same height from 0 to 1. This means that the probability of falling into a particular interval does not depend on where it is located, only on its width.
To make number_from_0_to_1 a "true" continuous random variable, we would have to generate an infinite number of digits instead of 15. Of course, you can't generate or store even one such number, so in real life we do not use true continuous random variables. However, it is often more convenient to analyze continuous random variables and then treat the float numbers we actually use as close enough approximations.
If you draw a histogram for an infinite number of realizations of a continuous random variable and start decreasing the width of the bins, the histogram will converge to a figure bounded by some curve. The function whose graph is this curve is called the density of the random variable. Since the area of the histogram always equals 1, the area under the density is also 1.
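As a quick numeric sanity check that the area under a density equals 1, we can integrate one by hand. A sketch using the density e^(−x) of the exponential variable discussed below:

```python
import numpy as np

# Density of an exponential variable with λ = 1: e^(−x) for x >= 0, zero otherwise.
x = np.linspace(0, 50, 1_000_000)
density = np.exp(-x)

# A left Riemann sum approximates the area under the density.
area = np.sum(density[:-1] * np.diff(x))
print(area)  # close to 1; the tail beyond x = 50 is negligible
```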
Like the probability mass function, the density uniquely describes a continuous random variable — you can't distinguish two random variables that have the same density by analyzing only their realizations.
For example, if we performed this procedure for a "true" continuous random variable derived from number_from_0_to_1, the limit histogram would be a rectangle of height 1 over the interval from 0 to 1.
Another example: exponential random variable
Suppose we selected n random points on a segment of length L. We can think of this as n cars parked along a street of L meters, or as the arrival times of n visitors over L minutes. If n is large enough, then the distances between neighboring points of the segment will be random variables that are distributed almost like an exponentially distributed, or simply exponential, random variable with the parameter λ = n/L. The parameter λ must be positive and can be interpreted as a frequency: how many cars are parked on average along one meter of the street, how many visitors arrive on average within one minute, etc.
Exponential random variables are easy to generate using the random.expovariate(lambd) function:

print(random.expovariate(0.1))

4.035102778884848

The density of the exponentially distributed variable with parameter λ is, curiously enough, exponential: it equals λe^(−λx) for positive x and zero otherwise.
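Since λ is a frequency, the average gap between events is 1/λ; a quick sanity check with random.expovariate:

```python
import random

random.seed(0)
lambd = 0.1  # on average, one event per 10 units
samples = [random.expovariate(lambd) for _ in range(100_000)]

mean = sum(samples) / len(samples)
print(mean)  # close to 1 / lambd = 10
```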
Let's make sure that the distance between uniformly distributed points really does look like an exponential random variable. To do this, we will generate 10,000 points on the interval from 0 to 10,000, draw a histogram of the distances between neighboring points, and compare it with the density of the exponential random variable with λ = 1.
points = [random.uniform(0, 10000) for i in range(10000)]
points.sort()
lengths = []
for curr, nxt in zip(points, points[1:]):
    lengths.append(nxt - curr)
plt.figure(figsize=(10, 4))
plt.hist(lengths, bins=100, density=True)
x = np.linspace(0, 12, 1000)
plt.plot(x, np.exp(-x), linewidth=2)
plt.show()

Conclusion
A random variable is a function that takes as input the outcome of an experiment and returns a number. Two common types of random variables are discrete and continuous random variables.
Continuous random variables take each of their values with zero probability, but in any neighborhood of this value they occur with positive probability. It is convenient to analyze them using histograms. As the number of realizations tends to infinity and the width of the histogram bars tends to zero, the histogram converges to the density of the continuous random variable. Density is an analogue of the probability mass function and, in the same way, uniquely determines a random variable.
In this topic we looked at two examples of continuous distributions: uniform and exponential.
The probability that a variable uniform on [a, b] takes a value within an interval does not depend on the position of that interval, only on its length. The density of this variable equals 1/(b − a) inside [a, b] and zero otherwise.
If we consider a large number of random variables uniformly distributed on the same interval, then the distances between neighboring values will have an approximately exponential distribution. The density of the exponential distribution depends on the frequency λ and equals λe^(−λx) for positive x and zero otherwise.