
Chi-squared distribution


The chi-squared distribution with parameter $k$, denoted as $\chi_k^2$, is the gamma distribution with parameters $\frac{k}{2}$ and $\frac{1}{2}$. The parameter $k$ is called the number of degrees of freedom (df).

In particular, a random variable $X$ follows the chi-squared distribution with $k$ degrees of freedom when its distribution is $\operatorname{Gamma}(\frac{k}{2}, \frac{1}{2})$. So, its probability density function is:

$$f(x) = \begin{cases} \frac{1}{2^{k/2} \Gamma\left(\frac{k}{2}\right)} x^{\frac{k}{2} - 1} e^{-\frac{x}{2}} & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases}$$
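If you want to see the connection in code, here is a minimal sketch using scipy.stats (the library choice, the value of $k$, and the grid of points are illustrative assumptions, not part of the text): the chi-squared density with $k$ df coincides with the gamma density with shape $\frac{k}{2}$ and rate $\frac{1}{2}$ (scale $2$).

```python
import numpy as np
from scipy import stats

k = 5                                     # degrees of freedom (arbitrary choice)
x = np.linspace(0.1, 20, 200)             # grid of non-negative points

# chi-squared with k degrees of freedom ...
chi2_pdf = stats.chi2.pdf(x, df=k)

# ... is Gamma with shape k/2 and rate 1/2 (scipy uses scale = 1/rate = 2)
gamma_pdf = stats.gamma.pdf(x, a=k / 2, scale=2)

print(np.allclose(chi2_pdf, gamma_pdf))   # True: the two densities coincide
```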

That was pretty easy, don't you think? Now let's take a look at its main properties.

Basic properties

Since the $\chi_k^2$ distribution is just the $\operatorname{Gamma}(\frac{k}{2}, \frac{1}{2})$ distribution, you can discover its properties by taking advantage of what you already know about the gamma distribution. To begin with, the expectation and variance are quite easy to remember:

$$\operatorname{E}(X) = k, \qquad \operatorname{Var}(X) = 2k$$
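If you'd like to double-check these formulas, a quick sketch with scipy.stats (an assumption of this text, not part of the original) does the job:

```python
from scipy import stats

for k in (1, 2, 5, 10):
    mean, var = stats.chi2.stats(df=k, moments="mv")
    print(k, float(mean), float(var))   # prints k and 2k for each value of k
```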

It's always useful to know the graph of the density functions and see how they vary as the parameters change. Since the only parameter is the degrees of freedom, this is straightforward:

[Figure: Density of the chi-squared distribution for different degrees of freedom]

Here, you can notice some interesting points:

  • It's skewed to the right.

  • When $k=1$, it's unbounded near $0$.

  • When $k=2$, it's exponentially decreasing. In fact, the chi-squared distribution with $2$ df is an exponential distribution; can you guess its parameter?

  • When $k>2$, it's unimodal with a peak at $k-2$, as the sketch after this list verifies numerically.

  • Since $\operatorname{E}(X) = k$, it's roughly centered around $k$. As $k$ increases, the distribution becomes more spread out, because $\operatorname{Var}(X) = 2k$.
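Here is a rough numerical check of the peak location (a sketch; scipy.stats and the grid of points are arbitrary choices made for illustration):

```python
import numpy as np
from scipy import stats

x = np.linspace(0.001, 40, 400_000)      # fine grid over the positive axis

for k in (3, 5, 10, 20):
    mode = x[np.argmax(stats.chi2.pdf(x, df=k))]
    print(k, round(mode, 2))             # the peak sits (approximately) at k - 2
```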

In fact, as the degrees of freedom increase, the distribution appears increasingly symmetrical:

[Figure: The chi-squared distribution looks like the normal distribution as the degrees of freedom increase]

If you look closely, it looks a lot like the normal distribution, and this is precisely the case. Head over to the next section for the details!

Relationship with the normal distribution

This is the most important property of your new distribution and marks the beginning of its close connection with the normal distribution. When you have a random variable $Z$ with standard normal distribution, it turns out that $Z^2$ has a chi-squared distribution with $1$ degree of freedom. But this is not all, since the result extends as you add more variables.

If $Z_1, Z_2, \dots, Z_n$ are independent random variables with standard normal distribution, then:

$$Z_1^2 + Z_2^2 + \dots + Z_n^2 \ \sim \ \chi_n^2$$

This allows you to interpret a random variable $X \sim \chi_n^2$ as the sum of $n$ independent and identically distributed random variables (the squares of standard normals).
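A small simulation makes the result tangible (a sketch; the number of terms, sample size, and seed below are arbitrary choices): summing $n$ squared standard normals produces values whose empirical mean and variance match $n$ and $2n$, and a Kolmogorov–Smirnov test typically won't reject the $\chi_n^2$ fit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, reps = 7, 100_000                        # n squared normals per replicate

z = rng.standard_normal((reps, n))          # reps x n standard normal draws
x = (z ** 2).sum(axis=1)                    # each row becomes one chi-squared(n) value

print(x.mean(), x.var())                    # close to n and 2n
print(stats.kstest(x, "chi2", args=(n,)))   # typically a large p-value: consistent with chi2, n df
```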

Coincidentally, the famous central limit theorem states that the sum of many independent and identically distributed random variables is approximately normally distributed. In fact, the more variables you add, the better the approximation becomes. In future topics you will learn more about this fundamental result, but for now you've just solved the mystery of the previous section.
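You can already see this effect by comparing a chi-squared distribution with many degrees of freedom against a normal distribution with the same mean $k$ and variance $2k$ (a sketch; the particular $k$ and evaluation points are arbitrary):

```python
import numpy as np
from scipy import stats

k = 200                                     # a fairly large number of degrees of freedom
x = np.linspace(k - 60, k + 60, 7)          # a few points around the center

chi2_cdf = stats.chi2.cdf(x, df=k)
norm_cdf = stats.norm.cdf(x, loc=k, scale=np.sqrt(2 * k))

print(np.round(chi2_cdf - norm_cdf, 3))     # the differences are already small
```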

Normal sampling

Let's delve into inferential statistics. A random sample of size $n$ is a collection of $n$ independent and identically distributed (i.i.d.) random variables, denoted by $X_1, X_2, \dots, X_n$. These variables share the exact same distribution, and you make inferences about it by summarizing the sample, for example by adding its values.

The most common tools for analyzing random samples are the sample mean and the sample variance. The sample mean gives you an idea of where the sample concentrates, while the sample variance provides an insight into how spread out the variables are from the mean. They are defined as follows:

$$\bar{X} = \frac{\sum_{i=1}^n X_i}{n}, \qquad S^2 = \frac{\sum_{i=1}^n \left( X_i - \bar{X} \right)^2}{n-1}$$

Usually, it's rather complex to determine the distributions of these new variables. However, when you obtain the sample from a normal distribution with parameters $\mu$ and $\sigma^2$, you can ascertain the distribution of both. As expected, $\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$. But what about the sample variance? Here, the chi-squared distribution provides the answer!

If $X_1, X_2, \dots, X_n$ are independent random variables with a common distribution $N(\mu,\sigma^2)$, then:

$$\frac{n-1}{\sigma^2} S^2 = \frac{\sum_{i=1}^n \left( X_i - \bar{X} \right)^2}{\sigma^2} \ \sim \ \chi_{n-1}^2$$
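Again, a simulation makes the claim concrete (a sketch; $\mu$, $\sigma$, the sample size, and the seed are arbitrary choices made for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 3.0, 2.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))   # reps normal samples of size n
s2 = samples.var(axis=1, ddof=1)                  # sample variance of each sample
t = (n - 1) * s2 / sigma ** 2                     # the statistic (n-1) S^2 / sigma^2

print(t.mean(), t.var())                          # close to n-1 and 2(n-1)
print(stats.kstest(t, "chi2", args=(n - 1,)))     # consistent with chi2 with n-1 df
```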

This finding is crucial for statistical inference. Many procedures and results rely on it, making the chi-squared distribution an important tool in statistics. But how useful is it?

Applications in inferential statistics

You can find the chi-squared distribution in many areas of statistics; it's nearly as widespread as the normal distribution. One striking aspect is its universality: regardless of the data's original distribution, certain transformations of the data (known as statistics) often follow a chi-squared distribution, at least approximately, when you're dealing with large sample sizes.

This is particularly useful because it lets you replace complex original distributions with the chi-squared distribution, allowing you to compute probabilities and reason about the mean, variance, and other aspects of your sample.

The two most utilized techniques are the goodness-of-fit test and the independence test. You use the former to see if a categorical data sample aligns well with a theoretical probability distribution you believe it might follow. For instance, imagine you're investigating the probability of a person's blood type being O, A, B, or AB. If you've guessed the probability of each type, denoted as $p_O$, $p_A$, $p_B$, and $p_{AB}$, you can then gather some data and check if the observed behavior aligns with your hypothesized probabilities.
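In code, the whole test is a single call to scipy.stats.chisquare; the blood-type counts and hypothesized probabilities below are made up purely for illustration:

```python
from scipy import stats

observed = [450, 390, 110, 50]                 # made-up counts of types O, A, B, AB
probs = [0.44, 0.42, 0.10, 0.04]               # hypothesized p_O, p_A, p_B, p_AB
expected = [p * sum(observed) for p in probs]  # expected counts under the hypothesis

stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)    # a small p-value would suggest the guessed probabilities don't fit
```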

With the independence test, you're trying to spot a correlation between two categorical variables. Suppose you're examining different medicine options for a disease to see if they're more effective in certain age groups. After gathering some data, you use the chi-squared distribution to determine if this relationship actually exists, or if the effectiveness of the medications isn't dependent on age.
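Similarly, the independence test takes a contingency table of counts and hands it to scipy.stats.chi2_contingency. The sketch below simplifies the scenario to two categorical variables, outcome versus age group, and the counts are made up for illustration:

```python
import numpy as np
from scipy import stats

# Made-up counts: rows = outcome (improved, not improved), columns = age groups
table = np.array([[40, 55, 30],
                  [20, 25, 40]])

stat, p_value, dof, expected = stats.chi2_contingency(table)
print(stat, p_value, dof)   # a small p-value would suggest effectiveness depends on age
```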

Conclusion

  • The chi-squared distribution with parameter $k$ is denoted as $\chi_k^2$. Its parameter is known as the degrees of freedom.

  • This distribution is a particular case of the gamma distribution. In other words, a random variable $X$ has a chi-squared distribution with $k$ degrees of freedom when its distribution is $\operatorname{Gamma}(\frac{k}{2}, \frac{1}{2})$.

  • When $X \sim \chi_k^2$, the pdf is given by

    $$f(x) = \frac{1}{2^{k/2} \Gamma\left(\frac{k}{2}\right)} x^{\frac{k}{2} - 1} e^{-\frac{x}{2}}, \quad x \geq 0$$
  • If $X \sim \chi_k^2$, then $\operatorname{E}(X) = k$ and $\operatorname{Var}(X) = 2k$.

  • If $Z_1, Z_2, \dots, Z_n$ are independent random variables with standard normal distribution, then

    $$Z_1^2 + Z_2^2 + \dots + Z_n^2 \ \sim \ \chi_n^2$$
  • If $X_1, X_2, \dots, X_n$ are independent random variables with distribution $N(\mu,\sigma^2)$, then

    $$\frac{\sum\limits_{i=1}^n \left( X_i - \bar{X} \right)^2}{\sigma^2} \ \sim \ \chi_{n-1}^2$$