This topic is devoted to statistics and probability theory. You've probably heard that analysts use statistics to find out people's opinions and use probability theory to predict whether a stock will fall. Both sciences help us study the world. But the most important thing to understand is that they work in opposite directions: each one solves the inverse of the other's problem.
In probability theory, the random variables and the random experiment are assumed to be known from the start. In practice, however, an experiment often gives us only a set of observed results whose underlying properties are unknown. Drawing conclusions about the phenomenon's properties from those results is the job of statistics.
To better understand this, consider the following example. Probability theory is the situation where you have not watched Game of Thrones yet, but you know that it is a good series and you wonder what reviews users will leave. Statistics is the situation where you look at the reviews and try to understand whether the series they describe is similar to Game of Thrones.
Still not clear? Let's consider the following picture:
Let's recall the main points of the theory of probability, and then talk about statistics.
Probability theory
Probability theory is the science that studies probabilities. Natural sciences can do a lot, but some things simply can't be described with the precision we would need. For example, a classic problem in probability theory is tossing a coin. In principle, this problem can be solved completely by calculating the force and torque at the moment of the throw, the effect of air viscosity, and the center of mass of the coin. Knowing all the input parameters, it is possible to calculate whether the coin will land heads or tails.
But that approach is rather cumbersome: a whole laboratory of scientists would be needed to fully describe the process. Difficult, isn't it? Yet by tossing the coin a few times and observing, we can already say something about it. The process is then described not by the laws of physics but by a so-called random experiment. A coin toss has two outcomes, heads or tails; the set of all possible outcomes is called the sample space of the random experiment.
It is important to note that landing heads or tails is now no longer a definite result, as in physics, but a random one. Recall that events can be random, certain, or impossible. Probability is a function that describes events in these terms: it assigns each event a value from 0 to 1 that characterizes how likely the event is. An impossible event has probability 0 and a certain event has probability 1; a random event, which may or may not occur, usually has a probability somewhere in between.
Let's go back to the Game of Thrones example. A classic probability problem asks for the probability of a specific event, for example the probability that a set of comments will have a certain content. Suppose you know that the distribution of reviews on the forum is the following: 0.4 negative and 0.6 positive. Someone has already figured it out for you. Now think about the probability that the first three comments are all positive.
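A minimal sketch of that calculation is shown here. It relies on an assumption the text does not state explicitly: that comments are independent of each other, so their probabilities can be multiplied. The variable names are illustrative only.

```python
# Probability that a single comment is positive (taken from the example above).
p_positive = 0.6

# Number of comments we look at.
n_comments = 3

# Assuming the comments are independent, the probability that all of them
# are positive is the product of the individual probabilities.
p_all_positive = p_positive ** n_comments

print(round(p_all_positive, 3))  # 0.216
```

So under the independence assumption, roughly one time in five the first three comments would all be positive.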
Statistics
Probability theory gives us the tools to understand the distributions of random events, and this is the foundation of mathematical statistics. Statistical problems run in the opposite direction: a sample collected from the results of a random experiment is known (in statistics and machine learning, "data" is the more commonly used term), and the distribution has to be determined from it.
Let's remember the comments on the series. In a probability problem, we would most likely be told that a good or bad comment occurs with a certain probability. In reality, we probably don't know anything about these probabilities when we first open the comments. What we can do is start reading them and note whether each one is positive or negative. How many comments do you need to read to be sure that the distribution has a certain form? The law of large numbers from probability theory helps here. It says that as the number of experiments grows, the observed frequencies become stable and approach the underlying probabilities. Thus, with a sufficient amount of data from random experiments, a statistician can determine important characteristics of the process.
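The law of large numbers can be illustrated with a small simulation. This is only a sketch: the function name `positive_share`, the fixed seed, and the "true" probability of 0.6 are assumptions made for the illustration, and real comments are not generated by a random number generator.

```python
import random

def positive_share(n_comments, p_positive=0.6, seed=0):
    """Simulate reading n_comments and return the share of positive ones.

    Each simulated comment is positive with probability p_positive,
    independently of the others (an assumption of this sketch).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    positives = sum(rng.random() < p_positive for _ in range(n_comments))
    return positives / n_comments

# The more comments we read, the closer the observed share tends to be to 0.6.
for n in (10, 100, 10_000):
    print(n, positive_share(n))
```

With only ten comments the observed share can be far from 0.6, while with thousands of comments it typically settles close to it.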
In a more general form, statistics can be defined as the science of gaining information from data. This includes both the methods of extracting information themselves and methods of collecting, storing and organizing data. The basis for the necessity of statistics as a science lies in the nature of data:
The complete data usually cannot be collected.
Data cannot be measured without errors.
The structure of the data can be complicated.
Methods of statistics can be roughly divided into three groups: design (how to collect data), description (summarizing and exploring the data), and inference (generalizing from the data and making predictions). Design and description form a branch of statistics called descriptive statistics. Inference forms another branch called inferential statistics, which is the theoretical basis for machine learning. The two branches are closely related: before inferential methods can be applied, the data often needs to be organized and described, which is exactly what descriptive statistics does.
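The difference between the two branches can be sketched on a tiny example. The ratings below are made up for the illustration, and the interval uses a normal approximation with the multiplier 1.96, which is a rough choice for such a small sample.

```python
import math
import statistics

# Hypothetical ratings left by ten viewers (invented for this example).
ratings = [9, 8, 10, 7, 9, 6, 8, 9, 10, 8]

# Descriptive statistics: summarize the sample we actually have.
sample_mean = statistics.mean(ratings)
sample_median = statistics.median(ratings)
sample_std = statistics.stdev(ratings)  # sample standard deviation

# Inferential statistics: say something about the whole population of viewers.
# Standard error of the mean and an approximate 95% interval for the true mean.
std_error = sample_std / math.sqrt(len(ratings))
low = sample_mean - 1.96 * std_error
high = sample_mean + 1.96 * std_error

print("mean:", sample_mean, "median:", sample_median)
print("approximate 95% interval:", round(low, 2), "to", round(high, 2))
```

The first half only describes the ten ratings; the second half uses them to make a hedged statement about ratings we have not seen.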
A sample is a fundamental concept in statistics: it is a subset of the entire space being explored, which is called the population. The main characteristic of a sample is its representativeness, which statistics tries to formalize as much as possible for any problem. Ideally, a sample is something like a picture with poor focus. The more values we add to the sample, the clearer the picture becomes and the closer it is to the reality that we want to photograph. Judging how well a possible reality agrees with the picture we have is what statistics calls likelihood.
Having summarized all the information above, you can make the following mental association for yourself and never again get confused about what is the theory of probability and what is statistics.
Probability vs Likelihood
In many of the tasks in this course, you will come across the words "probability" and "likelihood". The two terms are closely related, but their meanings are different. The essence of the difference mirrors the one we have just seen: probability belongs to probability theory, and likelihood belongs to statistics.
Probability describes, for a known population, the chance that the selected data has certain characteristics. Likelihood works the other way: given a limited set of observed data, it measures how plausible it is that the data came from a particular population. In other words, likelihood can be viewed as probability read in the inverse direction.
To get a feel for the concepts of probability and likelihood, compare two questions:
"What is the probability that the rating in Game of Thrones will be above eight?"
"What is the likelihood that the series is Game of Thrones if the rating is above eight?"
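To make the contrast concrete, here is a small sketch that reuses the comments example. It assumes, purely for illustration, that we read 10 comments and 7 of them were positive, and that comments are independent. The function name and the list of candidate values are hypothetical.

```python
from math import comb

def binomial_likelihood(p, positives, total):
    """Likelihood of the value p given the observed data.

    Numerically this is the binomial probability of seeing `positives`
    positive comments out of `total`, but here the data is fixed and
    p is the thing we vary.
    """
    return comb(total, positives) * p**positives * (1 - p) ** (total - positives)

# Observed data (assumed for this example): 7 positive comments out of 10.
observed_positives, observed_total = 7, 10

# Candidate values for the unknown probability of a positive comment.
candidates = [0.2, 0.4, 0.6, 0.8]
likelihoods = {
    p: binomial_likelihood(p, observed_positives, observed_total)
    for p in candidates
}

# The candidate that makes the observed data most plausible.
best = max(likelihoods, key=likelihoods.get)

for p, value in likelihoods.items():
    print(p, round(value, 4))
print("most plausible candidate:", best)
```

In a probability question we would fix p and ask about the data; here the data is fixed and the candidates for p are compared, which is the likelihood point of view.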
Conclusion
In this topic, we have learned that without statistics, raw data is just data. Statistics is used to turn the results of random experiments into meaning. We figured out how statistics relates to probability theory:
Probability theory is a science that studies the rules of random phenomena.
Statistics is a set of mathematical methods and tools that allow us to work with data.
And these are two complementary sciences.
Statistics has two critical branches — descriptive statistics, which deals with describing data and presenting them in different forms, and inferential statistics that extrapolates results from samples to the entire population.
We've seen that the difference between the concepts of probability and likelihood is similar to the difference between probability and statistics.
Let's practice!
Read more on this topic in Diving into Statistical Programming on Hyperskill Blog.