Population and sample

Mathematical statistics today is actively developed and used in big data technologies or machine learning. However, here we'll speak about the most basic term, sample, and why it is so important for data processing and analysis.

Population

The first term we consider is a statistical population, or just population. It is not only about people or animals, as we're used to thinking. In mathematical statistics, the population is the whole analyzed group of objects, either existing or hypothetically considered (and possibly infinite). We emphasize "whole" because it is the key to understanding this concept. Examples of populations are:

  1. Existing objects:

    • All people on the Earth

    • All men living in Moscow

    • All American stores that sold smartphones in 2015

  2. Hypothetically considered:

    • All possible hands in poker

    • All prime numbers

10 random people on the Sunday bus or your friend's three children are not populations.

The general aim of mathematical statistics is to investigate the chosen population and produce some information about it. However, as you might suspect, it is often impossible to gather information about every member of the population. Imagine that you would like to know which genre of literature is the most popular in a city with 5,000,000 residents. To obtain an accurate result, you would have to ask each resident, which is not an easy task. Some people don't like to talk to strangers, others are extremely busy and cannot be reached without a prior appointment. Not to mention how much this research would cost!

That is why almost no one ever analyzes the whole population. This is what samples are designed for!

Sample and sampling

A sample is a subset of the population that is used to analyze the whole population. Suppose that you are a smartphone manufacturer and would like to survey Moscow State University (MSU) students to find out which phone parameter is the most important to them. So, now you need to organize your sample. How do you do that? Let's consider several of the most widespread types of sampling (sampling is the process of organizing a sample).

  • Simple random sampling

This type of sampling assumes the uniformity of the population. In our case, it means that all students are considered as identical survey participants. We don't care about their gender, course of study, or financial status. We just label each MSU student with a number and take a random subset of this range for our purposes.

Pic. 1. Illustration of simple random sampling

Notice that simple random sampling requires a full list of the population members, and such a list is not always available.
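The labeling procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `simple_random_sample`, the population size of 40,000, and the sample size of 100 are assumptions made for the example, not real MSU figures.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Label members 1..population_size and draw sample_size distinct labels,
    every label being equally likely to be chosen."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    labels = range(1, population_size + 1)
    return rng.sample(labels, sample_size)  # sampling without replacement


# Hypothetical numbers: 40,000 students on the list, 100 of them surveyed.
sample = simple_random_sample(population_size=40_000, sample_size=100, seed=42)
print(len(sample), len(set(sample)))  # 100 labels, all of them distinct
```

Note that the function needs `population_size`, which is exactly the full list of members that this type of sampling relies on.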

  • Convenience sampling

Convenience sampling is convenient for you as a researcher. There are no special methods or prerequisites for selection: you just go to a popular place with a large number of potential respondents (in our case it could be the student canteen) and survey the students who are in the canteen at that moment. This type of sampling is usually chosen when the researcher is short on resources such as time or money.

Pic. 2. Illustration of convenience sampling

The benefit of such sampling is that you can conduct your survey without a list of the whole population. But think about the disadvantages of this method. Do all students prefer to eat in the canteen? What percentage of all students will go to the canteen on that particular day?

  • Quota sampling

This sampling is somewhere between convenience and simple random sampling. For it, you need some knowledge about the population's strata (for instance, the percentage of each age group: children, adults, and retirees). But you still don't need a list of all the population members. For instance, suppose you know from another survey that 60% of MSU students are female and 40% are male. You then organize the sample according to this distribution: if you would like to interview 100 people, then 60 of them should be female and 40 male.

Pic. 3. Illustration of quota sampling
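The quota arithmetic can be sketched as follows. The helper names `quota_sizes` and `fill_quotas` and the artificial stream of respondents are assumptions made for the illustration; the 60/40 split and the sample size of 100 come from the example above.

```python
def quota_sizes(shares, sample_size):
    """Turn stratum shares (fractions that sum to 1) into head counts.
    Rounding may need a manual adjustment when the shares do not divide evenly."""
    return {stratum: round(share * sample_size) for stratum, share in shares.items()}


def fill_quotas(stream, quotas, stratum_of):
    """Accept respondents in the order they are met while their stratum has room."""
    remaining = dict(quotas)
    chosen = []
    for person in stream:
        stratum = stratum_of(person)
        if remaining.get(stratum, 0) > 0:
            chosen.append(person)
            remaining[stratum] -= 1
        if not any(remaining.values()):  # every quota is filled
            break
    return chosen


quotas = quota_sizes({"female": 0.6, "male": 0.4}, sample_size=100)

# Hypothetical stream of students met one after another (not random at all).
stream = [{"id": i, "gender": "female" if i % 2 else "male"} for i in range(1000)]
sample = fill_quotas(stream, quotas, stratum_of=lambda person: person["gender"])
print(quotas, len(sample))
```

No list of the whole population is used here: the interviewer only needs the quotas and whoever happens to come by, which is why quota sampling is a non-random method.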

There are many types of sampling besides the three considered above. However, all of them fall into two general groups: probability sampling (such as simple random sampling) and non-probability sampling (such as convenience or quota sampling). Methods of the first type select the sample members at random with known probabilities, while methods of the second type are guided by other criteria (like a place or a predefined distribution of the sample members).

Statistic vs parameter

As mentioned above, samples allow analysts to get an objective picture of the population with respect to a particular aspect. They also help to save time and resources, which is extremely important, since a lot of research is conducted every day. Now that we've discussed how to get a sample, let's define what a sample is from the perspective of mathematical statistics.

Suppose that $X$ is the set of all population representatives. We collect a sample $x^m$ of $m$ elements using any sampling method:

$x^m = \{x_1, x_2, ..., x_m\}$, where $x_i \in X$

Now let's talk about sample characteristics. A statistic is any quantity computed from sample values. It is used for statistical analysis. But remember, the main goal of sampling is to describe the investigated quality for the whole population, that is, the population's own characteristics, which are called parameters.

A parameter is the same kind of quantity as a statistic, except that it is computed using all population values. However, computing a parameter directly is usually infeasible because it is pretty much impossible to get data about every representative of the population. A statistic that is used to estimate a parameter is called an estimator.
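As a small worked illustration, consider a yes/no characteristic. The symbols below are assumptions introduced only for this example: the population is finite with $N$ objects whose values are $y_1, ..., y_N$, each value is 1 if the object has the property and 0 otherwise, and $x_1, ..., x_m$ are the values observed in the sample. The population proportion $p$ is then a parameter, and the sample proportion $\hat{p}$ is a statistic that can serve as its estimator:

$$p = \frac{1}{N}\sum_{j=1}^{N} y_j, \qquad \hat{p} = \frac{1}{m}\sum_{i=1}^{m} x_i$$

Computing $p$ requires all $N$ values, while $\hat{p}$ needs only the $m$ values that were actually collected.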

Let's look at the example:

"The recent survey revealed that 56% of French students have visited Disneyland at least once."

Here 56% is a statistic, namely the percentage of the surveyed French students who have visited Disneyland at least once. The surveyed students form the sample. The population here is the set of all French students, and the parameter is the percentage of all French students who have visited Disneyland at least once.

It is extremely simple to remember which characteristic describes which set:

Statistic is a Sample characteristic, Parameter is a Population one.
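The difference between the two can also be seen in a small simulation. Everything in it is made up for the illustration: the population of 100,000 yes/no answers, the 60% chance of a "yes", the sample size of 500, and the helper name `proportion`. In real research only the statistic would be available, and the parameter is computed here only because the simulated population is fully known.

```python
import random


def proportion(values):
    """Share of ones in a list of 0/1 values."""
    return sum(values) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 1 = "yes", 0 = "no".
population = [1 if rng.random() < 0.6 else 0 for _ in range(100_000)]

# Parameter: computed from ALL population values.
parameter = proportion(population)

# Statistic: computed from the values of a simple random sample only.
sample = rng.sample(population, 500)
statistic = proportion(sample)

print(f"parameter (population): {parameter:.3f}")
print(f"statistic (sample):     {statistic:.3f}")
```

The two numbers are usually close but not equal: the statistic changes from sample to sample, while the parameter is a fixed property of the population.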

Representativeness

Before we move on to new concepts, let's recall the purpose of creating samples: analyzing the population while spending less time and fewer resources. That is why it is so important to generate a representative sample, that is, a sample that adequately reproduces the population with respect to a particular characteristic. A statistic computed from a representative sample should, in turn, adequately estimate the population parameter.

Let's go back to the survey in the student canteen. Does the canteen sample reflect the students' true "average" opinion? Maybe only economics students go to that building, and students from other departments were never asked. The important smartphone criteria could be very different for economics and physics students, and there are quite a lot of physicists at MSU. In that case, the canteen sample would be unrepresentative.
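The canteen problem can be imitated with a toy simulation. All the numbers are invented for the illustration: 2,000 economics and 8,000 physics students, and a 70% versus 30% chance of naming the camera as the most important parameter. The names `make_students` and `camera_share` are hypothetical helpers.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable


def make_students(department, count, p_camera):
    """Create `count` students; each names the camera with probability p_camera."""
    return [(department, rng.random() < p_camera) for _ in range(count)]


# Hypothetical population with two departments that have different preferences.
population = make_students("economics", 2_000, 0.7) + make_students("physics", 8_000, 0.3)


def camera_share(students):
    """Share of students who named the camera as the most important parameter."""
    return sum(cares for _, cares in students) / len(students)


parameter = camera_share(population)

# Convenience sample: 200 students met in the canteen, all from economics.
canteen_sample = [s for s in population if s[0] == "economics"][:200]

# Simple random sample of the same size, drawn from the full list.
random_sample = rng.sample(population, 200)

print(f"population parameter:      {parameter:.2f}")
print(f"canteen sample statistic:  {camera_share(canteen_sample):.2f}")
print(f"random sample statistic:   {camera_share(random_sample):.2f}")
```

Under these assumptions the canteen statistic lands near the economics students' share rather than near the population parameter, because the physics students never had a chance to be asked.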

One and the same sample can be representative for one purpose and unrepresentative for another! It depends on the characteristic we investigate. For instance, suppose we survey women in a mall (convenience sampling). When is this sample representative? It is if we want to present a new women's perfume and find out how many women like it. It is logical to conduct such research in the mall where the perfume shop is located because exactly those women are potential customers. At the same time, this sample is unrepresentative if we are researching the average income of the city's residents, because not only women work in that city, and men are unlikely to be so few that they can be ignored.

Conclusion

Let's summarize what we've discussed in this topic:

  1. Samples are the population's subsets that are used in research to save time/resources.

  2. There are many ways to collect samples. Some of them require prior knowledge about the population but can be more objective in some cases. Others can be organized as situational surveys and can also be useful.

  3. A statistic is a quantitative characteristic of a sample. Usually, we try to organize the sample so that its statistic adequately estimates the population parameter.

  4. The main aim of the sample in most cases is to be representative and approximately depict the dependencies in the whole population.
