Sampling

Sampling is a fundamental concept in statistics that involves selecting a subset of individuals or items from a larger population to make inferences about the entire group. By studying a sample rather than the whole population, statisticians can gather insights more efficiently, saving time and resources while maintaining accuracy.

In this topic, you'll learn about different sampling methods, including selecting samples with or without replacement, the importance of stratified sampling, and techniques to ensure your sample accurately represents the population.

Sampling

In statistics, a population refers to the entire group of individuals, items, or data points sharing a common characteristic. For instance, a population could include all the students in a school. However, due to the large size of many populations, collecting data from every individual can be impractical. This is where sampling comes in.

Sampling involves selecting a smaller subgroup, known as a sample, from the population. The key to effective sampling is ensuring that the sample is representative of the population, meaning it accurately reflects the characteristics, trends, and patterns of the whole group. This means that the insights gained from the sample can be generalized to the broader population.

To achieve a representative sample, it's important to choose the sample in a non-biased manner. This helps ensure that the data collected is fair and accurately represents the population as a whole. By carefully selecting a sample that mirrors the diversity and characteristics of the population, researchers can confidently apply their findings broadly.

Common probability sampling methods

Simple random sampling is the most basic form of probability sampling, where each population member has an equal chance of being selected. Every possible sample of a given size has the same likelihood of being chosen, which minimizes selection bias. For instance, if a company wants to survey 1,000 customers from its 50,000-customer database, it can assign each customer a unique number from 1 to 50,000 and use a random number generator to select 1,000 distinct numbers from that range. The survey is then sent to the 1,000 customers whose assigned numbers match the randomly selected ones. This approach gives each customer an equal chance of being part of the survey, so the sample is likely to reflect the broader customer base.
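The customer survey can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the variable names and the fixed seed are assumptions made so the run is repeatable.

```python
import random

POPULATION_SIZE = 50_000  # customers numbered 1..50,000, as in the example
SAMPLE_SIZE = 1_000

rng = random.Random(42)  # fixed seed (an assumption) so the run is repeatable

# sample() draws without replacement, so no customer number is picked twice.
selected_ids = rng.sample(range(1, POPULATION_SIZE + 1), SAMPLE_SIZE)

print(len(selected_ids))       # 1000
print(len(set(selected_ids)))  # 1000, every number is distinct
```

Drawing from a `range` avoids building a 50,000-element list just to pick from it.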

Systematic sampling involves selecting every nth individual from a population list after choosing a random starting point. It retains an element of randomness while being easier to carry out than simple random sampling. For example, if a company wants to survey 1,000 customers from a database of 50,000, the interval is 50,000 ÷ 1,000 = 50, so it picks a random starting point among the first 50 entries, say the 7th customer, and selects every 50th entry thereafter until 1,000 are chosen. This method works well when the list has no repeating pattern that coincides with the interval, and it spreads the sample evenly across the entire list.
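As a rough sketch, the selection from the example (start at position 7, then every 50th entry) can be written with a plain `range`. The starting position is fixed here to mirror the text; in practice it would be drawn at random from the first interval.

```python
POPULATION_SIZE = 50_000
SAMPLE_SIZE = 1_000

interval = POPULATION_SIZE // SAMPLE_SIZE  # 50,000 / 1,000 = 50
start = 7  # starting position taken from the example

# Positions 7, 57, 107, ... counted from 1 along the customer list.
selected_positions = list(range(start, POPULATION_SIZE + 1, interval))

print(interval)                 # 50
print(selected_positions[:3])   # [7, 57, 107]
print(len(selected_positions))  # 1000
```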

Stratified sampling divides a population into subgroups, or strata, based on characteristics like age, gender, or income. A random sample is then taken from each stratum, ensuring that all relevant subgroups are represented in the final sample for more accurate results.

For example, if a company wants to understand customer opinions by age, it might divide 50,000 customers into three age groups: 18-29, 30-49, and 50+. Suppose these groups contain 20,000, 15,000, and 15,000 customers and the company wants a sample of 1,500. In proportionate stratified sampling, the company would select 600, 450, and 450 customers from the three groups, respectively, matching each group's share of the sample to its share of the population (40%, 30%, and 30%). In disproportionate stratified sampling, the company might instead choose 500 from each group to ensure equal representation, regardless of group size. Stratified sampling is especially useful when specific subgroups must be represented accurately, since it reduces sampling error and improves precision.
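The two variants can be compared in a short Python sketch. The group sizes (20,000, 15,000, and 15,000) and the total sample of 1,500 are assumptions chosen to be consistent with the counts in the example, and the customer IDs are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed (an assumption) for repeatability

# Hypothetical strata: customer IDs grouped by age band.
# The group sizes are assumed for illustration.
strata = {
    "18-29": list(range(1, 20_001)),       # 20,000 customers
    "30-49": list(range(20_001, 35_001)),  # 15,000 customers
    "50+": list(range(35_001, 50_001)),    # 15,000 customers
}
TOTAL_SAMPLE = 1_500
population_size = sum(len(ids) for ids in strata.values())

# Proportionate: each stratum's share of the sample equals its population share.
proportionate = {
    name: rng.sample(ids, TOTAL_SAMPLE * len(ids) // population_size)
    for name, ids in strata.items()
}

# Disproportionate: the same number from every stratum, regardless of its size.
per_stratum = TOTAL_SAMPLE // len(strata)
disproportionate = {
    name: rng.sample(ids, per_stratum) for name, ids in strata.items()
}

print({name: len(s) for name, s in proportionate.items()})
# {'18-29': 600, '30-49': 450, '50+': 450}
print({name: len(s) for name, s in disproportionate.items()})
# {'18-29': 500, '30-49': 500, '50+': 500}
```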

Sample size and representativeness

The size of the sample significantly influences how representative it is of the population. For example, if you're dealing with a population of 5,000 individuals:

  • A sample size of 5,000 would be very representative because it includes everyone’s views.

  • A sample size of 1,000 would still be quite representative; the data collected would likely reflect the overall population's opinions.

  • A sample size of just 10 would likely not be representative. Such a small sample is insufficient to capture the diversity of views within the population, so the results would be unreliable.

As a rough rule of thumb when determining sample size:

  • For populations under 100, it's ideal to include every individual in the sample.

  • For larger populations, the sample should be at least 100 items or 10% of the population, whichever is larger.
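The two guidelines above can be expressed as a small helper. This is only a sketch of the rule of thumb as stated: the function name is hypothetical, and rounding 10% up to a whole item is an assumption.

```python
def rule_of_thumb_sample_size(population_size: int) -> int:
    """Suggested minimum sample size under the rule of thumb above."""
    if population_size < 100:
        # Small population: include every individual.
        return population_size
    # Ten percent of the population, rounded up (rounding is an assumption).
    ten_percent = (population_size + 9) // 10
    return max(100, ten_percent)


print(rule_of_thumb_sample_size(80))     # 80
print(rule_of_thumb_sample_size(500))    # 100
print(rule_of_thumb_sample_size(5_000))  # 500
```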

Sample size also determines the interval in systematic sampling. If you're selecting a sample of 6 towns from a population of 18 towns, you'd divide the total number of towns by the sample size to get the interval. Here, 18 ÷ 6 = 3, so after a random start among the first 3 towns you would select every 3rd town, which spreads the sample evenly across the list.
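A minimal sketch of the town example, assuming the towns are held in an ordered list and that the starting point is drawn at random within the first interval. The function name and the town labels are hypothetical.

```python
import random


def systematic_sample(items, sample_size, rng):
    """Pick every k-th item after a random start, where k = len(items) // sample_size."""
    interval = len(items) // sample_size
    start = rng.randrange(interval)  # 0-based start within the first interval
    # Slice with a step, then trim in case the list is not an exact multiple.
    return items[start::interval][:sample_size]


towns = [f"Town {i}" for i in range(1, 19)]  # 18 hypothetical towns
sample = systematic_sample(towns, 6, random.Random(1))

print(len(sample))  # 6
print(sample)       # six towns, each 3 positions apart
```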

Choosing the best sampling method

Consider a school with 500 students divided into five year groups. The distribution of students across these groups is as follows:

Year group            7     8     9     10    11
Number of students    50    180   70    100   100

Joseph aims to determine how many students bring packed lunches by selecting a sample of 50 students. The most appropriate sampling method would be stratified sampling.

Stratified sampling involves selecting items from each subgroup (or stratum) in proportion to their presence in the overall population. This method is ideal because:

  • Each year group has a different number of students.

  • Each year group forms a separate stratum.

  • This approach ensures that the sample will be the most representative of the population.

Stratified sampling is most representative when the number of items chosen from each stratum is proportional to the size of that stratum within the population. This way, the sample accurately reflects the population's overall structure.
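Applying this proportionality to Joseph's sample of 50 from the 500 students gives the following allocation, worked out from the figures in the table:

$$
\begin{aligned}
\text{Year 7:} \quad & \frac{50}{500} \times 50 = 5 \\
\text{Year 8:} \quad & \frac{180}{500} \times 50 = 18 \\
\text{Year 9:} \quad & \frac{70}{500} \times 50 = 7 \\
\text{Year 10:} \quad & \frac{100}{500} \times 50 = 10 \\
\text{Year 11:} \quad & \frac{100}{500} \times 50 = 10
\end{aligned}
$$

The five counts add up to 50, the full sample size.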

Stratified sampling — calculations

In stratified sampling, the number of items selected from each stratum should match the stratum’s proportion within the overall population. This method guarantees that the sample accurately reflects the population's composition.

Consider a town with a population of 100, divided into three distinct strata:

  • 50 women

  • 30 men

  • 20 children

A researcher wants to survey a sample of 10 townsfolk to determine how many people like living in the town. To maintain proportionality, the researcher would calculate how many individuals from each stratum should be included in the sample:

  • Women: Since women make up $\frac{50}{100}$ of the population, $\frac{50}{100} \times 10 = 5$ women should be randomly chosen.

  • Men: Men represent $\frac{30}{100}$ of the population, so $\frac{30}{100} \times 10 = 3$ men should be randomly chosen.

  • Children: Children constitute $\frac{20}{100}$ of the population, therefore $\frac{20}{100} \times 10 = 2$ children should be randomly chosen.

This distribution is proportional to the overall population, ensuring that the sample is representative.

The general formula to determine the number of items from each stratum in a stratified sample is:

$$n_i = \left(\frac{\text{Size of Stratum}}{\text{Size of Population}}\right) \times \text{Sample Size}$$

This method ensures that each subgroup within the population is appropriately represented in the sample, preserving the integrity of the sampling process.
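The formula translates directly into code. The sketch below uses a hypothetical helper, `proportional_allocation`, and exact fractions so the town example comes out as whole numbers. When the result is not a whole number, some rounding scheme is needed, which this sketch does not cover.

```python
from fractions import Fraction


def proportional_allocation(stratum_sizes, sample_size):
    """Return n_i = (stratum size / population size) * sample size per stratum."""
    population_size = sum(stratum_sizes.values())
    return {
        name: Fraction(size, population_size) * sample_size
        for name, size in stratum_sizes.items()
    }


town = {"women": 50, "men": 30, "children": 20}
allocation = proportional_allocation(town, 10)

print({name: int(n) for name, n in allocation.items()})
# {'women': 5, 'men': 3, 'children': 2}
```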

Sampling with replacement

Sampling with replacement involves drawing items from a population, recording the outcome, and then returning the item to the population before the next draw. This ensures that each item can be selected more than once and that each draw is independent of the others. This method is commonly used in probability calculations, where independence of selections is crucial.

Consider a population of 5 individuals: Alice, Bob, Charlie, Dana, and Eve. If you want to sample 2 individuals with replacement, you would draw one name, record it, put it back, and then draw another name. Possible outcomes include:

  • Alice, Alice

  • Alice, Bob

  • Bob, Charlie

  • Charlie, Dana

  • Eve, Alice

In this scenario, each individual has a 1 in 5 chance of being selected each time. The probability of drawing any specific pair can be calculated using the multiplication rule. For instance:

The probability of selecting Alice twice is:

$$P(\text{Alice, Alice}) = \frac{1}{5} \times \frac{1}{5} = \frac{1}{25}$$

Similarly, the probability of selecting Alice first and Bob second is:

$$P(\text{Alice, Bob}) = \frac{1}{5} \times \frac{1}{5} = \frac{1}{25}$$

Since each draw is independent, the probability of each possible outcome is the product of the probabilities of each draw.
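To see the full outcome space, the sketch below enumerates every ordered pair for the five names and then simulates a single draw. The seed and variable names are assumptions for illustration.

```python
import random
from fractions import Fraction
from itertools import product

people = ["Alice", "Bob", "Charlie", "Dana", "Eve"]

# All ordered pairs with repeats allowed: 5 * 5 = 25 equally likely outcomes.
outcomes = list(product(people, repeat=2))
p_each = Fraction(1, len(people)) ** 2

print(len(outcomes))                   # 25
print(p_each)                          # 1/25
print(("Alice", "Alice") in outcomes)  # True

# choices() samples with replacement, so the same name may come up twice.
rng = random.Random(3)  # fixed seed (an assumption) for repeatability
draw = rng.choices(people, k=2)
print(draw)
```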

Now, let’s explore what happens when sampling without replacement, where each item is not returned to the population after being selected. This change affects the probability calculations and introduces dependencies between draws.

Sampling without replacement

Sampling without replacement changes probability calculations significantly because each selected item is not returned to the population before the next draw. This dependency affects the likelihood of subsequent selections. For instance, with a population of 5 individuals—Alice, Bob, Charlie, Dana, and Eve—if you want to sample 2 individuals without replacement, you can't select the same person twice. The possible outcomes would include:

  • Alice, Bob

  • Alice, Charlie

  • Bob, Charlie

  • Charlie, Dana

  • Dana, Eve

In this scenario, when you choose the first individual, you have a probability of $\frac{1}{5}$ for each possible choice. After selecting one individual, you have only 4 remaining individuals for the second selection, so the probability for each remaining choice is $\frac{1}{4}$.

The probability of selecting Alice first and Bob second, without replacement, is:

$$P(\text{Alice, Bob}) = \frac{1}{5} \times \frac{1}{4} = \frac{1}{20}$$

Similarly, the probability of selecting Alice first and Charlie second is:

$$P(\text{Alice, Charlie}) = \frac{1}{5} \times \frac{1}{4} = \frac{1}{20}$$
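The same enumeration without replacement is sketched below: ordered pairs can no longer repeat a name, which leaves 20 outcomes. The seed and variable names are assumptions for illustration.

```python
import random
from fractions import Fraction
from itertools import permutations

people = ["Alice", "Bob", "Charlie", "Dana", "Eve"]

# Ordered pairs of two different people: 5 * 4 = 20 equally likely outcomes.
ordered_pairs = list(permutations(people, 2))
p_each = Fraction(1, 5) * Fraction(1, 4)

print(len(ordered_pairs))                   # 20
print(p_each)                               # 1/20
print(("Alice", "Alice") in ordered_pairs)  # False

# sample() draws without replacement, so the two names always differ.
rng = random.Random(3)  # fixed seed (an assumption) for repeatability
draw = rng.sample(people, 2)
print(draw)
```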

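The dependency between the two selections leads to different probabilities compared to sampling with replacement. As the sample size increases relative to the population, these effects become more pronounced. To measure the extent of the dependency, you can calculate the covariance between two draws. With replacement, the draws are independent and the covariance is zero. Without replacement, for a population of $N$ values with variance $\sigma^2$, the covariance is $-\frac{\sigma^2}{N-1}$: it is negative because drawing a large value first leaves smaller values behind, and it shrinks toward zero as the population grows, which is why the two schemes behave almost identically for very large populations.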
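The covariance between the first and second draw can be computed exactly for a small case. The sketch below assumes a hypothetical numeric population of the values 1 to 5 and treats every ordered pair as equally likely. The helper name is an illustration, not a library function.

```python
from fractions import Fraction
from itertools import permutations, product


def draw_covariance(pairs):
    """Covariance of (first draw, second draw) over equally likely ordered pairs."""
    pairs = list(pairs)
    n = len(pairs)
    mean_first = Fraction(sum(x for x, _ in pairs), n)
    mean_second = Fraction(sum(y for _, y in pairs), n)
    mean_product = Fraction(sum(x * y for x, y in pairs), n)
    return mean_product - mean_first * mean_second


population = [1, 2, 3, 4, 5]  # hypothetical numeric values

cov_with = draw_covariance(product(population, repeat=2))
cov_without = draw_covariance(permutations(population, 2))

print(cov_with)     # 0
print(cov_without)  # -1/2
```

The negative value without replacement reflects that a high first draw makes a high second draw slightly less likely.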

Conclusion

Stratified sampling ensures that different subgroups within a population are proportionally represented in a sample, leading to a more accurate and representative snapshot of the entire population. Sampling can be conducted with or without replacement: in sampling with replacement, each selected item is returned to the population, allowing for the possibility of choosing the same item more than once, while in sampling without replacement, each item is selected only once, affecting the probabilities of subsequent selections. These methods together help researchers obtain a sample that accurately reflects the population's diversity, enabling valid inferences about the entire population.
