Diagrams: how to build and read

Statistics is a field that offers tools to gather, analyze, and interpret data. Diagrams are crucial in this process, as they simplify the visualization of complex information.

Importantly, diagrams assist in identifying patterns, noticing irregularities, and enhancing understanding of your data. Indeed, viewing a graph is significantly easier than analyzing a table crammed with redundant figures. When you aim to share your findings with a non-technical audience, a concise diagram is an effective tool!

"If you don't know your destination, no wind is favorable." (Seneca)

First, note that each variable takes its values from a specific set. Numeric variables denote measurable quantities, such as age, income, or temperature. Categorical variables, by contrast, represent qualitative data that can take only a limited set of values, like dog breeds, cuisine types, or music genres.
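As a rough illustration, the type of a variable can often be guessed from its values. The sketch below is a minimal heuristic in Python; the helper name `variable_kind` and the sample values are made up for this example, and real datasets may need more care (for instance, numeric codes that actually label categories).

```python
from numbers import Real

def variable_kind(values):
    """Guess whether a column is numeric or categorical (simple heuristic)."""
    # bool is a subclass of int in Python, so exclude it explicitly
    if all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "categorical"

ages = [23, 35, 41, 58]                    # hypothetical measurable amounts
genres = ["jazz", "rock", "jazz", "pop"]   # hypothetical labels from a small set

print(variable_kind(ages))    # numeric
print(variable_kind(genres))  # categorical
```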

Next, decide whether your diagram will represent a single variable or several. This decision matters because it determines the kind of diagram that best depicts the data. A single numeric variable may call for a histogram, while multiple variables may call for a scatter plot or line chart. We'll explore some of these examples in the sections below, and upcoming topics cover other types of diagrams.

Finally, in the spirit of Seneca's remark, determine the diagram's purpose. It could be to illustrate a distribution, identify relationships, or pinpoint trends and outliers. Knowing why you're creating a diagram helps you select the right kind and ensures the end result is useful.

Building the diagram

The axes form the basis of your diagram. It's essential to understand what each axis represents. The x-axis typically stands for the independent variable, while the y-axis represents the dependent variable. Knowing the function of each axis is vital for accurate diagram interpretation.

The scale of the axes also plays a significant role. It can drastically alter your perception of the data. Regardless of whether it's in dollars, pounds, or percentages, the scale needs to be both consistent and suitable for the data displayed. Remember to add symbols like % or $ to provide additional context.

Identifying patterns or trends in the data can provide valuable insights. A trend might demonstrate steady temperature growth over time, while a pattern could expose the cyclic nature of yearly sales. Before plotting a diagram, you might not know anything about your data, but with its aid, you can uncover intriguing patterns.

Outliers or anomalies in the data can be just as crucial as the trends and patterns. They could point out data collection errors or reveal a rare yet noteworthy event that requires further exploration. Why does a particular value stand out so much? This could prompt a deeper investigation. Now, let's put theory into practice and get started.

Example: distribution

A histogram can help you visualize the distribution of a single numerical variable. Let's use 153 temperature records as an example (note that the temperatures are in degrees Fahrenheit):

Temperature
67
72
74
62
56

To build a histogram, you begin by identifying the data range, that is, the minimum and maximum values. Then, divide the range into bins of equal width. For example, you can divide it into 30 bins, or choose another number that suits your data. Keep in mind that the number of bins can drastically affect how readable the histogram is. Too many bins lead to a cluttered histogram, while too few oversimplify the data. A good bin width gives a clear picture of the data distribution.
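The binning step can be sketched as follows. This is a minimal Python illustration that uses only the five sample records shown above; the helper name `bin_edges` is made up for this example, and plotting libraries usually compute the edges for you.

```python
def bin_edges(values, n_bins):
    """Return n_bins + 1 equally spaced edges covering the data range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins          # every bin has the same width
    return [lo + i * width for i in range(n_bins + 1)]

temps = [67, 72, 74, 62, 56]            # the five sample records
edges = bin_edges(temps, 3)
print(edges)  # [56.0, 62.0, 68.0, 74.0]
```

With three bins the range 56 to 74 is cut into pieces of width 6; asking for 30 bins would return 31 edges over the same range.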

Next, count the number of observations that fall within each bin. The height of each bar reflects that count, so you can quickly see where the data clusters and where it is sparse. For example, if the values 50.3, 52.8, and 57.4 are checked against the bin from 50 to 56, only 50.3 and 52.8 fall inside it and add 2 to its count, while 57.4 belongs to a higher bin. The resulting histogram will look something like this:
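The counting step can be sketched like this. It is a minimal Python illustration with a hypothetical helper `bin_counts`; the convention that each bin includes its left edge and only the last bin includes its right edge is an assumption (a common one, but tools differ).

```python
def bin_counts(values, edges):
    """Count values per bin: [left, right) for every bin, last bin also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue                      # outside the covered range
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

temps = [67, 72, 74, 62, 56]              # the five sample records
edges = [56, 62, 68, 74]                  # three bins of width 6
print(bin_counts(temps, edges))           # [1, 2, 2]
```

Each count becomes the height of one bar, and the counts always add up to the number of observations inside the covered range.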

Temperature distribution

You may observe that the majority of the temperatures range from 75 to 90 degrees. Extreme temperature values, both high and low, are scarce. In addition, the distribution appears somewhat symmetrical.

Let's introduce colors for improved clarity. Suppose the temperatures were recorded in three different countries. This means that the country is a categorical variable and each country has its own temperature distribution.

Temperature    Country
67             Germany
72             Switzerland
74             Austria
62             Austria
56             Switzerland

Comparing the temperature distributions of the three countries would be an interesting exercise here. An alternative to the histogram is the density plot; you can think of it as a smoother version of a histogram. Let's see how the visualization changes with these parameters:

Distribution per country
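To make the idea of a per-country density more concrete, here is a minimal Python sketch. It splits the sample rows by country and builds a Gaussian kernel density estimate by hand. The helper name `gaussian_kde`, the Gaussian kernel, and the bandwidth of 5 degrees are assumptions chosen for illustration, not necessarily what produced the plot above.

```python
import math
from collections import defaultdict

records = [(67, "Germany"), (72, "Switzerland"), (74, "Austria"),
           (62, "Austria"), (56, "Switzerland")]   # sample rows from the table

# Split the numeric variable by the categorical one
by_country = defaultdict(list)
for temp, country in records:
    by_country[country].append(temp)

def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` at a point x."""
    norm = len(sample) * bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        # Sum one Gaussian bump per observation, then normalize
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm
    return density

# Bandwidth controls the smoothness, much like the bin width of a histogram
austria = gaussian_kde(by_country["Austria"], bandwidth=5.0)
print(round(austria(68), 4))
```

Evaluating one such function per country over a grid of temperatures and drawing each curve in its own color gives a plot that can be compared across countries.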

Example: time series

A time series chart is an excellent tool for illustrating how a variable changes over time. In the financial world, for example, you could use this to monitor the daily closing price of a stock or the annual returns on an investment.

The dataset we're using contains the values of the SMI financial index, a stock market index composed of the top 20 stocks in the Swiss market listed on the Zurich Stock Exchange. The data was collected daily from 1991 to 1998:

Date          Value
1991-01-01    1678.42
1991-01-02    1688.94
1991-01-03    1679.21
1991-01-04    1684.47
1991-01-05    1687.95

When creating a time series chart, you place the dates on the x-axis and their corresponding values on the y-axis. The key lies in connecting each point to the next one in time with a straight line segment, which depicts how the variable evolves:

SMI financial index

The graph shows a distinct upward trend. Note the unusually large increase towards the end of 1992. Furthermore, as time progresses, the trend continues upward but faces more frequent peaks and valleys, indicating increased volatility.
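The preparation behind such a chart can be sketched in a few lines of Python. This is a minimal illustration using only the five sample rows from the table above (shuffled on purpose); the names `rows`, `points`, and `segments` are made up for this example, and drawing the chart itself would still require a plotting library.

```python
from datetime import date

rows = [("1991-01-03", 1679.21), ("1991-01-01", 1678.42), ("1991-01-02", 1688.94),
        ("1991-01-05", 1687.95), ("1991-01-04", 1684.47)]   # sample rows, shuffled

# Parse the dates and sort chronologically so the line runs left to right
points = sorted((date.fromisoformat(d), v) for d, v in rows)

# Each segment joins one observation to the next one in time
segments = list(zip(points, points[1:]))

for (d0, v0), (d1, v1) in segments:
    print(f"{d0} -> {d1}: {v1 - v0:+.2f}")   # change along the segment
```

Sorting first matters: connecting the points in file order rather than date order would make the line jump back and forth.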

Given the simplicity of the line plot, you can compare numerous time series simultaneously. Suppose we introduce a new categorical variable that signifies four different indices:

Date          Value      Index
1991-01-01    1678.42    SMI
1991-01-01    1629.35    DAX
1991-01-01    1773.82    CAC
1991-01-01    2444.35    FTSE
1991-01-02    1688.94    SMI

Financial indices comparison

In this scenario, it's evident that all the time series depict an increasing trend. The CAC index has the lowest values, and even though the FTSE index begins at higher values, it is overtaken in early 1995. Can you identify any other noticeable patterns?
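Before several lines can be drawn on one chart, the table has to be split into one series per index. A minimal Python sketch, using only the sample rows shown above and illustrative names (`rows`, `series`):

```python
from collections import defaultdict

rows = [("1991-01-01", 1678.42, "SMI"), ("1991-01-01", 1629.35, "DAX"),
        ("1991-01-01", 1773.82, "CAC"), ("1991-01-01", 2444.35, "FTSE"),
        ("1991-01-02", 1688.94, "SMI")]   # sample rows from the table

# Group the (date, value) pairs by the categorical variable
series = defaultdict(list)
for day, value, index in rows:
    series[index].append((day, value))

for points in series.values():
    points.sort()                         # ISO-formatted dates sort correctly as strings

print(sorted(series))   # ['CAC', 'DAX', 'FTSE', 'SMI']
print(series["SMI"])    # [('1991-01-01', 1678.42), ('1991-01-02', 1688.94)]
```

Each entry of `series` then becomes one line on the chart, usually with its own color and a legend entry.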

Conclusion

  • First, define the purpose of the diagram and what message you want it to communicate. Then, analyze your variables to determine which diagram would suit best.

  • Ensure you attend to the axes, labels, legends, and scales. Your diagram should be easy to read and understand. Create straightforward labels for the axes and include a legend when needed to explain any color or symbol coding. Aim to enhance both the appearance and readability of your diagram.

  • A histogram helps you see the distribution of a numerical variable. The number of bins shapes the appearance of your diagram. A density plot is a smoother version of the histogram and makes it easier to compare multiple distributions at once.

  • A line plot displays how a variable changes over time, making it easy to identify trends, spot outliers, and compare multiple time series at once.
