Data visualization is a practical tool that helps you understand complex data sets and discover patterns or relationships between variables. One of the most widely used visualization techniques is the scatter plot.
A scatter plot is a two-dimensional graphical representation of a set of data. Each data point is plotted on a Cartesian coordinate system, with the x-coordinate representing one variable and the y-coordinate representing another. The purpose of a scatter plot is to visualize the relationship or correlation between two numerical variables, ultimately providing insights into how changes in one variable affect the other.
Visualizing the relationships between variables is crucial in many fields, from business to healthcare to social sciences, as it helps to identify trends, correlations, and potential outliers in the data. By spotting these, you can make informed decisions, develop strategies, or create predictive models.
Discovering relationships
If you had hundreds of pieces of data stored in a table, could you find patterns or relationships between them just by looking at them? What if there were thousands? By graphing all the information, you can save hours of work, discover connections in the data, and easily share results with coworkers who may not have as much background in statistics as you.
A scatter plot primarily consists of an x-axis (horizontal) and a y-axis (vertical), forming a grid where data points are plotted. Each point on the plot represents an observation from the data set, with its position along the x and y axes reflecting its values for the two variables.
The x-axis typically represents the independent variable, while the y-axis represents the dependent variable. The independent variable is the one that is manipulated or controlled in the experiment, while the dependent variable is the one being measured or observed. Your main objective is to investigate how the dependent variable behaves as the independent variable changes.
The role of each variable is crucial in portraying relationships, as the pattern of plotted points can indicate whether there's a positive correlation (both variables increase together), a negative correlation (one variable decreases as the other increases), or no correlation.
How to Create a Scatter Plot
Preparing a scatter plot involves several steps. Try following the steps using the data in the table to reconstruct the scatterplot from the previous section.
| X | Y |
|---|---|
| -8 | -25.3 |
| -5 | -3.2 |
| -2 | -13.48 |
| 1 | -2.75 |
| 3 | 6.36 |
| 6 | 13.44 |
| 8 | 30.72 |
| 10 | 21.51 |
- Collecting data: First of all, gather data for both the independent and dependent variables. This could be from an experiment, a survey, or any other data collection method. Typically, you store the data in a table with a column for each variable and a row for each record.
- Determining the range and units for each variable: Identify the minimum and maximum values for each variable to set the range for the x and y axes. Ensure the units of measurement are consistent for all data points. For example, if you have some temperature data in degrees Celsius and others in degrees Fahrenheit, try to convert them all to the same unit of measurement.
- Plotting the data points: For each observation, find the corresponding values on the x and y axes and mark the point where they intersect.
- Labeling the axes: Add labels to the axes with the respective variables they represent. Optionally, you can give the scatter plot a title that succinctly describes what it represents.
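The steps above can be sketched in Python. This is a minimal illustration, not a prescribed workflow: it assumes the matplotlib library is installed for the drawing step, and the helper names `axis_range` and `plot_scatter`, the 10% padding, and the output file name are all choices made for this example.

```python
# Data copied from the X/Y table above.
xs = [-8, -5, -2, 1, 3, 6, 8, 10]
ys = [-25.3, -3.2, -13.48, -2.75, 6.36, 13.44, 30.72, 21.51]


def axis_range(values, padding=0.1):
    """Return (low, high) axis limits covering the values plus a small margin."""
    low, high = min(values), max(values)
    margin = (high - low) * padding  # padding is an illustrative choice
    return low - margin, high + margin


def plot_scatter(xs, ys, path="scatter.png"):
    """Plot the points, label the axes, and save the figure to `path`."""
    # Imported here so axis_range works even if matplotlib is not installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(xs, ys)             # one marker per (x, y) observation
    ax.set_xlim(*axis_range(xs))   # range determined from the data
    ax.set_ylim(*axis_range(ys))
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    ax.set_title("Y versus X")
    fig.savefig(path)
    plt.close(fig)
    return path


print("x range:", axis_range(xs))
print("y range:", axis_range(ys))
```

Calling `plot_scatter(xs, ys)` then writes the diagram to `scatter.png` in the current directory.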
The more data you have, the more tedious it will be to make the diagram yourself. For this reason, it is common to use software that automates this process. This considerably speeds up your work and allows you to focus on what is most important: analyzing patterns in the data. Let's see how to do it!
Analyzing and finding patterns
The simplest way in which variables can be related is linear. As you know, this connection is measured through the correlation coefficient, which takes values from -1 to 1. The larger its absolute value, the stronger the relationship. When it is positive, both variables increase together; when it is negative, one increases while the other decreases.
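The scatter plot is the ideal tool to analyze correlation. In the following image you can see the general trend of the data for different values of the correlation coefficient. Note that the sign of the coefficient tells you whether the trend line rises or falls, while its strength does not depend on how steep the line is.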
Note that the stronger the relationship, the closer the points lie to the trend line. As the dispersion increases, the correlation weakens and it becomes more difficult to find a clear trend.
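To make the correlation coefficient concrete, here is a small sketch that computes the Pearson coefficient by hand for the X/Y table from the earlier walkthrough. The function name `pearson_r` is chosen for this example; the formula is the standard one, covariance divided by the product of the spreads.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Data copied from the X/Y table in the walkthrough.
xs = [-8, -5, -2, 1, 3, 6, 8, 10]
ys = [-25.3, -3.2, -13.48, -2.75, 6.36, 13.44, 30.72, 21.51]

r = pearson_r(xs, ys)
print(f"r = {r:.2f}")
```

For this table the coefficient comes out at roughly 0.93, a strong positive linear relationship, which matches the upward trend you see when the points are plotted.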
Not all relationships are linear. In the wild, variables are often connected in more exotic ways. Even if both increase at the same time, the dependent variable may grow faster or slower than the independent one. Typical examples of each behavior are quadratic and logarithmic growth, respectively. Note that in the following images the trend is no longer linear. Do you think the latter is unlikely to arise in reality? Wait for the last section!
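One way to check for a logarithmic relationship is to compare the correlation before and after taking the logarithm of the independent variable. The sketch below uses synthetic data generated from an exact logarithmic curve; the coefficients 2 and 3 and the range 1 to 50 are arbitrary values chosen for illustration, not real measurements.

```python
from math import log, sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Synthetic data (an assumption for this example): y = 2 + 3 * ln(x).
xs = list(range(1, 51))
ys = [2.0 + 3.0 * log(x) for x in xs]

r_raw = pearson_r(xs, ys)                    # x against y
r_log = pearson_r([log(x) for x in xs], ys)  # ln(x) against y

print(f"correlation with x:     {r_raw:.3f}")
print(f"correlation with ln(x): {r_log:.3f}")
```

The correlation with the raw variable is positive but below 1, while the correlation with its logarithm is essentially 1. A curved point cloud that straightens out after a log transform is a good hint that the relationship is logarithmic.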
When variables are unrelated, no pattern is visible in your diagram. In this case, knowing the value of one variable tells you nothing about the other.
Another advantage of the scatter plot is that you can easily identify outliers in the data. An outlier is a value of a variable that is substantially greater or less than the rest of the values. For example, in the last diagram, one point departs substantially from the point cloud because its value of the dependent variable is considerably smaller than the rest. Would you have identified this observation so quickly by looking at the records in a table?
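Outliers can also be flagged programmatically. The sketch below applies the common 1.5 × IQR rule to the dependent variable. That rule is a convention picked for this example rather than something the text prescribes, and the points are made-up values used only to show the mechanics.

```python
from statistics import quantiles


def iqr_outliers(points, k=1.5):
    """Return the (x, y) points whose y lies outside the IQR fences."""
    ys = [y for _, y in points]
    q1, _, q3 = quantiles(ys, n=4)  # first quartile, median, third quartile
    spread = q3 - q1                # interquartile range
    low, high = q1 - k * spread, q3 + k * spread
    return [(x, y) for x, y in points if y < low or y > high]


# Hypothetical observations: nine similar y values and one very small one.
points = [
    (1, 10), (2, 12), (3, 11), (4, 13), (5, 12),
    (6, 14), (7, 11), (8, 13), (9, 12), (10, -20),
]

print(iqr_outliers(points))
```

Only the point with the unusually small y value is reported. Note that this rule looks at one variable at a time; for a point cloud with a strong trend, measuring each point's distance from the trend line is often a better criterion.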
Drawing Conclusions
Once you have finished studying a scatterplot, it is time to draw conclusions. Let's see how to do it using real data sets.
Suppose a company has decided to increase its sales by investing in advertising in different media. After compiling the data for television, radio and newspaper, the results are as follows:
- In all three cases there seems to be a positive trend: sales increase by investing more money in advertising. However, the trend is different in each medium.
- In general, television seems to be the best medium. The trend is steeper (greater slope). More importantly, the data is less spread out, which suggests that a straight line would model sales in terms of advertising well.
- The more money invested in radio advertising, the more sales are achieved. Although the relationship isn't very strong, the real problem is the dispersion of the data. There are times when even though a lot of money is spent, sales are few. For instance, there is an investment of almost 1M in sales! You've discovered an outlier!
- Investing in newspaper advertising is not the best option. There is no clear relationship between the variables. Although investing more in advertising sometimes generates more sales, in general there is no clear pattern. These are the most dispersed data, and without the trend line it would seem that the variables are independent.
- In conclusion, the data suggests that investing in newspaper advertising is not a good idea. The safest option is television. Since the correlation is so strong, a possible next step would be to build a model such as linear regression. Radio is a good alternative, but investing in it would be more uncertain.
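As an illustration of that next step, here is a minimal least-squares line fit. The advertising figures are not reproduced in the text, so the sketch reuses the small X/Y table from the scatter plot walkthrough; the function name `fit_line` is chosen for this example.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept of the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return slope, intercept


# Data copied from the X/Y table in the walkthrough.
xs = [-8, -5, -2, 1, 3, 6, 8, 10]
ys = [-25.3, -3.2, -13.48, -2.75, 6.36, 13.44, 30.72, 21.51]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")

# Use the fitted line to estimate y for a new x value.
new_x = 5
print(f"estimate at x = {new_x}: {slope * new_x + intercept:.2f}")
```

The fitted line is only as trustworthy as the point cloud is tight. With widely dispersed data, as described for newspaper advertising, the same code would still return a line, but its estimates would be unreliable.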
One of the most useful qualities of scatter diagrams is that they allow information from other auxiliary variables to be incorporated. For example, if you have a third variable that is categorical, you could fill each dot with a color corresponding to its category.
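A rough sketch of coloring points by category follows. It assumes matplotlib is installed for the drawing step, and the observations, category names, and color palette are hypothetical values invented for illustration.

```python
from collections import defaultdict

# Hypothetical observations as (x, y, category); values are made up.
observations = [
    (1.0, 2.1, "A"), (2.0, 3.9, "A"), (3.0, 6.2, "B"),
    (4.0, 7.8, "B"), (5.0, 10.1, "C"),
]
# Assumed palette: one matplotlib color name per category.
palette = {"A": "tab:blue", "B": "tab:orange", "C": "tab:green"}


def group_by_category(rows):
    """Map each category to a pair of lists: its x values and its y values."""
    groups = defaultdict(lambda: ([], []))
    for x, y, category in rows:
        groups[category][0].append(x)
        groups[category][1].append(y)
    return dict(groups)


def plot_by_category(rows, colors, path="scatter_by_category.png"):
    """Draw one colored group of markers per category and save the figure."""
    # Imported here so the grouping helper works without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for category, (gx, gy) in group_by_category(rows).items():
        ax.scatter(gx, gy, color=colors[category], label=category)
    ax.legend(title="Category")
    fig.savefig(path)
    plt.close(fig)
    return path


print(group_by_category(observations))
```

Calling `plot_by_category(observations, palette)` saves the colored diagram. A fourth, numerical variable can be encoded in the same way through marker size, using the `s` argument of `ax.scatter`.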
Let's look at a more realistic example where you take full advantage of the scatter diagram. Gapminder, an organization dedicated to advancing data-driven knowledge on global development, provides a comprehensive data set to explore and understand socioeconomic differences between countries over time. Here you can see a diagram that shows the behavior of life expectancy in relation to GDP per capita. Each observation corresponds to a country over recent years. The continent is a categorical variable, and the size of each point reflects the country's population:
- The most important thing is that the higher the GDP per capita, the longer the life expectancy.
- But the trend isn't linear, so the correlation coefficient alone doesn't describe it well. But then what kind of trend is it? Exactly, it is logarithmic! Who said that logarithms don't appear in real life?
- Thanks to the other variables we can take our analysis further. The countries with the lowest GDP per capita belong to Africa and Asia. Notably, these countries have life expectancies of less than years, in fact, it is not uncommon for them to be less than !
- The most populated countries tend to belong to Asia, have a small GDP per capita and their life expectancies range from to years.
- Countries in Europe and America tend to have smaller populations and their GDP per capita exceeds $K and can reach almost $K. Life expectancy is more stable here, from to years.
- Although the patterns are clear, there are still atypical countries that move away from the point cloud. This means that some places have large GDP per capita but a life expectancy of less than years.
- Keep investigating! What other observations can you make with the diagram?
Conclusion
You have just added a new tool to your visualization skills. Recall that:
- Scatter plots provide a clear and concise representation of data points, enabling you to identify patterns, trends, correlations and outliers.
- Data can be related in a variety of ways, from something as simple as a line to something as exotic as a logarithm.
- By interpreting scatter plots, you can make informed decisions and draw meaningful insights from the data.
- You can always add other variables to enrich your diagram. This can help you improve your analysis and reach compelling conclusions.