In this topic, we'll cover key preprocessing techniques for time series data. We'll learn about resampling and interpolation, removing trends and seasonal patterns, smoothing methods, and checking for stationarity. These techniques will help you prepare your time series data for accurate analysis and modeling.
Resampling and interpolation
Resampling changes the frequency of observations in a time series. It's useful when you need to match data from different sources or examine data at different time scales. For example, you might want to convert daily data to weekly or monthly data. Interpolation is used to fill in missing values in a time series.
Resampling helps you compare data recorded at different intervals. For instance, if you're looking at sales data and have daily records for some products but weekly records for others, you'd need to resample the daily data to weekly to compare them properly. Interpolation helps keep your data complete when there are gaps:
import pandas as pd
import numpy as np
# Create a sample time series
dates = pd.date_range(start='2023-01-01', end='2023-01-31', freq='D')
data = pd.Series(np.random.randn(31), index=dates)
# Resample to weekly frequency
weekly_data = data.resample('W').mean()
# Interpolate missing values
data_with_gaps = data.copy()
data_with_gaps.loc[['2023-01-05', '2023-01-15', '2023-01-25']] = np.nan
interpolated_data = data_with_gaps.interpolate()
print("Original data:\n", data.head())
print("\nWeekly resampled data:\n", weekly_data)
print("\nInterpolated data:\n", interpolated_data.head())In this example, we first resample the daily data to weekly using the .resample() method. Then, we create gaps in the data and use the .interpolate() method to fill in the missing values. By default, this uses linear interpolation, but you can also use other methods like polynomial or spline interpolation for more complex data.
There are a few caveats to keep in mind. Resampling to a lower frequency (downsampling) can result in a loss of detail, while resampling to a higher frequency (upsampling) might add false precision. Interpolation can introduce bias, especially if there are long gaps in the data or if the missing values aren't random.
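To make the upsampling caveat concrete, here is a small sketch: converting the weekly series back to daily simply creates gaps, and any fill method we choose invents values rather than recovering the lost detail.
# Upsampling to daily frequency creates NaN rows for the new timestamps
daily_again = weekly_data.resample('D').asfreq()
# Forward fill repeats the last weekly value; the original daily detail is gone
daily_filled = daily_again.ffill()
print(daily_again.head(8))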
Removing trends and seasonal patterns
Removing trends and seasonal patterns involves taking out long-term changes and repeating patterns from time series data. This process helps isolate other patterns and make the data stationary, which is a requirement for many time series analysis methods.
This is important because it allows you to focus on other factors affecting your data. For example, if you're analyzing a store's sales data, it might show an upward trend due to business growth and peaks during holidays. By removing these parts, you can identify other factors affecting sales and potentially uncover hidden patterns.
There are different ways to remove trends and seasonal patterns. One common method is decomposition, which can be additive (where components are added together) or multiplicative (where components are multiplied):
import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
# Create a sample time series with trend and seasonality
# Use two full years: seasonal_decompose requires at least two complete seasonal cycles
dates = pd.date_range(start='2023-01-01', end='2024-12-31', freq='D')
trend = np.linspace(0, 10, len(dates))
seasonality = np.sin(np.arange(len(dates)) * 2 * np.pi / 365)
noise = np.random.normal(0, 1, len(dates))
data = pd.Series(trend + seasonality + noise, index=dates)
# Perform seasonal decomposition
result = seasonal_decompose(data, model='additive', period=365)
# Remove trend and seasonal patterns
detrended = data - result.trend
deseasonalized = data - result.seasonal
print("Original data:\n", data.head())
print("\nData without trend:\n", detrended.head())
print("\nData without seasonal pattern:\n", deseasonalized.head())In this example, we use the seasonal_decompose() function to split the time series into trend, seasonal, and leftover parts. We then subtract the trend and seasonal parts to get data without these patterns. This helps reveal patterns that might be hidden by long-term trends or seasonal changes.
Other methods for removing trends include differencing (subtracting the previous value from each observation) and fitting and subtracting a regression line, as sketched below. The best method depends on your specific data and analysis goals.
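Here is a brief sketch of both, applied to the same series:
# First-order differencing: each value minus its predecessor
differenced = data.diff().dropna()
# Fit a straight line to the series and subtract it
x = np.arange(len(data))
slope, intercept = np.polyfit(x, data.values, 1)
linear_detrended = data - (slope * x + intercept)
print("Differenced data:\n", differenced.head())
print("\nLinearly detrended data:\n", linear_detrended.head())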
Smoothing methods
Smoothing methods are used to reduce noise and clarify patterns. These techniques help identify trends that might be hidden by short-term changes or random noise. Smoothing is particularly useful when focusing on long-term trends or preparing data for forecasting.
For example, daily stock prices can fluctuate significantly. By smoothing the data, you can more easily see the overall direction the stock is moving.
Let's look at two common smoothing methods: Moving Average and Exponential Smoothing.
import pandas as pd
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Create a sample time series with noise
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
trend = np.linspace(0, 10, len(dates))
noise = np.random.normal(0, 1, len(dates))
data = pd.Series(trend + noise, index=dates)
# Apply Simple Moving Average
window_size = 7
moving_average = data.rolling(window=window_size).mean()
# Apply Exponential Smoothing
exp_smoothing = ExponentialSmoothing(data, trend='add').fit().fittedvalues
print("Original data:\n", data.head())
print("\nMoving Average (window=7):\n", moving_average.head())
print("\nExponential Smoothing:\n", exp_smoothing.head())In this example, we first use a Simple Moving Average with a 7-day window. This method calculates the average of the last 7 days for each point. Then, we apply Exponential Smoothing, which gives more importance to recent data points.
Moving averages are simple to understand and work well for data without strong trends. Exponential smoothing is good for data with trends and can adapt more quickly to changes. Other smoothing methods include LOESS (Locally Estimated Scatterplot Smoothing) and the Savitzky-Golay filter, both of which can handle more complex patterns.
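As a brief sketch of those two (the window size and smoothing fraction below are arbitrary choices; the Savitzky-Golay filter requires SciPy):
from scipy.signal import savgol_filter
from statsmodels.nonparametric.smoothers_lowess import lowess
# LOESS: frac sets the share of points used in each local regression
loess_smoothed = lowess(data.values, np.arange(len(data)), frac=0.1, return_sorted=False)
# Savitzky-Golay: fits a low-order polynomial inside each sliding window
savgol_smoothed = savgol_filter(data.values, window_length=15, polyorder=2)
print("LOESS (first 5):", loess_smoothed[:5])
print("Savitzky-Golay (first 5):", savgol_smoothed[:5])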
Checking if data is stationary
A stationary time series has statistical properties that don't change over time, such as its mean and variance. Many time series models assume the data is stationary, so it's important to check for stationarity before using these models.
Non-stationary data can lead to wrong results in many statistical analyses. For example, two unrelated non-stationary series might look related just because they both have a trend, which could lead to incorrect conclusions.
There are several tests to check for stationarity. One common method is the Augmented Dickey-Fuller (ADF) test:
import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller
# Create a sample non-stationary time series
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
trend = np.linspace(0, 10, len(dates))
noise = np.random.normal(0, 1, len(dates))
non_stationary_data = pd.Series(trend + noise, index=dates)
# Create a sample stationary time series
stationary_data = pd.Series(noise, index=dates)
def check_stationarity(timeseries):
    result = adfuller(timeseries, autolag='AIC')
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {result[1]}')
    print('Critical Values:')
    for key, value in result[4].items():
        print(f'\t{key}: {value}')
    if result[1] <= 0.05:
        print("The time series is stationary")
    else:
        print("The time series is not stationary")
print("Non-stationary data test:")
check_stationarity(non_stationary_data)
print("\nStationary data test:")
check_stationarity(stationary_data)
In this example, we use the adfuller() function to run the ADF test on both non-stationary and stationary time series. The test returns a p-value, which we use to determine whether the series is stationary. A p-value less than 0.05 suggests that the time series is stationary.
Other tests for stationarity include the KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test and the Phillips-Perron test. If your data isn't stationary, you can often make it stationary by differencing (subtracting the previous value from each observation) or by removing trends and seasonal patterns.
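As a quick sketch of both ideas (note that the KPSS test reverses the hypotheses: its null hypothesis is that the series is stationary, so a small p-value points to non-stationarity):
from statsmodels.tsa.stattools import kpss
# KPSS test: the null hypothesis is stationarity, the opposite of the ADF test
kpss_stat, kpss_pvalue, _, kpss_crit = kpss(non_stationary_data, regression='c')
print(f'KPSS p-value: {kpss_pvalue}')
# Differencing often makes a trending series stationary
differenced = non_stationary_data.diff().dropna()
check_stationarity(differenced)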
Conclusion
In this topic, we've covered several important techniques:
Resampling and interpolation for changing data frequency and filling in missing values.
Removing trends and seasonal patterns to focus on other factors.
Smoothing techniques to reduce noise and find underlying patterns.
Checking if data is stationary to meet the requirements of many time series models.
These preprocessing steps work together to ensure your time series data is clean, consistent, and ready for various analysis techniques. In a typical workflow, you might start by resampling your data to the appropriate frequency, then interpolate any missing values. Next, you could remove trends and seasonal patterns if needed. Smoothing can help clarify patterns, and finally, you'd check if your data is stationary and make it stationary if it's not.
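As a rough illustration of that workflow, assuming raw_data is a pandas Series with a DatetimeIndex and occasional gaps (the frequency, window, and differencing choices here are placeholders you would adapt to your own data):
# A minimal end-to-end sketch of the preprocessing workflow described above
processed = raw_data.resample('D').mean()      # 1. standardize the frequency
processed = processed.interpolate()            # 2. fill gaps
processed = processed.diff().dropna()          # 3. remove the trend by differencing
smoothed = processed.rolling(window=7).mean()  # 4. smooth out noise
check_stationarity(smoothed.dropna())          # 5. verify stationarity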