As a data scientist, you might want to create a dataset for your project that has a date and time value, or visualize information from your dataset over a period of time. In this topic, you will learn the basics of working with time-series data in pandas, how to convert strings to the datetime type, and how to assemble a date from multiple columns. You will also learn how to extract datetime components into separate columns and how to select data for a specific year or date range.
Getting started
Suppose we have a hotel and we are creating a DataFrame with the check-in date and the number of guests that visited the hotel:
import pandas as pd
import random
df = pd.DataFrame({'date': ['2022-06-01', '2022-06-02', '2022-06-03'],
                   'guest': [5, 10, 20]})
Our DataFrame looks like this:
date guest
0 2022-06-01 5
1 2022-06-02 10
2 2022-06-03 20
Let's check the data type of the hotel DataFrame we just created by calling df.dtypes:
date object
guest int64
dtype: object
We can see that the date column is recognized as object (a Python object, str). We can change the column type with pd.to_datetime():
df['date'] = pd.to_datetime(df['date'])
Let's run df.dtypes once again:
date datetime64[ns]
guest int64
dtype: object
You can see that the data type of the date column is datetime64[ns] ([ns] indicates nanosecond precision), which shows that it is now recognized as a pandas datetime data type.
Datetime format
By default, to_datetime() displays parsed dates as YYYY-MM-DD (everything recognized as a date will be cast to that format). If the input has the day before the month, dayfirst has to be set to True so that the day and the month are not swapped:
df = pd.DataFrame({'date': ['7/06/2022', '8/06/2022'], #DD/MM/YYYY format
'guest': [5, 23]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
This is the DataFrame now:
date guest
0 2022-06-07 5
1 2022-06-08 23
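To see why dayfirst matters, here is a minimal sketch that parses the same string with and without the argument. The variable names are illustrative only and are not part of the hotel example.

```python
import pandas as pd

raw = '7/06/2022'  # intended as 7 June 2022 (DD/MM/YYYY)

# Without dayfirst, the first number is read as the month
month_first = pd.to_datetime(raw)
# With dayfirst=True, the first number is read as the day
day_first = pd.to_datetime(raw, dayfirst=True)

print(month_first.date())  # 2022-07-06
print(day_first.date())    # 2022-06-07
```

Both calls succeed without any warning, so an ambiguous string can silently end up in the wrong month if dayfirst is left out.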
When the date is in the YY/MM/DD format (such as '13/09/12'), there is a similar argument, yearfirst, which, if set to True, will output 2013-09-12.
Imagine we had an update in the check-in system, and we had to log the last guest's arrival time:
df = pd.DataFrame({'date': ['7/06/2022 5:35:00', '8/06/2022 19:45:30'],
'guest': [5, 23]})
df['date'] = pd.to_datetime(df['date'], format="%d/%m/%Y %H:%M:%S")
Passing a string of format codes to format is the general way to ensure the correct parsing:
date guest
0 2022-06-07 05:35:00 5
1 2022-06-08 19:45:30 23
If you want to represent the datetime dates in a way other than YYYY-MM-DD, use the dt.strftime() method.
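As a minimal sketch of such reformatting, the snippet below writes the formatted dates into a separate column. The column name date_text and the chosen format string are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2022-06-07 05:35:00', '2022-06-08 19:45:30'],
                   'guest': [5, 23]})
df['date'] = pd.to_datetime(df['date'])

# Hypothetical extra column: the same dates rendered as DD.MM.YYYY HH:MM text
df['date_text'] = df['date'].dt.strftime('%d.%m.%Y %H:%M')

print(df['date_text'].tolist())  # ['07.06.2022 05:35', '08.06.2022 19:45']
```

Keeping the original datetime column next to the formatted one means the dt attributes and date arithmetic stay available.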
Note that dt.strftime() will change the column type to object (a Python string).
Assembling a date from various columns
Let's say we have separate columns with the year, month, and day, and want to create a column with the complete date. With the to_datetime() function, you can do that as well:
df = pd.DataFrame({'year': [2015, 2016],
'month': [2, 3],
'day': [4, 5]})
df['date'] = pd.to_datetime(df)
The DataFrame with an assembled date column now looks like this:
year month day date
0 2015 2 4 2015-02-04
1 2016 3 5 2016-03-05
Note that for this method of date assembly to work, the DataFrame columns must use the standard names ('year', 'month', 'day', and so on) or their plurals.
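The same approach extends to time components. The sketch below assumes an additional 'hour' column and a hypothetical 'guest' column that is not part of the date, so only the date-part columns are passed to to_datetime():

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5],
                   'hour': [10, 18],
                   'guest': [7, 12]})  # hypothetical non-date column

# Pass only the columns that describe the date and time
df['date'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])

print(df['date'].tolist())
```

Selecting the columns explicitly keeps unrelated data such as the guest counts out of the assembly.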
Extracting additional features
When we want to get a separate component from the date column, there are several datetime attributes available through the dt accessor, ranging from dt.year to dt.nanosecond (or less obvious ones, such as dt.dayofweek or dt.is_leap_year):
df = pd.DataFrame({'date': ['4/03/2022', '5/03/2022'],
'guest': [16, 19]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
Our DataFrame will be as follows:
date guest year month day
0 2022-03-04 16 2022 3 4
1 2022-03-05 19 2022 3 5
Finding out the difference between two dates
Getting back to the hotel check-in system, let's say we want to get an insight into how early people are planning their trips at different times of the year. The DataFrame looks like this:
booking_id booking_date checkin_date
0 eLW3L2p3 2022-07-01 18:29:23 2022-07-16 05:52:51
1 8qgC2z6q 2022-05-16 15:09:13 2022-06-02 08:01:42
2 HVoJA4TT 2022-02-15 17:39:10 2022-04-30 02:53:20
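The topic does not show how this DataFrame is built, so here is one possible sketch. It assumes the dates arrive as strings and takes the check-in values from the output table further down. Both columns are converted with pd.to_datetime() so that they can be subtracted later.

```python
import pandas as pd

df = pd.DataFrame({
    'booking_id': ['eLW3L2p3', '8qgC2z6q', 'HVoJA4TT'],
    'booking_date': ['2022-07-01 18:29:23', '2022-05-16 15:09:13',
                     '2022-02-15 17:39:10'],
    'checkin_date': ['2022-07-16 05:52:51', '2022-06-02 08:01:42',
                     '2022-04-30 02:53:20'],
})

# Convert both string columns to datetime before doing date arithmetic
for col in ['booking_date', 'checkin_date']:
    df[col] = pd.to_datetime(df[col])

print(df.dtypes)
```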
We can obtain the difference between the two dates in various units. Let's say the number of full days suits our purpose:
df['diff'] = (df['checkin_date'] - df['booking_date']).dt.days
Output:
booking_id booking_date checkin_date diff
0 eLW3L2p3 2022-07-01 18:29:23 2022-07-16 05:52:51 14
1 8qgC2z6q 2022-05-16 15:09:13 2022-06-02 08:01:42 16
2 HVoJA4TT 2022-02-15 17:39:10 2022-04-30 02:53:20 73
Filtering data within a certain time range
Let's create a synthetic dataset and select the slice over a three-day span. First, we generate the dataset:
df = pd.DataFrame({'guest': random.sample(range(10, 30), 7)})
df['date'] = pd.date_range(start='2023-01-02', end='2023-01-08')
Here, we generated a column of random guest counts and specified the start and the end dates for the range. To select the data between two dates with .loc, we have to set the date column as the index and provide the start and the end of the range to .loc (both ends are included):
df = df.set_index(['date'])
df.loc['2023-01-05':'2023-01-07']
That will result in a similar DataFrame (ignoring the randomly generated guest counts):
guest
date
2023-01-05 28
2023-01-06 22
2023-01-07 16
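A date index also allows selecting by a partial date string, such as a whole year or month. The following is a minimal sketch with fixed, made-up guest counts and a hypothetical date range that crosses the year boundary, so that the selection is visible:

```python
import pandas as pd

# Fixed guest counts (instead of random ones) keep the example reproducible
df = pd.DataFrame(
    {'guest': [12, 25, 17, 28, 22, 16, 19]},
    index=pd.date_range(start='2022-12-29', end='2023-01-04', name='date'),
)

by_year = df.loc['2023']      # every row dated in 2023
by_month = df.loc['2022-12']  # every row dated in December 2022

print(by_year)
print(by_month)
```

Here by_year holds the four January rows and by_month holds the three December rows.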
You can also use a boolean mask with .loc to have a slice of the DataFrame between the specified dates, but we won't be discussing it in this topic.
Conclusion
- You can use the to_datetime() function to convert a DataFrame column to pandas datetime.
- You can use df.loc[...] to select data either between dates or for a specific date.
- pd.date_range() helps to create a date sequence.
- To extract specific datetime components, use dt.year, dt.month, dt.day, and so on.
- You can subtract datetime columns to get the time interval between two dates.