
As a data scientist, you might want to build a dataset that includes date and time values, or visualize information from your dataset over a period of time. In this topic, you will learn the basics of working with time-series data in pandas: how to convert strings to the datetime type and how to assemble a date from multiple columns. You will also learn how to extract datetime components into separate columns and how to select data for a specific year or date range.

Getting started

Suppose we have a hotel and we are creating a DataFrame with the check-in date and the number of guests that visited the hotel:

import pandas as pd
import random

df = pd.DataFrame({'date': ['2022-06-01', '2022-06-02', '2022-06-03'],
                   'guest': [5, 10, 20]})

Our DataFrame looks like this:

         date  guest
0  2022-06-01      5
1  2022-06-02     10
2  2022-06-03     20

Let's check the data type of the hotel DataFrame we just created by calling df.dtypes:

date     object
guest     int64
dtype: object

We can see that the date column has the object dtype, which means pandas stores its values as generic Python objects (here, str). We can change the column type with pd.to_datetime():

df['date'] = pd.to_datetime(df['date'])

Let's run df.dtypes once again:

date     datetime64[ns]
guest             int64
dtype: object

You can see that the data type of the date column is datetime64[ns] ([ns] stands for the nanosecond-based time format), which shows that it is now recognized as a pandas datetime data type.

Datetime format

By default, to_datetime() displays parsed values as YYYY-MM-DD (everything recognized as a date is cast to that format) and assumes that an ambiguous input such as '7/06/2022' has the month first. If the input has the day before the month, dayfirst has to be set to True:

df = pd.DataFrame({'date': ['7/06/2022', '8/06/2022'],  # DD/MM/YYYY format
                   'guest': [5, 23]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

This is the DataFrame now:

        date  guest
0 2022-06-07      5
1 2022-06-08     23
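
To see why dayfirst matters, here is a minimal sketch that parses the same string both ways. The variable names are illustrative and are not part of the hotel example.

```python
import pandas as pd

raw = '7/06/2022'  # intended as DD/MM/YYYY, i.e. 7 June 2022

# Without dayfirst, pandas reads the first number as the month.
month_first = pd.to_datetime(raw)

# With dayfirst=True, the first number is read as the day.
day_first = pd.to_datetime(raw, dayfirst=True)

print(month_first.date())  # 2022-07-06
print(day_first.date())    # 2022-06-07
```

Both calls succeed without an error, so a wrong setting can go unnoticed unless you check the parsed values.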

When the date is in the YY/MM/DD format (such as '13/09/12'), there is a similar argument, yearfirst, which, if set to True, will output 2013-09-12.
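
As a short sketch of the yearfirst case, assuming the string really is in year/month/day order:

```python
import pandas as pd

# '13/09/12' is meant as YY/MM/DD here: 12 September 2013.
parsed = pd.to_datetime('13/09/12', yearfirst=True)

print(parsed.date())  # 2013-09-12
```

Two-digit years are inherently ambiguous, so an explicit format string is usually the safer choice when you know the layout in advance.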

Imagine we had an update in the check-in system, and we had to log the last guest's arrival time:

df = pd.DataFrame({'date': ['7/06/2022 5:35:00', '8/06/2022 19:45:30'],
                  'guest': [5, 23]})
df['date'] = pd.to_datetime(df['date'], format="%d/%m/%Y %H:%M:%S")

Passing a string of format codes to the format argument is the general way to ensure correct parsing. The result looks like this:

                 date  guest
0 2022-06-07 05:35:00      5
1 2022-06-08 19:45:30     23

If you want to represent the datetime values in a way other than YYYY-MM-DD, use the dt.strftime() method.

Note that dt.strftime() will change the column type to object (a Python string).
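
A minimal sketch of dt.strftime() in use. The date_str column name and the DD.MM.YYYY layout are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2022-06-07', '2022-06-08'],
                   'guest': [5, 23]})
df['date'] = pd.to_datetime(df['date'])

# Format each date as DD.MM.YYYY; the result holds strings, not datetimes.
df['date_str'] = df['date'].dt.strftime('%d.%m.%Y')

print(df['date_str'].tolist())  # ['07.06.2022', '08.06.2022']
```

Because the new column contains strings, it is usually worth keeping the original datetime column if you still need date arithmetic or the dt attributes.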

Assembling a date from various columns

Let's say we have separate columns with the year, month, and day, and want to create a column with the complete date. With the to_datetime() function, you can do that as well:

df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5]})
df['date'] = pd.to_datetime(df)

The DataFrame with an assembled date column now looks like this:

   year  month  day       date
0  2015      2    4 2015-02-04
1  2016      3    5 2016-03-05

Note that for this method of date assembly to work, the names of the DataFrame columns must be common abbreviations ('year', 'month', 'day', and so on) or their plurals.
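
The same approach extends to time components. Here is a minimal sketch, assuming hypothetical hour and minute columns that are not part of the example above:

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5],
                   'hour': [14, 9],
                   'minute': [30, 15]})

# Pass only the date and time columns to to_datetime().
parts = ['year', 'month', 'day', 'hour', 'minute']
df['timestamp'] = pd.to_datetime(df[parts])

print(df['timestamp'].tolist())
```

Selecting the relevant columns explicitly keeps the call working when the DataFrame also contains columns that are not date or time parts.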

Extracting additional features

When we want to get a separate component from the date column, we can use the attributes of the .dt accessor, which mirror those of DatetimeIndex. They range from dt.year to dt.nanosecond and include less obvious ones, such as dt.dayofweek or dt.is_leap_year:

df = pd.DataFrame({'date': ['4/03/2022', '5/03/2022'],
                   'guest': [16, 19]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

Our DataFrame will be as follows:

        date  guest  year  month  day
0 2022-03-04     16  2022      3    4
1 2022-03-05     19  2022      3    5
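
The less obvious attributes work the same way. A minimal sketch on the same two dates; the is_weekend column is a hypothetical derived feature added for illustration:

```python
import pandas as pd

df = pd.DataFrame({'date': ['4/03/2022', '5/03/2022'],
                   'guest': [16, 19]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

df['dayofweek'] = df['date'].dt.dayofweek        # Monday is 0, Sunday is 6
df['is_weekend'] = df['date'].dt.dayofweek >= 5  # Saturday or Sunday
df['is_leap_year'] = df['date'].dt.is_leap_year

print(df[['date', 'dayofweek', 'is_weekend', 'is_leap_year']])
```

Here 4 March 2022 is a Friday (4) and 5 March 2022 is a Saturday (5), so only the second row is flagged as a weekend.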

Finding out the difference between two dates

Getting back to the hotel check-in system, let's say we want to get an insight into how early people are planning their trips at different times of the year. The DataFrame looks like this:

  booking_id        booking_date        checkin_date
0   eLW3L2p3 2022-07-01 18:29:23 2022-07-16 05:52:51
1   8qgC2z6q 2022-05-16 15:09:13 2022-06-02 08:01:42
2   HVoJA4TT 2022-02-15 17:39:10 2022-04-30 02:53:20

We can obtain the difference between the two dates in various units. Let's say the number of days suits our purpose:

df['diff'] = (df['checkin_date'] - df['booking_date']).dt.days

Output:

  booking_id        booking_date        checkin_date  diff
0   eLW3L2p3 2022-07-01 18:29:23 2022-07-16 05:52:51    14
1   8qgC2z6q 2022-05-16 15:09:13 2022-06-02 08:01:42    16
2   HVoJA4TT 2022-02-15 17:39:10 2022-04-30 02:53:20    73
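
For a unit other than days, one option is to convert the interval to seconds first. The sketch below rebuilds the DataFrame from the dates shown in the output above; the lead_time variable and the diff_hours column are illustrative names, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'booking_id': ['eLW3L2p3', '8qgC2z6q', 'HVoJA4TT'],
    'booking_date': pd.to_datetime(['2022-07-01 18:29:23',
                                    '2022-05-16 15:09:13',
                                    '2022-02-15 17:39:10']),
    'checkin_date': pd.to_datetime(['2022-07-16 05:52:51',
                                    '2022-06-02 08:01:42',
                                    '2022-04-30 02:53:20']),
})

# Subtracting two datetime columns gives a Series of time intervals.
lead_time = df['checkin_date'] - df['booking_date']

df['diff'] = lead_time.dt.days                          # whole days only
df['diff_hours'] = lead_time.dt.total_seconds() / 3600  # fractional hours

print(df[['booking_id', 'diff', 'diff_hours']])
```

Note that dt.days drops the leftover hours, minutes, and seconds, which is why the first booking shows 14 days even though the interval is closer to 14.5 days.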

Filtering data within a certain time range

Let's create a synthetic dataset and select the slice over a three-day span. First, we generate the dataset:

df = pd.DataFrame({'guest': random.sample(range(10, 30), 7)})
df['date'] = pd.date_range(start='2023-01-02', end='2023-01-08')

Here, we generated a list of guest counts and specified the start and the end dates for the range. To select the data between two dates with .loc, we have to set the index to the date column and provide the start and the end of the range to .loc:

df = df.set_index(['date'])
df.loc['2023-01-05':'2023-01-07']

The result will look similar to this (the guest counts are generated randomly, so yours will differ):

            guest
date             
2023-01-05     28
2023-01-06     22
2023-01-07     16
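
With a datetime index, .loc also accepts a partial date string such as a year or a month. Here is a minimal sketch, with fixed guest counts and a date range chosen only for illustration:

```python
import pandas as pd

# Hypothetical guest counts over a range that crosses a year boundary.
dates = pd.date_range(start='2022-12-30', periods=5, name='date')
df = pd.DataFrame({'guest': [12, 18, 25, 14, 21]}, index=dates)

in_2023 = df.loc['2023']         # every row from the year 2023
in_december = df.loc['2022-12']  # every row from December 2022

print(in_2023)
print(in_december)
```

The first selection returns the three January rows and the second returns the two December rows.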

You can also use a boolean mask with .loc to have a slice of the DataFrame between the specified dates, but we won't be discussing it in this topic.

Conclusion

  • You can use the to_datetime() function to convert a DataFrame column to pandas datetime.
  • You can use df.loc[...] to select data either between two dates or for a specific date.
  • pd.date_range() helps to create a date sequence.
  • To extract specific datetime components, use dt.year, dt.month, dt.day, and so on.
  • You can subtract datetime columns to get the time interval between two dates.