Working with datetimes

Datetimes can be a very useful datatype to work with. We can use crandas to deal with date and datetime data in a way where datetimes that were uploaded with different formats can still be correctly compared. In this section we will learn how to upload dates and datetimes to the engine and the datetime-specific functions we have that allow us to extract properties from the datetime data that we have.

Uploading Datetimes

When uploading datetimes as strings we have to be more explicit than when uploading integers or strings. In order for crandas to know that the column that you are trying to upload contains datetimes, you must specify the datetime ctype.

import crandas as cd
import pandas as pd

# The datetimes are in ISO 8601 format, but formats accepted by strptime are accepted
datetimes = ["1991-02-04T09:36:55.100", 
            "2023-08-30T21:15:23.550",
            "2023-09-06T12:15:55.830"]
# We need to add the ctype or it won't work
tab = cd.DataFrame({"datetimes" : datetimes}, ctype={"datetimes": "datetime"})

Note

Many standard crandas operations such as the group-by and the join also work normally for datetime columns (with some exceptions covered here). This includes the aggregates min() and max(), which report the earliest and latest datetime respectively. On the other hand, aggregates such as sum() are ill-defined for datetimes and cannot be used.

There are certain data formats that are ambiguous. For example, it is not clear whether "01/02/2000" refers to the first of February or the second of January. When using ambiguous formats, you can add the following format specification to the ctype:

# We need to add the ctype or it won't work
tab = cd.DataFrame(
    {"ambiguous_datetimes": ["01/02/2000 15:12:00", "02/04/1991 20:00:55", "03/12/1993 05:30:30"]},
    ctype={"ambiguous_datetimes": "datetime[format=%d/%m/%Y %H:%M:%S]"},
)

The accepted options can be found here.

Python has multiple ways to represent datetimes, including datetime.datetime, pandas.Timestamp and numpy.datetime64. When uploading an existing DataFrame with one of these representations, the resulting DataFrames will have the relevant column as a datetime.

pydatetimes = pd.to_datetime(
    ["19910204T15:00:00", "2023-08-10T11:00:00", "2023-09-06 21:15:00", 
    "March 14, 2016 09:15:50", "19010101T10:15:30", "20991231T23:59:59"],
    format="mixed",
)
tab = cd.DataFrame({"datetimes": pydatetimes})

In some cases, it is more natural to work with dates rather than datetimes, for instance if only the date is known. For this purpose, crandas also offers a dedicated date type. Compared to datetimes, dates in crandas are more efficient to work with, are simpler to handle, and also support arithmetic. When uploading dates as strings, it is necessary to specify the date ctype.

# The dates are in ISO 8601 format, but formats accepted by strptime are accepted
dates = ["19910204", "20230830", "20230906", "20160314"]
# We need to add the ctype or it won't work
tab =  cd.DataFrame({"dates" : dates}, ctype={"dates": "date"})

Similar to datetimes, also the date column allows for a custom format to be specified as date[format=...].

Warning

Due to technical limitations, crandas can only handle datetimes/dates between 1901-01-01 and 2099-12-31.

Datetime resolution

Crandas allows for datetimes to be stored with configurable resolution. By default, datetimes are stored with nanosecond precision. However, such a high resolution is not necessary for all applications, nor are datetime measurements with such a resolution always available. Crandas allows for a different resolution to be specified through the ctype:

datetimes = ["1991-02-04T09:36:55.492493",
            "2023-08-30T21:15:23.593294",       
            "2023-09-06T12:15:55.149291"]

# By specifying resolution=us, the datetime column will have microsecond resolution
tab =  cd.DataFrame({"dt" : datetimes}, ctype={"dt": "datetime[resolution=us]"})

The supported resolutions are s, ms, us and ns. Choosing a coarser resolution can improve the performance of operations involving datetime columns.

Timezones

In certain applications, it is relevant to keep track of the timezone information associated with a datetime. For this purpose, crandas allows for UTC offsets to be specified for datetimes. The UTC offset is the difference in hours and minutes from UTC.

A datetime column which has UTC offsets is referred to as timezone-aware, and can be created as follows:

datetimes = ["1991-02-04T09:36:55.100+02:00",   # UTC+02:00
            "2023-08-30T21:15:23.550-03:15",    # UTC-03:15
            "2023-09-06T12:15:55.830+05:00"]    # UTC+05:00
tab =  cd.DataFrame({"dt" : datetimes}, ctype={"dt": "datetime"})

A datetime column without such timezone information, such as in the other examples in this user guide, is referred to as timezone-naive.

Warning

Group-by and join operations are only supported for timezone-naive datetimes. This is due to the fact that datetimes with a different UTC offset can nevertheless reflect the same moment in time and would therefore be matched, which could be perceived as unexpected.

Date/time component functions

Given a column of dates/datetimes, we might be interested in which of them happen in the same year or month, or at the same hour. Crandas contains functions that allow the extraction of year, month, day, day of the year, weekday (starting on Monday) along with the hour, minute, second, microsecond, nanosecond component of the datetime.

Following the pandas convention, these can be accessed through tab["datetimes"].dt as follows. For the date units:

datetimes = ["1991-02-04T09:36:55.100", 
            "2023-08-30T21:15:23.550",
            "2023-09-06T12:15:55.830"]
tab = cd.DataFrame({"datetimes" : datetimes}, ctype={"datetimes": "datetime"})

date_units = tab[:]
date_units["year"] = tab["datetimes"].dt.year
date_units["month"] = tab["datetimes"].dt.month
date_units["day"] = tab["datetimes"].dt.day
date_units["weekday"] = tab["datetimes"].dt.weekday
date_units["day_of_year"] = tab["datetimes"].dt.day_of_year

>>> date_units.open()
    datetimes                   year    month   day     weekday day_of_year     
0   1991-02-04 09:36:55.100     1991    2       4       0       35
1   2023-08-30 21:15:23.550     2023    8       30      2       242
2   2023-09-06 12:15:55.830     2023    9       6       2       249

In a similar way, the time units can be accessed as:

time_units = tab[:]
time_units["hour"] = tab["datetimes"].dt.hour
time_units["minute"] = tab["datetimes"].dt.minute
time_units["second"] = tab["datetimes"].dt.second
time_units["microsecond"] = tab["datetimes"].dt.microsecond
time_units["nanosecond"] = tab["datetimes"].dt.nanosecond

>>> time_units.open()
    datetimes                   hour    minute  second  microsecond     nanosecond
0   1991-02-04 09:36:55.100     9       36      55      100000          0
1   2023-08-30 21:15:23.550     21      15      23      550000          0
2   2023-09-06 12:15:55.830     12      15      55      830000          0

When using these operations, the resulting column is an integer column and no longer has a datetime format, but acts like any other integer column.

Tip

If the datetime column is timezone-aware, then the returned date/time components reflect the local time.

Note

Following the convention of pandas, the millisecond component is contained within the microseconds as returned by tab["datetimes"].dt.microsecond. Therefore, there is no separate millisecond property.

In addition to extracting integer components, it is also possible to extract the date from a datetime column. In contrast to the other extraction methods, the result is a crandas date column.

date = tab["datetimes"].dt.date

>>> date.open()
0   1991-02-04
1   2023-08-30
2   2023-09-06

Filtering through datetimes

We can also use crandas to filter over a table, we can choose a specific datetime range or choose any datetime after a particular one. Datetime columns can be treated as any column type, as long as they are compared to python/pandas datetime objects.

# Upload datetimes with millisecond resolution
datetimes = ["1991-02-04T09:36:55.492",
            "2023-08-30T21:15:23.593",       
            "2023-09-06T12:15:55.149"]
tab = cd.DataFrame({"datetimes" : datetimes}, ctype={"datetimes": "datetime[resolution=ms]"})

# Specify datetime filter also with millisecond resolution
filter_datetime = pd.Timestamp("2016-03-14 15:00:00.000")

filtered = tab[tab["datetimes"] >= filter_datetime]

>>> filtered.open()
    datetimes
0   2023-08-30 21:15:23.593
1   2023-09-06 12:15:55.149

Tip

For comparisons involving datetime columns, using the same datetime resolution is recommended. If resolutions differ, crandas will emit a DatetimeImpreciseOperationWarning; consider aligning resolutions (e.g., via tab["datetimes"] = tab["datetimes"].astype("datetime[resolution=ms]")) to avoid ambiguity.

Calculations over dates

We can also use crandas to find how many days happened between two dates or finding the date after a determined amount of days. To do this, we can simply add and subtract columns as normal.

dates = ["19910204", "30/08/23", "2023-09-06", "March 14, 2016"]
tab = cd.DataFrame(
    {"dates": dates, "days_for_next_app": [12, 7, 20, 60]}, ctype={"dates": "date"}
)

tab["next_app"] = tab["dates"] + tab["days_for_next_app"]
tab["day_diff"] = tab["next_app"] - tab["dates"]

>>> tab.open()
dates  days_for_next_app   next_app  day_diff
0 1991-02-04                 12 1991-02-16        12
1 2023-08-30                  7 2023-09-06         7
2 2023-09-06                 20 2023-09-26        20
3 2016-03-14                 60 2016-05-13        60

Of course, some operations are not allowed, like adding two dates together or subtracting a date from an integer.

We can also add an integer directly or a python date:

>>> (tab["dates"]+30).open()
0 1991-03-06
1 2023-09-29
2 2023-10-06
3 2016-04-13

>>> (pd.to_datetime("20230906").date() - tab["dates"]).open()
0  11902
1      7
2      0
3   2732

In crandas, we can only add and subtract days. It is not possible to add months or years.

Warning

Calculations are currently only supported for date columns, not datetime columns. To extract a date column from a datetime column, use tab["date"] = tab["datetimes"].dt.date and perform the calculations on the resulting date column.

After learning about datetimes, we will see what happens when we are working with missing data in the next section.