Working with datetimes
Datetimes can be a very useful datatype to work with. We can use crandas to deal with date and datetime data in a way where datetimes that were uploaded with different formats can still be correctly compared. In this section we will learn how to upload dates and datetimes to the engine and the datetime-specific functions we have that allow us to extract properties from the datetime data that we have.
Uploading Datetimes
When uploading datetimes as strings we have to be more explicit than when
uploading integers or strings. In order for crandas to know that the
column that you are trying to upload contains datetimes, you must specify
the datetime ctype.
import crandas as cd
import pandas as pd
# The datetimes are in ISO 8601 format, but formats accepted by strptime are accepted
datetimes = ["1991-02-04T09:36:55.100",
"2023-08-30T21:15:23.550",
"2023-09-06T12:15:55.830"]
# We need to add the ctype or it won't work
tab = cd.DataFrame({"datetimes" : datetimes}, ctype={"datetimes": "datetime"})
Note
Many standard crandas operations such as the group-by and the join also
work normally for datetime columns (with some exceptions covered here). This includes the aggregates min() and max(), which report the
earliest and latest datetime respectively. On the other hand, aggregates
such as sum() are ill-defined for datetimes and cannot be used.
There are certain data formats that are ambiguous. For example, it is
not clear whether "01/02/2000" refers to the first of February or the
second of January. When using ambiguous formats, you can add the
following format specification to the ctype:
# We need to add the ctype or it won't work
tab = cd.DataFrame(
{"ambiguous_datetimes": ["01/02/2000 15:12:00", "02/04/1991 20:00:55", "03/12/1993 05:30:30"]},
ctype={"ambiguous_datetimes": "datetime[format=%d/%m/%Y %H:%M:%S]"},
)
The accepted options can be found here.
Python has multiple ways to represent datetimes, including datetime.datetime,
pandas.Timestamp and numpy.datetime64. When uploading an existing
DataFrame with one of these representations, the resulting
DataFrames will have the
relevant column as a datetime.
pydatetimes = pd.to_datetime(
["19910204T15:00:00", "2023-08-10T11:00:00", "2023-09-06 21:15:00",
"March 14, 2016 09:15:50", "19010101T10:15:30", "20991231T23:59:59"],
format="mixed",
)
tab = cd.DataFrame({"datetimes": pydatetimes})
In some cases, it is more natural to work with dates rather than datetimes,
for instance if only the date is known. For this purpose, crandas also offers a
dedicated date type. Compared to datetimes, dates in crandas are more efficient to work with, are simpler to handle, and also support arithmetic. When uploading dates as strings, it is necessary to specify the
date ctype.
# The dates are in ISO 8601 format, but formats accepted by strptime are accepted
dates = ["19910204", "20230830", "20230906", "20160314"]
# We need to add the ctype or it won't work
tab = cd.DataFrame({"dates" : dates}, ctype={"dates": "date"})
Similar to datetimes, also the date column allows for a custom format to be
specified as date[format=...].
Warning
Due to technical limitations, crandas can only handle datetimes/dates between 1901-01-01 and 2099-12-31.
Datetime resolution
Crandas allows for datetimes to be stored with configurable resolution. By default, datetimes are stored with nanosecond precision. However, such a high resolution is not necessary for all applications, nor are datetime measurements with such a resolution always available. Crandas allows for a different resolution to be specified through the ctype:
datetimes = ["1991-02-04T09:36:55.492493",
"2023-08-30T21:15:23.593294",
"2023-09-06T12:15:55.149291"]
# By specifying resolution=us, the datetime column will have microsecond resolution
tab = cd.DataFrame({"dt" : datetimes}, ctype={"dt": "datetime[resolution=us]"})
The supported resolutions are s, ms, us and ns. Choosing a coarser
resolution can improve the performance of operations involving datetime columns.
Timezones
In certain applications, it is relevant to keep track of the timezone information associated with a datetime. For this purpose, crandas allows for UTC offsets to be specified for datetimes. The UTC offset is the difference in hours and minutes from UTC.
A datetime column which has UTC offsets is referred to as timezone-aware, and can be created as follows:
datetimes = ["1991-02-04T09:36:55.100+02:00", # UTC+02:00
"2023-08-30T21:15:23.550-03:15", # UTC-03:15
"2023-09-06T12:15:55.830+05:00"] # UTC+05:00
tab = cd.DataFrame({"dt" : datetimes}, ctype={"dt": "datetime"})
A datetime column without such timezone information, such as in the other examples in this user guide, is referred to as timezone-naive.
Warning
Group-by and join operations are only supported for timezone-naive datetimes. This is due to the fact that datetimes with a different UTC offset can nevertheless reflect the same moment in time and would therefore be matched, which could be perceived as unexpected.
Date/time component functions
Given a column of dates/datetimes, we might be interested in which of them happen in the same year or month, or at the same hour. Crandas contains functions that allow the extraction of year, month, day, day of the year, weekday (starting on Monday) along with the hour, minute, second, microsecond, nanosecond component of the datetime.
Following the pandas convention, these can be accessed through tab["datetimes"].dt
as follows. For the date units:
datetimes = ["1991-02-04T09:36:55.100",
"2023-08-30T21:15:23.550",
"2023-09-06T12:15:55.830"]
tab = cd.DataFrame({"datetimes" : datetimes}, ctype={"datetimes": "datetime"})
date_units = tab[:]
date_units["year"] = tab["datetimes"].dt.year
date_units["month"] = tab["datetimes"].dt.month
date_units["day"] = tab["datetimes"].dt.day
date_units["weekday"] = tab["datetimes"].dt.weekday
date_units["day_of_year"] = tab["datetimes"].dt.day_of_year
>>> date_units.open()
datetimes year month day weekday day_of_year
0 1991-02-04 09:36:55.100 1991 2 4 0 35
1 2023-08-30 21:15:23.550 2023 8 30 2 242
2 2023-09-06 12:15:55.830 2023 9 6 2 249
In a similar way, the time units can be accessed as:
time_units = tab[:]
time_units["hour"] = tab["datetimes"].dt.hour
time_units["minute"] = tab["datetimes"].dt.minute
time_units["second"] = tab["datetimes"].dt.second
time_units["microsecond"] = tab["datetimes"].dt.microsecond
time_units["nanosecond"] = tab["datetimes"].dt.nanosecond
>>> time_units.open()
datetimes hour minute second microsecond nanosecond
0 1991-02-04 09:36:55.100 9 36 55 100000 0
1 2023-08-30 21:15:23.550 21 15 23 550000 0
2 2023-09-06 12:15:55.830 12 15 55 830000 0
When using these operations, the resulting column is an integer column and no longer has a datetime format, but acts like any other integer column.
Tip
If the datetime column is timezone-aware, then the returned date/time components reflect the local time.
Note
Following the convention of pandas, the millisecond component is contained
within the microseconds as returned by tab["datetimes"].dt.microsecond.
Therefore, there is no separate millisecond property.
In addition to extracting integer components, it is also possible to extract
the date from a datetime column. In contrast to the other extraction methods,
the result is a crandas date column.
Filtering through datetimes
We can also use crandas to filter over a table, we can choose a specific datetime range or choose any datetime after a particular one. Datetime columns can be treated as any column type, as long as they are compared to python/pandas datetime objects.
# Upload datetimes with millisecond resolution
datetimes = ["1991-02-04T09:36:55.492",
"2023-08-30T21:15:23.593",
"2023-09-06T12:15:55.149"]
tab = cd.DataFrame({"datetimes" : datetimes}, ctype={"datetimes": "datetime[resolution=ms]"})
# Specify datetime filter also with millisecond resolution
filter_datetime = pd.Timestamp("2016-03-14 15:00:00.000")
filtered = tab[tab["datetimes"] >= filter_datetime]
Tip
For comparisons involving datetime columns, using the same datetime resolution is recommended.
If resolutions differ, crandas will emit a DatetimeImpreciseOperationWarning; consider aligning resolutions
(e.g., via tab["datetimes"] = tab["datetimes"].astype("datetime[resolution=ms]")) to avoid ambiguity.
Calculations over dates
We can also use crandas to find how many days happened between two dates or finding the date after a determined amount of days. To do this, we can simply add and subtract columns as normal.
dates = ["19910204", "30/08/23", "2023-09-06", "March 14, 2016"]
tab = cd.DataFrame(
{"dates": dates, "days_for_next_app": [12, 7, 20, 60]}, ctype={"dates": "date"}
)
tab["next_app"] = tab["dates"] + tab["days_for_next_app"]
tab["day_diff"] = tab["next_app"] - tab["dates"]
>>> tab.open()
dates days_for_next_app next_app day_diff
0 1991-02-04 12 1991-02-16 12
1 2023-08-30 7 2023-09-06 7
2 2023-09-06 20 2023-09-26 20
3 2016-03-14 60 2016-05-13 60
Of course, some operations are not allowed, like adding two dates together or subtracting a date from an integer.
We can also add an integer directly or a python date:
In crandas, we can only add and subtract days. It is not possible to add months or years.Warning
Calculations are currently only supported for date columns, not datetime
columns. To extract a date column from a datetime column, use tab["date"] = tab["datetimes"].dt.date and perform the calculations on the resulting date column.
After learning about datetimes, we will see what happens when we are working with missing data in the next section.