Working with missing data

Sometimes data has missing values but we still need to work with it. Thankfully, crandas allows us to do this.

Uploading data with missing values

Uploading data with missing values does not always work automatically. The main issue is that by default, data is considered to be always present (not null). To allow columns to have missing values, it should be manually set to nullable using the crandas.ctypes system.

For illustration, assume we have a CSV file called test.csv with the following data:

name,age,city
Alice,25,
Bob,,New York
Charlie,30,Los Angeles

The following will not work:

>>> cd.read_csv("test.csv")
[...]
NotImplementedError: FixedPoint ctype does not support nullable columns yet

The above exception was the direct cause of the following exception:
[...]
SeriesEncodingError: Could not encode column age

The problem is that the age column gets interpreted as a fixed-point column, which does not support missing values.

We need to manually set the ctypes. For age an appropriate ctype would be uint8? – this is shorthand for int[nullable,min=0,max=255].

Note

The ? after the data type indicates that there are missing/null values in that column.

>>> people = cd.read_csv("test.csv", ctype={"age": "uint8?"})
>>> people
Name: 618568F443ED8F6E0EC2B4FB225C628BA75781B2834854F40B8FA687BCA3E9EA
Size: 3 rows x 3 columns
CIndex([Col("name", "s", 8), Col("age", "i", 1, nullable=True), Col("city", "s", 12)])

Missing string values

Currently, the VDL does not support a nullable string type. However, in most cases, this is not needed. Often, like in the case of csv files, an empty string is used to represent a missing string entry.

Added in version 1.5: The VDL converts any null value in a string column to an empty string "". While this allows for many null value functionalities, be mindful that certain things works differently than expected.

When looking for null values, you must use =="" instead of the CSeries.isna() or CSeries.notna() functions. These functions will output all zeroes and ones respectively, as the data is not seen as null by the VDL. Additionally, when doing an inner join with string data, empty strings will be matched together.

Warning

When reading a csv, pandas automatically assumes that the sting "NA" represents a missing value in a string column. If you are uploading data through crandas and it contains that string, be aware that it will be converted into ane empty string. Consider modifying it and then using CSeries.if_else() to convert the data back.

Finding missing values

Once we have uploaded the data, we want to ensure that the data we are accessing is not missing. For that, we have two methods to check whether values of a column in the CDataFrame are missing:

One is the negation of the other and they output a column (a CSeriesFun) with zeroes and ones representing whether the values are null or not.

# This returns a one when the respective values are NULL, zero otherwise
missing_age = data["age"].isna()

# Returns a one when the respective values are not NULL
not_missing_age = data["age"].notna()