Working with missing data¶
Sometimes data has missing values but we still need to work with it. Thankfully, crandas allows us to do this.
Added in version 1.8: Nullable columns for all data types.
Uploading data with missing values¶
crandas supports uploading data with missing values automatically. In some cases, customization is needed by manually setting a column type to nullable using the crandas.ctypes system.
For illustration, assume we have a CSV file called test.csv
with the following data:
name,age,city
Alice,25,
Bob,,New York
Charlie,30,Los Angeles
crandas can upload this data, but customization is probably desired here:
>>> cd.read_csv("test.csv")
Handle: ...
Size: 3 rows x 3 columns
CIndex([Col("name", "s", 8), Col("age", "fp", 1, nullable=True), Col("city", "s", 12, nullable=True)])
As can be seen from the output, the name
column is interpreted as a text column that does not allow missing values.
This is probably as desired.
The age
column gets interpreted as a fixed-point column because of its missing values.
Since fixed point columns currently support only a limited number of operations, this is undesirable.
So, in this case, we need to manually set the ctype.
For age
an appropriate ctype would be uint8?
– this is shorthand for int[nullable,min=0,max=255]
.
Note
The ?
after the data type indicates that there are missing/null values in that column.
>>> people = cd.read_csv("test.csv", ctype={"age": "uint8?"})
>>> people
Name: 618568F443ED8F6E0EC2B4FB225C628BA75781B2834854F40B8FA687BCA3E9EA
Size: 3 rows x 3 columns
CIndex([Col("name", "s", 8), Col("age", "i", 1, nullable=True), Col("city", "s", 12, nullable=True)])
Note that, because the city
column in the CSV contains empty values, this column is interpreted as a text column
that allows missing values (nullable=True
), similarly to what happens in pandas. If this is not desired, the
missing values can be eliminated at upload time by loading the CSV using pandas and uploading the pandas dataframe, e.g.:
>>> df = pd.read_csv("test.csv").fillna({"city": ""})
>>> cdf = cd.upload_pandas_dataframe(df)
It is also possible replace the missing values in crandas, see the example of CSeries.fillna()
below.
Missing string values¶
Added in version 1.8: The VDL also supports a nullable string type, as in the example above.
Warning
When reading a csv, pandas automatically interprets various strings, such as NA
and N/A
, as being missing values.
See the documentation for pandas’ read_csv
function, in particular the na_values
option.
If you are uploading data through crandas and it contains such a string, be aware that it will be interpreted as a missing
value and make the column a nullable column. To avoid this, you can read the CSV using pandas.read_csv
using the
keep_default_na
option (e.g., df = pd.read_csv("test.csv", keep_default_na=False)
), and then upload the pandas
dataframe using crandas (e.g., cd.upload_pandas_dataframe(df)
).
Using columns with missing values¶
Row-wise operations on columns, such as the addition of columns, the CSeries.if_else()
function, and others, can be
applied to columns with missing values. In such cases, whenever one of the inputs is missing, then the output will be
considered missing as well, e.g.:
import crandas as cd
import pandas as pd
cdf = cd.DataFrame(
{
"guard": pd.Series([1, 0, pd.NA], dtype="Int64"),
"ifval": ["a", "b", "cc"],
"elseval": [pd.NA, "b", "Cc"],
}, auto_bounds=True
)
>>> cdf.assign(res=lambda x: x.guard.if_else(x.ifval, x.elseval)).open()
guard ifval elseval res
0 1 a <NA> <NA>
1 0 b b b
2 <NA> cc Cc <NA>
The same also applies for comparison operators ==
and !=
: if one of the inputs is missing, then the output is
also considered missing.
Note
The way the comparison operators work is the same as pandas behavior for Int64
columns but different from pandas behavior for most other column types (where, by convention, pandas considers missing values to be unequal to anything else).
Various other types of operations can also be applied to missing values, but some limitations apply. In particular, it is possible to use columns with missing values in a join, but not as the join column. It is also not possible to perform a group-by on a column with missing values. To apply such operations, replace the missing values as discussed below.
Finding missing values¶
Once we have uploaded the data, we want to ensure that the data we are accessing is not missing.
For that, we have two methods to check whether values of a column in the CDataFrame
are missing:
One is the negation of the other and they output a column (a CSeriesFun
) with zeroes and ones representing whether the values are null or not.
# This returns a one when the respective values are NULL, zero otherwise
missing_age = data["age"].isna()
# Returns a one when the respective values are not NULL
not_missing_age = data["age"].notna()
Replacing missing values¶
We can convert a nullable column to a non-nullable column by replacing missing values by a fixed value using CSeries.fillna()
.
# Use above example
cdf = cd.read_csv("test.csv", ctype={"age":'int?'})
# Replace empty ages by 0; empty cities by Eindhoven
cdf = cdf.assign(age=lambda x: x.age.fillna(0), city=lambda x: x.city.fillna("Eindhoven"))
In the next section, we will present a way to deal with categorical data in crandas.