Working with missing data¶
Sometimes data has missing values but we still need to work with it. Thankfully, crandas allows us to do this.
Uploading data with missing values¶
crandas supports uploading data with missing values automatically. In some cases, customization is needed by manually setting a column type to nullable using the crandas.ctypes system.
For illustration, assume we have a CSV file called test.csv
with the following data:
name,age,city
Alice,25,
Bob,,New York
Charlie,30,Los Angeles
crandas can upload this data, but customization is probably desired here:
>>> cd.read_csv("test.csv")
Handle: ...
Size: 3 rows x 3 columns
CIndex([Col("name", "s", 8), Col("age", "fp", 1, nullable=True), Col("city", "s", 12, nullable=True)])
As can be seen from the output, the name
column is interpreted as a text column that does not allow missing values.
This is probably as desired.
The age
column gets interpreted as a fixed-point column because of its missing values.
Since fixed point columns are less efficient than integer columns, this is undesirable.
So, in this case, we need to manually set the ctype.
For age
an appropriate ctype would be uint8?
– this is shorthand for int[nullable,min=0,max=255]
.
Warning
Any numerical columns with null values will be automatically converted to fixed-point, consistent with the behaviour of pandas.
If it seems that the column contains only integers, a warning will be presented during upload.
It is recommended to always specify the ctype
in these cases.
Note
The ?
after the data type indicates that there are missing/null values in that column.
>>> people = cd.read_csv("test.csv", ctype={"age": "uint8?"})
>>> people
Name: 618568F443ED8F6E0EC2B4FB225C628BA75781B2834854F40B8FA687BCA3E9EA
Size: 3 rows x 3 columns
CIndex([Col("name", "s", 8), Col("age", "i", 1, nullable=True), Col("city", "s", 12, nullable=True)])
Note that, because the city
column in the CSV contains empty values, this column is interpreted as a text column
that allows missing values (nullable=True
), similarly to what happens in pandas. If this is not desired, the
missing values can be eliminated at upload time by loading the CSV using pandas and uploading the pandas dataframe, e.g.:
>>> df = pd.read_csv("test.csv").fillna({"city": ""})
>>> cdf = cd.upload_pandas_dataframe(df)
It is also possible replace the missing values in crandas, see the example of CSeries.fillna()
below.
Uploading Python null values¶
In Python and pandas, null values (i.e. missing values) can be represented in several ways.
Python usually uses None
to represent null, but pandas also uses pd.NA
and np.nan
.
Because these values work in slightly different ways, it is important to be careful with them.
In general, when uploading numeric values null values should always be represented by the pandas value pd.NA
.
Alternative values, like None
, might result in an error due to a wrong pd.Series
type.
This is because pandas treats columns with both numeric values and None
as floating point numbers rather than integers.
Therefore it is important to use the right null values.
Similarly, date columns should use pd.NaT
.
Missing string values¶
Added in version 1.8: The engine also supports a nullable string type, as in the example above.
Warning
When reading a csv, pandas automatically interprets various strings, such as NA
and N/A
, as being missing values.
See the documentation for pandas’ read_csv
function, in particular the na_values
option.
If you are uploading data through crandas and it contains such a string, be aware that it will be interpreted as a missing
value and make the column a nullable column. To avoid this, you can pass the keep_default_na
option to the crandas
read_csv
function, e.g., df = cd.read_csv("test.csv", keep_default_na=False)
.
Using columns with missing values¶
Row-wise operations on columns, such as the addition of columns, the CSeries.if_else()
function, and others, can be
applied to columns with missing values. In such cases, whenever one of the inputs is missing, then the output will be
considered missing as well, e.g.:
import crandas as cd
import pandas as pd
cdf = cd.DataFrame(
{
"guard": pd.Series([1, 0, pd.NA], dtype="Int64"),
"ifval": ["a", "b", "cc"],
"elseval": [pd.NA, "b", "Cc"],
}, auto_bounds=True
)
>>> cdf.assign(res=lambda x: x.guard.if_else(x.ifval, x.elseval)).open()
guard ifval elseval res
0 1 a <NA> <NA>
1 0 b b b
2 <NA> cc Cc <NA>
The same also applies for comparison operators ==
and !=
: if one of the inputs is missing, then the output is
also considered missing.
Note
The way the comparison operators work is the same as pandas behavior for Int64
columns but different from pandas behavior for most other column types (where, by convention, pandas considers missing values to be unequal to anything else).
Various other types of operations can also be applied to missing values, but some limitations apply. In particular, it is possible to use columns with missing values in a join, but not as the join column. It is also not possible to perform a group-by on a column with missing values. To apply such operations, replace the missing values as discussed below.
Attention
The functions that do not work with nullable columns are the following;
Join (only on join columns)
merge()
Group by (including aggregations)
CDataFrame.groupby()
Min/Max:
CSeries.max()
,CSeries.min()
Filter (columns used in filtering)
Regressions (when fitting a model but not predicting) e.g.
crlearn.logistic_regression.LogisticRegression.fit()
Normalization
crlearn.utils.min_max_normalize()
Finding missing values¶
Once we have uploaded the data, we want to ensure that the data we are accessing is not missing.
For that, we have two methods to check whether values of a column in the CDataFrame
are missing:
One is the negation of the other and they output a column (a CSeriesFun
) with zeroes and ones representing whether the values are null or not.
# This returns a one when the respective values are NULL, zero otherwise
missing_age = data["age"].isna()
# Returns a one when the respective values are not NULL
not_missing_age = data["age"].notna()
Replacing or deleting missing values¶
We can convert a nullable column to a non-nullable column by replacing missing values by a fixed value using CSeries.fillna()
.
# Use above example
cdf = cd.read_csv("test.csv", ctype={"age":'int?'})
# Replace empty ages by 0; empty cities by Eindhoven
cdf = cdf.assign(age=lambda x: x.age.fillna(0), city=lambda x: x.city.fillna("Eindhoven"))
Sometimes there is no natural value that we can use to replace missing values. For example, when doing a regression, using a filler value will often introduce bias to the model. In that case we might want to get rid of all rows that have missing values.
To do that, we can use the CDataFrame.dropna()
method. This will delete all rows where any column holds a null value.
Additionally, it will make all columns non-nullable.
# Using example from the top
cdf = cd.read_csv("test.csv", ctype={"age":'int?'})
# Delete all columns with a null value
cdf_nonull = cdf.dropna()
# Only the last row, ["Charlie", 30, "Los Angeles"], will be left
Note
In pandas, the dropna()
method can be specified to delete either the rows or the columns with null values. Because the columns are not considered private information in the engine, crandas only supports deleting rows with null values. To remove columns with null values, you could manually check (and open) which columns contain null values, and use CDataFrame.project()
to delete them.
Added in version 1.10: CDataFrame.dropna()
to delete rows that have null values.
In the next section, we will present a way to deal with categorical data in crandas.