.. _nullable: Working with missing data ########################## Sometimes data has missing values but we still need to work with it. Thankfully, crandas allows us to do this. Uploading data with missing values ---------------------------------- crandas supports uploading data with missing values automatically. In some cases, customization is needed by manually setting a column type to nullable using the :ref:`ctypes` system. For illustration, assume we have a CSV file called ``test.csv`` with the following data: .. code:: name,age,city Alice,25, Bob,,New York Charlie,30,Los Angeles crandas can upload this data, but customization is probably desired here: .. code:: python >>> cd.read_csv("test.csv") Handle: ... Size: 3 rows x 3 columns CIndex([Col("name", "s", 8), Col("age", "fp", 1, nullable=True), Col("city", "s", 12, nullable=True)]) As can be seen from the output, the ``name`` column is interpreted as a text column that does not allow missing values. This is probably as desired. The ``age`` column gets interpreted as a fixed-point column because of its missing values. Since fixed point columns currently support only a limited number of operations, this is undesirable. So, in this case, we need to manually set the ctype. For ``age`` an appropriate ctype would be ``uint8?`` -- this is shorthand for ``int[nullable,min=0,max=255]``. .. note:: The ``?`` after the data type indicates that there are missing/null values in that column. >>> people = cd.read_csv("test.csv", ctype={"age": "uint8?"}) >>> people Name: 618568F443ED8F6E0EC2B4FB225C628BA75781B2834854F40B8FA687BCA3E9EA Size: 3 rows x 3 columns CIndex([Col("name", "s", 8), Col("age", "i", 1, nullable=True), Col("city", "s", 12, nullable=True)]) .. TODO: When #585 is fixed this must be changed Note that, because the ``city`` column in the CSV contains empty values, this column is interpreted as a text column that allows missing values (``nullable=True``), similarly to what happens in pandas. If this is not desired, the missing values can be eliminated at upload time by loading the CSV using pandas and uploading the pandas dataframe, e.g.: >>> df = pd.read_csv("test.csv").fillna({"city": ""}) >>> cdf = cd.upload_pandas_dataframe(df) It is also possible replace the missing values in crandas, see the example of :meth:`.CSeries.fillna` below. Uploading Python null values ----------------------------- In Python and pandas, null values (i.e. missing values) can be represented in several ways. Python usually uses ``None`` to represent null, but pandas also uses ``pd.NA`` and ``np.nan``. Because these values work in slightly different ways, it is important to be careful with them. In general, when uploading numeric values null values should always be represented by the pandas value ``pd.NA``. Alternative values, like ``None``, might result in an error due to a wrong ``pd.Series`` type. This is because pandas treats columns with both numeric values and ``None`` as floating point numbers rather than integers. Therefore it is important to use the right null values. Similarly, date columns should use ``pd.NaT``. Missing string values ---------------------- .. versionadded:: 1.8 The engine also supports a nullable string type, as in the example above. .. warning:: When reading a csv, pandas automatically interprets various strings, such as ``NA`` and ``N/A``, as being missing values. See the documentation for pandas' ``read_csv`` function, in particular the ``na_values`` option. If you are uploading data through crandas and it contains such a string, be aware that it will be interpreted as a missing value and make the column a nullable column. To avoid this, you can read the CSV using ``pandas.read_csv`` using the ``keep_default_na`` option (e.g., ``df = pd.read_csv("test.csv", keep_default_na=False)``), and then upload the pandas dataframe using crandas (e.g., ``cd.upload_pandas_dataframe(df)``). Using columns with missing values --------------------------------- Row-wise operations on columns, such as the addition of columns, the :meth:`.CSeries.if_else` function, and others, can be applied to columns with missing values. In such cases, whenever one of the inputs is missing, then the output will be considered missing as well, e.g.: .. code:: python import crandas as cd import pandas as pd cdf = cd.DataFrame( { "guard": pd.Series([1, 0, pd.NA], dtype="Int64"), "ifval": ["a", "b", "cc"], "elseval": [pd.NA, "b", "Cc"], }, auto_bounds=True ) >>> cdf.assign(res=lambda x: x.guard.if_else(x.ifval, x.elseval)).open() guard ifval elseval res 0 1 a 1 0 b b b 2 cc Cc The same also applies for comparison operators ``==`` and ``!=``: if one of the inputs is missing, then the output is also considered missing. .. note:: The way the comparison operators work is the same as pandas behavior for ``Int64`` columns but different from pandas behavior for most other column types (where, by convention, pandas considers missing values to be unequal to anything else). Various other types of operations can also be applied to missing values, but some limitations apply. In particular, it is possible to use columns with missing values in a join, but *not as the join column*. It is also not possible to perform a :ref:`group-by` on a column with missing values. To apply such operations, replace the missing values as discussed below. .. attention:: The functions that do not work with nullable columns are the following; * Join (only on join columns) :func:`.merge` * Group by (including aggregations) :meth:`.CDataFrame.groupby` * Filter (columns used in filtering) * Regressions (when fitting a model but not predicting) e.g. :meth:`.crlearn.logistic_regression.LogisticRegression.fit` * Normalization :func:`.crlearn.utils.min_max_normalize` Finding missing values ------------------------- Once we have uploaded the data, we want to ensure that the data we are accessing is not missing. For that, we have two methods to check whether values of a column in the :class:`.CDataFrame` are missing: - :meth:`.CSeries.notna` - :meth:`.CSeries.isna` One is the negation of the other and they output a column (a :class:`.CSeriesFun`) with zeroes and ones representing whether the values are null or not. .. code:: python # This returns a one when the respective values are NULL, zero otherwise missing_age = data["age"].isna() # Returns a one when the respective values are not NULL not_missing_age = data["age"].notna() Replacing or deleting missing values ------------------------------------- We can convert a nullable column to a non-nullable column by replacing missing values by a fixed value using :meth:`.CSeries.fillna`. .. code:: python # Use above example cdf = cd.read_csv("test.csv", ctype={"age":'int?'}) # Replace empty ages by 0; empty cities by Eindhoven cdf = cdf.assign(age=lambda x: x.age.fillna(0), city=lambda x: x.city.fillna("Eindhoven")) Sometimes there is no natural value that we can use to replace missing values. For example, when doing a regression, using a filler value will often introduce bias to the model. In that case we might want to get rid of all rows that have missing values. To do that, we can use the :meth:`.CDataFrame.dropna` method. This will delete all rows where any column holds a null value. Additionally, it will make all columns non-nullable. .. code:: python # Using example from the top cdf = cd.read_csv("test.csv", ctype={"age":'int?'}) # Delete all columns with a null value cdf_nonull = cdf.dropna() # Only the last row, ["Charlie", 30, "Los Angeles"], will be left .. note:: In pandas, the ``dropna()`` method can be specified to delete either the rows or the columns with null values. Because the columns are not considered private information in the engine, crandas only supports deleting rows with null values. To remove columns with null values, you could manually check (and open) which columns contain null values, and use :meth:`.CDataFrame.project` to delete them. .. versionadded:: 1.10 :meth:`.CDataFrame.dropna` to delete rows that have null values. In the :ref:`next section`, we will present a way to deal with categorical data in crandas.