.. _nullable:

Working with missing data
##########################

Sometimes data has missing values but we still need to work with it.
Thankfully, crandas allows us to do this.

Uploading data with missing values
----------------------------------

crandas supports uploading data with missing values automatically.
In some cases, customization is needed by manually setting a column type to nullable using the :ref:`ctypes` system.

For illustration, assume we have a CSV file called ``test.csv`` with the following data:

.. code::

    name,age,city
    Alice,25,
    Bob,,New York
    Charlie,30,Los Angeles

crandas can upload this data, but customization is probably desired here:

.. code:: python

    >>> cd.read_csv("test.csv")
    Handle: ...
    Size: 3 rows x 3 columns
    CIndex([Col("name", "s", 8), Col("age", "fp", 1, nullable=True), Col("city", "s", 12, nullable=True)])

As can be seen from the output, the ``name`` column is interpreted as a text column that does not allow missing values.
This is probably as desired.

The ``age`` column gets interpreted as a fixed-point column because of its missing values.
Since fixed point columns currently support only a limited number of operations, this is undesirable.
So, in this case, we need to manually set the ctype.
For ``age`` an appropriate ctype would be ``uint8?`` -- this is shorthand for ``int[nullable,min=0,max=255]``.

.. note::

    The ``?`` after the data type indicates that there are missing/null values in that column.

>>> people = cd.read_csv("test.csv", ctype={"age": "uint8?"})
>>> people
Name: 618568F443ED8F6E0EC2B4FB225C628BA75781B2834854F40B8FA687BCA3E9EA
Size: 3 rows x 3 columns
CIndex([Col("name", "s", 8), Col("age", "i", 1, nullable=True), Col("city", "s", 12, nullable=True)])

.. TODO: When #585 is fixed this must be changed

Note that, because the ``city`` column in the CSV contains empty values, this column is interpreted as a text column
that allows missing values (``nullable=True``), similarly to what happens in pandas. If this is not desired, the
missing values can be eliminated at upload time by loading the CSV using pandas and uploading the pandas dataframe, e.g.:

>>> df = pd.read_csv("test.csv").fillna({"city": ""})
>>> cdf = cd.upload_pandas_dataframe(df)

It is also possible replace the missing values in crandas, see the example of :meth:`.CSeries.fillna` below.

Uploading Python null values
-----------------------------
In Python and pandas, null values (i.e. missing values) can be represented in several ways.
Python usually uses ``None`` to represent null, but pandas also uses ``pd.NA`` and ``np.nan``.
Because these values work in slightly different ways, it is important to be careful with them.
In general, when uploading numeric values null values should always be represented by the pandas value ``pd.NA``.
Alternative values, like ``None``, might result in an error due to a wrong ``pd.Series`` type.
This is because pandas treats columns with both numeric values and ``None`` as floating point numbers rather than integers.
Therefore it is important to use the right null values.
Similarly, date columns should use ``pd.NaT``.

Missing string values
----------------------

.. versionadded:: 1.8
    The engine also supports a nullable string type, as in the example above.

.. warning::

   When reading a csv, pandas automatically interprets various strings, such as ``NA`` and ``N/A``, as being missing values.
   See the documentation for pandas' ``read_csv`` function, in particular the ``na_values`` option.

   If you are uploading data through crandas and it contains such a string, be aware that it will be interpreted as a missing
   value and make the column a nullable column. To avoid this, you can read the CSV using ``pandas.read_csv`` using the
   ``keep_default_na`` option (e.g., ``df = pd.read_csv("test.csv", keep_default_na=False)``), and then upload the pandas
   dataframe using crandas (e.g., ``cd.upload_pandas_dataframe(df)``).


Using columns with missing values
---------------------------------

Row-wise operations on columns, such as the addition of columns, the :meth:`.CSeries.if_else` function, and others, can be
applied to columns with missing values. In such cases, whenever one of the inputs is missing, then the output will be
considered missing as well, e.g.:

.. code:: python

    import crandas as cd
    import pandas as pd

    cdf = cd.DataFrame(
        {
            "guard": pd.Series([1,     0,   pd.NA], dtype="Int64"),
            "ifval":           ["a",   "b", "cc"],
            "elseval":         [pd.NA, "b", "Cc"],
        }, auto_bounds=True
    )


>>> cdf.assign(res=lambda x: x.guard.if_else(x.ifval, x.elseval)).open()
   guard ifval elseval   res
0      1     a    <NA>  <NA>
1      0     b       b     b
2   <NA>    cc      Cc  <NA>


The same also applies for comparison operators ``==`` and ``!=``: if one of the inputs is missing, then the output is
also considered missing. 

.. note::
    The way the comparison operators work is the same as pandas behavior for ``Int64`` columns but different from pandas behavior for most other column types (where, by convention, pandas considers missing values to be unequal to anything else).

Various other types of operations can also be applied to missing values, but some limitations apply. In particular, it
is possible to use columns with missing values in a join, but *not as the join column*. It is also not possible to perform
a :ref:`group-by<groupby>` on a column with missing values. To apply such operations, replace the missing values as discussed below.

.. attention::

    The functions that do not work with nullable columns are the following;

    * Join (only on join columns) :func:`.merge`
    * Group by (including aggregations) :meth:`.CDataFrame.groupby`
    * Filter (columns used in filtering)
    * Regressions (when fitting a model but not predicting) e.g. :meth:`.crlearn.logistic_regression.LogisticRegression.fit`
    * Normalization :func:`.crlearn.utils.min_max_normalize`


Finding missing values
-------------------------
Once we have uploaded the data, we want to ensure that the data we are accessing is not missing.
For that, we have two methods to check whether values of a column in the :class:`.CDataFrame` are missing:

- :meth:`.CSeries.notna`
- :meth:`.CSeries.isna`

One is the negation of the other and they output a column (a :class:`.CSeriesFun`) with zeroes and ones representing whether the values are null or not.

.. code:: python

    # This returns a one when the respective values are NULL, zero otherwise
    missing_age = data["age"].isna()

    # Returns a one when the respective values are not NULL
    not_missing_age = data["age"].notna()


Replacing or deleting missing values
-------------------------------------

We can convert a nullable column to a non-nullable column by replacing missing values by a fixed value using :meth:`.CSeries.fillna`.

.. code:: python

    # Use above example
    cdf = cd.read_csv("test.csv", ctype={"age":'int?'})

    # Replace empty ages by 0; empty cities by Eindhoven
    cdf = cdf.assign(age=lambda x: x.age.fillna(0), city=lambda x: x.city.fillna("Eindhoven"))

Sometimes there is no natural value that we can use to replace missing values. For example, when doing a regression, using a filler value will often introduce bias to the model.
In that case we might want to get rid of all rows that have missing values.

To do that, we can use the :meth:`.CDataFrame.dropna` method. This will delete all rows where any column holds a null value.
Additionally, it will make all columns non-nullable.

.. code:: python

    # Using example from the top
    cdf = cd.read_csv("test.csv", ctype={"age":'int?'})

    # Delete all columns with a null value
    cdf_nonull = cdf.dropna()

    # Only the last row, ["Charlie", 30, "Los Angeles"], will be left

.. note::

    In pandas, the ``dropna()`` method can be specified to delete either the rows or the columns with null values. Because the columns are not considered private information in the engine, crandas only supports deleting rows with null values. To remove columns with null values, you could manually check (and open) which columns contain null values, and use :meth:`.CDataFrame.project` to delete them.

.. versionadded:: 1.10 
    :meth:`.CDataFrame.dropna` to delete rows that have null values.

In the :ref:`next section<categorical>`, we will present a way to deal with categorical data in crandas.