.. _nullable:

Working with missing data
##########################

Sometimes data has missing values but we still need to work with it.
Thankfully, crandas allows us to do this.

.. versionadded:: 1.8

    Nullable columns for all data types.

Uploading data with missing values
----------------------------------

crandas supports uploading data with missing values automatically.
In some cases, customization is needed by manually setting a column type to nullable using the :ref:`ctypes` system.

For illustration, assume we have a CSV file called ``test.csv`` with the following data:

.. code::

    name,age,city
    Alice,25,
    Bob,,New York
    Charlie,30,Los Angeles

crandas can upload this data, but customization is probably desired here:

.. code:: python

    >>> cd.read_csv("test.csv")
    Handle: ...
    Size: 3 rows x 3 columns
    CIndex([Col("name", "s", 8), Col("age", "fp", 1, nullable=True), Col("city", "s", 12, nullable=True)])

As can be seen from the output, the ``name`` column is interpreted as a text column that does not allow missing values.
This is probably as desired.

The ``age`` column gets interpreted as a fixed-point column because of its missing values.
Since fixed point columns currently support only a limited number of operations, this is undesirable.
So, in this case, we need to manually set the ctype.
For ``age`` an appropriate ctype would be ``uint8?`` -- this is shorthand for ``int[nullable,min=0,max=255]``.

.. note::

    The ``?`` after the data type indicates that there are missing/null values in that column.

>>> people = cd.read_csv("test.csv", ctype={"age": "uint8?"})
>>> people
Name: 618568F443ED8F6E0EC2B4FB225C628BA75781B2834854F40B8FA687BCA3E9EA
Size: 3 rows x 3 columns
CIndex([Col("name", "s", 8), Col("age", "i", 1, nullable=True), Col("city", "s", 12, nullable=True)])

.. TODO: When #585 is fixed this must be changed

Note that, because the ``city`` column in the CSV contains empty values, this column is interpreted as a text column
that allows missing values (``nullable=True``), similarly to what happens in pandas. If this is not desired, the
missing values can be eliminated at upload time by loading the CSV using pandas and uploading the pandas dataframe, e.g.:

>>> df = pd.read_csv("test.csv").fillna({"city": ""})
>>> cdf = cd.upload_pandas_dataframe(df)

It is also possible replace the missing values in crandas, see the example of :meth:`.CSeries.fillna` below.

Missing string values
----------------------

.. versionadded:: 1.8
    The VDL also supports a nullable string type, as in the example above.

.. warning::

   When reading a csv, pandas automatically interprets various strings, such as ``NA`` and ``N/A``, as being missing values.
   See the documentation for pandas' ``read_csv`` function, in particular the ``na_values`` option.

   If you are uploading data through crandas and it contains such a string, be aware that it will be interpreted as a missing
   value and make the column a nullable column. To avoid this, you can read the CSV using ``pandas.read_csv`` using the
   ``keep_default_na`` option (e.g., ``df = pd.read_csv("test.csv", keep_default_na=False)``), and then upload the pandas
   dataframe using crandas (e.g., ``cd.upload_pandas_dataframe(df)``).


Using columns with missing values
---------------------------------

Row-wise operations on columns, such as the addition of columns, the :meth:`.CSeries.if_else` function, and others, can be
applied to columns with missing values. In such cases, whenever one of the inputs is missing, then the output will be
considered missing as well, e.g.:

.. code:: python

    import crandas as cd
    import pandas as pd

    cdf = cd.DataFrame(
        {
            "guard": pd.Series([1,     0,   pd.NA], dtype="Int64"),
            "ifval":           ["a",   "b", "cc"],
            "elseval":         [pd.NA, "b", "Cc"],
        }, auto_bounds=True
    )


>>> cdf.assign(res=lambda x: x.guard.if_else(x.ifval, x.elseval)).open()
   guard ifval elseval   res
0      1     a    <NA>  <NA>
1      0     b       b     b
2   <NA>    cc      Cc  <NA>


The same also applies for comparison operators ``==`` and ``!=``: if one of the inputs is missing, then the output is
also considered missing. 

.. note::
    The way the comparison operators work is the same as pandas behavior for ``Int64`` columns but different from pandas behavior for most other column types (where, by convention, pandas considers missing values to be unequal to anything else).

Various other types of operations can also be applied to missing values, but some limitations apply. In particular, it
is possible to use columns with missing values in a join, but *not as the join column*. It is also not possible to perform
a :ref:`group-by<groupby>` on a column with missing values. To apply such operations, replace the missing values as discussed below.


Finding missing values
-------------------------
Once we have uploaded the data, we want to ensure that the data we are accessing is not missing.
For that, we have two methods to check whether values of a column in the :class:`.CDataFrame` are missing:

- :meth:`.CSeries.notna`
- :meth:`.CSeries.isna`

One is the negation of the other and they output a column (a :class:`.CSeriesFun`) with zeroes and ones representing whether the values are null or not.

.. code:: python

    # This returns a one when the respective values are NULL, zero otherwise
    missing_age = data["age"].isna()

    # Returns a one when the respective values are not NULL
    not_missing_age = data["age"].notna()


Replacing missing values
------------------------

We can convert a nullable column to a non-nullable column by replacing missing values by a fixed value using :meth:`.CSeries.fillna`.

.. code:: python

    # Use above example
    cdf = cd.read_csv("test.csv", ctype={"age":'int?'})

    # Replace empty ages by 0; empty cities by Eindhoven
    cdf = cdf.assign(age=lambda x: x.age.fillna(0), city=lambda x: x.city.fillna("Eindhoven"))


In the :ref:`next section<categorical>`, we will present a way to deal with categorical data in crandas.