.. _nullable: Working with missing data ########################## Sometimes data has missing values but we still need to work with it. Thankfully, crandas allows us to do this. Uploading data with missing values ---------------------------------- Uploading data with missing values does not always work automatically. The main issue is that by default, data is considered to be always present (not null). To allow columns to have missing values, it should be manually set to nullable using the :ref:`ctypes` system. For illustration, assume we have a CSV file called ``test.csv`` with the following data: .. code:: name,age,city Alice,25, Bob,,New York Charlie,30,Los Angeles The following will not work: .. code:: python >>> cd.read_csv("test.csv") [...] NotImplementedError: FixedPoint ctype does not support nullable columns yet The above exception was the direct cause of the following exception: [...] SeriesEncodingError: Could not encode column age The problem is that the ``age`` column gets interpreted as a fixed-point column, which does not support missing values. We need to manually set the ctypes. For ``age`` an appropriate ctype would be ``uint8?`` -- this is shorthand for ``int[nullable,min=0,max=255]``. .. note:: The ``?`` after the data type indicates that there are missing/null values in that column. >>> people = cd.read_csv("test.csv", ctype={"age": "uint8?"}) >>> people Name: 618568F443ED8F6E0EC2B4FB225C628BA75781B2834854F40B8FA687BCA3E9EA Size: 3 rows x 3 columns CIndex([Col("name", "s", 8), Col("age", "i", 1, nullable=True), Col("city", "s", 12)]) .. TODO: When #585 is fixed this must be changed Missing string values ---------------------- Currently, the VDL does not support a nullable string type. However, in most cases, this is not needed. Often, like in the case of csv files, an empty string is used to represent a missing string entry. .. versionadded:: 1.5 The VDL converts any null value in a string column to an empty string ``""``. While this allows for many null value functionalities, be mindful that certain things works differently than expected. When looking for null values, you must use ``==""`` instead of the :meth:`.CSeries.isna` or :meth:`.CSeries.notna` functions. These functions will output all zeroes and ones respectively, as the data is not seen as null by the VDL. Additionally, when doing an inner join with string data, empty strings `will` be matched together. .. warning:: When reading a csv, pandas automatically assumes that the sting ``"NA"`` represents a missing value in a string column. If you are uploading data through crandas and it contains that string, be aware that it will be converted into ane empty string. Consider modifying it and then using :meth:`.CSeries.if_else` to convert the data back. Finding missing values ------------------------- Once we have uploaded the data, we want to ensure that the data we are accessing is not missing. For that, we have two methods to check whether values of a column in the :class:`.CDataFrame` are missing: - :meth:`.CSeries.notna` - :meth:`.CSeries.isna` One is the negation of the other and they output a column (a :class:`.CSeriesFun`) with zeroes and ones representing whether the values are null or not. .. code:: python # This returns a one when the respective values are NULL, zero otherwise missing_age = data["age"].isna() # Returns a one when the respective values are not NULL not_missing_age = data["age"].notna()