.. _datatypes:

Data types
##########

This section provides an overview of which data types used by the engine and how we can convert between data types using crandas.

The supported data types include:

- ``int8``, ``int16``, ``int24``, ``int32``, ``int40``, ``int48``, ``int56``, ``int64``, ``int72``, ``int80``, ``int88``, ``int96``
- ``fp8``, ``fp16``, ``fp24``, ``fp32``, ``fp40``, ``fp48``, ``fp56``, ``fp64``, ``fp72``, ``fp80``, ``fp88``, ``fp96``
- ``uint8``, ``uint16``, ``uint24``, ``uint32``, ``uint40``, ``uint48``, ``uint56``, ``uint64``, ``uint72``, ``uint80``, ``uint88``, ``uint96``
- ``vec_int`` (vector of integers)
- ``varchar`` (text)
- ``date``
- ``bytes`` (binary data)
- ``bool`` (as integer)

.. versionadded:: 1.10

    Fixed-point numbers from ``fp40`` and bigger

.. versionadded:: 1.8

    Integers from ``int40`` and bigger and dates.

.. versionadded:: 1.7

    Fixed-point numbers

For specific details about numeric values, see the :ref:`next section<numeric>`.

For each data type, also a "nullable" variant is supported that can hold missing values.
Nullable data data types are denoted with a question mark, e.g., ``int8?``, ``bytes??``.
Nullable data types have some caveats; :ref:`read on<null_values>` or go to :ref:`their own section<nullable>` for details.

.. warning::

    When boolean values are uploaded into the engine, they are transformed to integers and therefore take the values ``0`` and ``1`` instead of ``False`` and ``True`` respectively.


:class:`.CDataFrame` allow for missing values, but not by default, so they must be specified.

Data Type Conversion
=====================

Crandas provides a method for converting data types using the :meth:`.CSeries.astype` method. In the following example, we will show you how to convert a column of strings to a column of integers.

.. note::

    This is the only type conversion currently supported.

.. code:: python

    import crandas as cd

    # Create a crandas DataFrame with a string column
    uploaded = cd.DataFrame({"vals": ["1","2","3","4"]})

    # Convert the string column to a int column
    uploaded = uploaded.assign(vals2=uploaded["vals"].astype(int))


The above example converts the string column ``vals`` to a new integer column called ``vals2``.

For a more in depth look at specifying integer types, go to the :ref:`next section<numeric>`.

.. note::

    It is also possible to specify the desired type while uploading the data, using ``ctype={"val": "varchar[9]"}``.


.. _ctypes_section:

ctypes
======

Because the engine uses highly specialized algorithms to compute on secret data, it uses a specialized typing system that is more fine-grained than what pandas uses.
Crandas implements this type system, which we call :ref:`ctypes<Ctypes>` (similarly to pandas' ``dtypes``).
In certain situations, it is important to specify the specific type of our data

.. code:: python

    import pandas as pd
    from crandas import ctypes

    # Specify data types for the DataFrame
    table = cd.DataFrame(
        {"ints": [1, 2, 3], "strings": ["a", "bb", "ccc"]},
        ctype={"ints": "int8", "strings": "varchar[5]"},
    )

or alternatively:

.. code:: python

    (...)
    # Specify data types for the DataFrame
    table = cd.DataFrame(
        {"ints": cd.Series([1, 2, 3], ctype="int8"),
        "strings": cd.Series(["a", "bb", "ccc"], ctype="varchar[5]")}
    )

In the above example, we define the ``ints`` column with a ``NonNullableInteger`` data type (:meth:`crandas.ctypes.NonNullableInteger`), and the ``strings`` column is defined with a ``varchar[5]`` data type (string with at most 5 characters).

.. hint::

    If there are missing/null values in the column that can be specified by adding ``?`` after the ctype (e.g. ``int8?``, ``varchar?[5]``)

Crandas also supports other data types, such as byte arrays:

.. code:: python

    from uuid import uuid4

    # Create a DataFrame with UUIDs stored as bytes
    df = cd.DataFrame({"uuids": [uuid4().bytes for _ in range(5)]}, ctype="bytes")

You are also able to specify types through pandas' typing, known as dtypes. Note that not all dtypes have an equivalent ctypes.

.. code:: python

    # Create a DataFrame with multiple data types
    df = cd.DataFrame(
        {
            "strings": pd.Series(["test", "hoi", "ok"], dtype="string"),
            "int": pd.Series([1, 2, 3], dtype="int"),
            "int64": pd.Series([23, 11, 91238], dtype="int64"),
            "int32": pd.Series([12831, 1231, -1231], dtype="int32"),
        }
    )


It is possible to retrieve the ctype of a crandas DataFrame or a column by using its ``.ctype`` attribute
(see :meth:`crandas.crandas.CDataFrame.ctype`, :meth:`crandas.crandas.CSeriesColRef.ctype`,
:meth:`crandas.crandas.Col.ctype`), for example:

.. code:: python

    > cdf=cd.DataFrame({"a": [1], "b": "11"})
    > print(cdf.ctype)
    {'a': 'int[min=0,max=255]', 'b': 'varchar[2,ascii]'}
    > print(cdf["a"].ctype)
    int[min=0,max=255]


.. _ctypes_schemas:

Using ctypes and schemas
========================

As shown above, the ``ctype`` argument can be used to specify data types of individual
columns for upload functions such as :meth:`crandas.crandas.DataFrame`.
Note that this ctype includes metadata provided by the user at upload as well as
metadata (such as bounds; see :ref:`type_detection`) derived by the engine.

Instead of this, it is also possible to specify the data types of all columns at the
same time, by using the ``schema`` argument. The schema specifies the order, names, and
types of all columns to be uploaded, but it does not contain engine-derived metadata.

.. note::
    The ``ctype`` and ``schema`` arguments are mutually exclusive. Although they take
    values of the same type, i.e. both are specified using a dictionary of a column name
    mapping to a :class:`crandas.ctype.Ctype`, their use is different. A schema is a
    "full blueprint" containing an ordered dictionary of column names and types. The
    ``ctype`` argument is used when uploading a table that does not need to conform to
    a particular layout, but the user would like to specify some additional type information.
    Both allow mapping to a :class:`crandas.ctype.Ctype` that does not have fully specified
    bounds: for example, in both the ``ctype`` and ``schema`` arguments a particular column
    can be specified as either ``int`` or ``int16``. Obtaining the schema for an existing
    CDataFrame does however **currently not specify any bounds**, e.g. it will always return
    ``int`` even if the column was specified as ``int16``; see the known limitation described below.

For example, the following specifies that the given CSV will be uploaded with the given
column names in the specified order and with the specified types:

.. code:: python

    >>> cd.read_csv("titanic_scaled.csv", auto_bounds=True, schema={'Survived': 'int',
    'Pclass': 'fp', 'Sex': 'int', 'Age': 'fp'})

When using script recording, schemas can be used to ensure that there is an exact correspondence
between the (dummy) data when recording a script, and the (production) data used when
executing the script. To this end, in both environments, the same ``schema`` needs to
be specified as argument to upload functions (``cdf=cd.DataFrame(..., schema=...)``;
``cdf=cd.read_csv(..., schema=...)``; ``cdf=cd.upload_pandas_dataframe(..., schema=...)``;
etc), and/or to ``cd.get_table(..., schema=...)``.

To learn the schema of data available in pandas format without actually already uploading
the data, the functions :meth:`crandas.crandas.pandas_dataframe_schema` and
:meth:`crandas.crandas.read_csv_schema` can be used, for example:

.. code:: python

    >>> cd.read_csv_schema("titanic_scaled.csv", auto_bounds=True, ctype={"Survived": "int"})
    {'Survived': 'int', 'Pclass': 'fp', 'Sex': 'int', 'Age': 'fp'}

It is also possible to retrieve the schema of an existing CDataFrame by using its ``.schema``
attribute (see :meth:`crandas.crandas.CDataFrame.schema`,
:meth:`crandas.crandas.CSeriesColRef.schema`, :meth:`crandas.crandas.Col.schema`). In this case,
currently, a known limitation is that, when retrieving a CDataFrame from the engine, the
fine-grained information from the original schema (for example, bit lengths) is lost.
Consider for example:

.. code:: python

    >>> cdf = cd.DataFrame({"a": [1], "b": "11"}, ctype={"a": "int16"})
    >>> print(cdf.schema)
    {'a': 'int16', 'b': 'varchar'}
    >>> cdf = cd.get_table(cdf.handle)
    >>> print(cdf.schema) # the fact that `a` is a 16-bit number is lost
    {'a': 'int', 'b': 'varchar'}

However, whether a column is nullable is part of the schema. For example:

.. code:: python

    >>> cdf = cd.DataFrame({"a": [1, pd.NA], "b": "11"}, ctype={"a": "int16?"})
    >>> print(cdf.schema)
    {'a': 'int16?', 'b': 'varchar'}
    >>> cdf = cd.get_table(cdf.handle)
    >>> print(cdf.schema)
    {'a': 'int?', 'b': 'varchar'}

Given a schema, it is also possible to create a zero-row crandas dataframe matching the schema
by using ``cdf = cd.DataFrame(schema=...)`` (without supplying any data).

It is also possible to manually create dummy data that adheres to a given schema, by creating
columns that have ctypes corresponding to the respective schema ctypes. For example, given a
schema ``{'a': 'int', 'b': 'varchar'}``, the following creates a table of dummy data
that matches the schema:

.. code:: python

    dummy_data = cd.DataFrame({
        "a": cd.Series([], ctype="int"),
        "b": cd.Series([], ctype="varchar"),
    })

The dummy data can easily be instantiated by replacing the ``[]`` with actual values.

Finally, to convert an already existing crandas dataframe to a given schema (if this conversion is
possible), ``cdf.astype(schema=...)`` can be used (see :meth:`crandas.crandas.CDataFrame.astype`).


.. _type_detection:

Type detection from data
========================

For some types of column, crandas can derive the exact column type from the data being uploaded.
This may lead to ``ColumnBoundDerivedWarning`` warnings, e.g.:

.. code:: text

    ColumnBoundDerivedWarning: Column type for column a (uint8) was automatically
    derived from the input data, see User Guide->Data types->Automatic type detection
    for more information.

This warning is given if no exact ctype (e.g., ``uint8``, ``varchar[ascii]``) is given for the column.

For example, for an integer column, if no size is given, the smallest integer type is selected
that is valid for the column, e.g.:

.. code:: python

    cdf = cd.DataFrame({"vals": [1,2,3]})

In this example, the column ``vals`` will be derived to be of type ``uint8`` (unsigned 8-bit integer)
because all values lie in the range from 0 to 255 (inclusive). See :ref:`int_type_detection`.

Similarly, for varchar (text) columns, it will be derived from the data whether the column is of
ASCII or unicode type. See :ref:`string_ascii`.

Note that, because the column type is derived from the data, this potentially leaks information.
For example, if one party uploads a list of salaries that are all smaller than ``65536`` and another
party uploads a list of salaries that contains higher salaries, then the first column will have
column type ``uint16`` and the second column will have column ``uint24``. The first party can prevent
this by explicitly assigning the ctype ``uint24`` to the uploaded column.

When type information is detected from data, the user gets a warning about this, e.g.,
``ColumnBoundDerivedWarning: Column "vals" was automatically derived to be of type uint8``.
These warnings can be suppressed with the standard Python mechanism, e.g.,

.. code:: python

    import warnings

    # Suppress warnings about automatically derived column types
    warnings.simplefilter("ignore", category=cd.ctypes.ColumnBoundDerivedWarning)

    cdf = cd.DataFrame({"vals": [1,2,3]})

It is also possible to provide ``auto_bounds=True`` as argument to data uploading functions (see :ref:`queryargs`, or to set
the configuration setting (see :mod:`crandas.config`) ``auto_bounds`` to ``True``:

.. code:: python

    import crandas as cd
    import crandas.config

    # Suppress ColumnBoundDerivedWarning globally...
    cd.config.settings.auto_bounds = True

    # or for a single upload
    cdf = cd.DataFrame({"vals": [1,2,3]}, auto_bounds=True)


.. _null_values:

Working with missing values
=============================

Crandas can work with null values, although this requires extra care. Columns do not allow null values by default but this can be achieved in multiple ways. Whenever a column with missing values is added, the engine will determine that such column can have null values. Additionally, it is possible to specify that a column will allow null values when uploading it, even if the column currently does not contain any such values.

.. warning::

    When using a column with missing values in combination with script signing, it is advisable to explicitly specify that the column allows null values, by defining the ``ctype`` as discussed below.
    This way, there will not be a mismatch between the approved analysis and the performed analysis, even if the dummy or actual data does not contain nullable values.
    For example, the following code designates the column ``ints`` as allowing missing values, even if none of the uploaded values are missing.

.. code:: python

    from crandas import ctypes


    table = cd.DataFrame(
        {"ints": [1, 2, 3], "strings": ["a", "bb", None]},
        ctype={"ints": "int32?"},
    )

.. hint::

    When specifying ctypes for columns with missing values you can use ``int32?`` or any other supported data type with ``?`` (this indicates missing values).

Both columns created in this example allow for null values. The first one because it was explictly specified and the latter because it contains a null value.
Crandas considers the same values to be null as pandas; in particular, this includes ``None``, ``pandas.NA``, and ``numpy.nan``.

To turn a nullable column into a non-nullable one, the ``fillna`` function can be used. For example, the following code example replaces all missing values of a string column by the string ``empty``:

.. code:: python

    import crandas as cd

    cdf = cd.DataFrame({"a": ["test", None]})
    cdf = cdf.assign(a=lambda x: x.a.fillna("empty"))

Numeric types have additional particularities that are important to know, both in the typing system and because we can do arithmetic operations over them. The next section deals with these types.