.. _numeric: Working with numeric data ######################### Crandas offers a wide range of tools for working with numeric data. These tools allow you to perform various operations on columns, such as arithmetic operations and aggregations, to analyze and process your data. In this guide, we will provide you with an introduction to working with these various operations. First, we will familiarize you with the particularities of numerical types in the engine. Integer data types and conversion ================================= Crandas supports signed and unsigned integer numbers with several bit lengths: 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, and 96. For example, a 16-bit unsigned integer is referred to as an ``uint16``; a 24-bit signed integer is referred to as an ``int24``. It is possible to specify the type of integer as a ctype (see :ref:`ctypes_section`), e.g.: .. code:: python import crandas as cd cdf = cd.DataFrame({"vals": [1,2,3]}, ctype={"vals": "int40"}) Crandas does not allow computations to overflow. For example, the following does not work: .. code:: python cdf = cd.DataFrame({"vals": [1,2,3]}, ctype={"vals": "int40"}) cdf = cdf.assign(cube=lambda x: x.vals*x.vals*x.vals) # this overflows In this case, because the column ``vals`` is 40-bit, its cube will be 120-bit, so the above computation fails with the error message ``Integer operation overflow: value does not fit in 96 bits``. .. note:: In many cases, the cube of a 16-int bit can easily fit in 16 bits; clearly 3^3 = 9 can be expressed with 16 bits (it can be expressed with only five). However, as the values in the engine are `secret-shared` it is very costly to preemptively check whether the result of an operation would fit the current type. Therefore, the engine assumes that it cannot fit, in order to prevent overflow errors. In this example, it is possible to convert the column ``vals`` to another integer type so that the computation of the cube can succeed, e.g.: .. code:: python cdf = cd.DataFrame({"vals": [1,2,3]}, ctype={"vals": "int40"}) cdf = cdf.assign(vals=lambda x: x.vals.astype("int8")) cdf = cdf.assign(cube=lambda x: x.vals*x.vals*x.vals) # this works Note that the above conversion does not actually check that the column values lie in the specified range. If this is not the case, then the result of computing on the column is undefined. To check that the conversion succeeds, the following can be used: .. code:: python cdf = cd.DataFrame({"vals": [1,2,3]}, ctype={"vals": "int40"}) cdf = cdf.assign(vals=lambda x: x.vals.astype("int8", validate=True)) cdf = cdf.assign(cube=lambda x: x.vals*x.vals*x.vals) # this works .. _int_type_detection: Automatic type detection for integers ------------------------------------- If no ctype is specified for an integer column, the smallest integer type is selected that is valid for the column, e.g.: .. code:: python cdf = cd.DataFrame({"vals": [1,2,3]}) In this example, the column ``vals`` will be derived to be of type ``uint8`` (unsigned 8-bit integer) because all values lie in the range from 0 to 255 (inclusive). Generally, the server is able to execute operations faster for columns that have a smaller range, e.g., filtering on the values of an ``uint8`` column is faster than filtering on the values of a ``uint24`` column. Because the column type is derived from the data, this potentially leaks information. Crandas warns about this by giving a ``ColumnBoundDerivedWarning``, e.g.: ``ColumnBoundDerivedWarning: Column "vals" was automatically derived to be of type uint8``. For more information, see: :ref:`type_detection`. Fixed-point columns and conversions ----------------------------------- .. versionadded:: 1.10 Fixed-point columns conversions and support for operations (arithmetic, join, filter, groupby, concatenation, cut) .. versionadded:: 1.7 Fixed-point columns The engine also supports the existence of fixed-point columns. Fixed-point is a way to represent non-integer numbers by determining a *fixed* number of digits after the decimal point. This representation is less flexible than *floating-point* numbers, which is the usual way these numbers are represented. However, they are more efficient when doing calculations in MPC. When uploading a fixed-point column, we do not only need to name the bounds explicitly, but we might also want to determine the precision. The ``precision`` parameter determines the number of digits after the decimal point. Default precision is ``20`` bits, which roughly corresponds to *six* decimal places. To specify that we are uploading a fixed-point column, we use the name ``fp`` .. code:: python df = cd.DataFrame({"fixedpoints":[1.2, 0.4, 3]}, ctype={"fixedpoints": "fp[precision=10,min=0,max=3]"}) .. warning:: When uploading fixed-point data without specifying ``precision``, the engine will not check the optimal precision for the data. It will simply truncate the data at the default precision (``20``). Note that when we open the table, the values are not exact, as some rounding is needed to represent the value in this form. >>> df.open() fixedpoints 0 1.200195 1 0.400391 2 3.000000 .. note:: As we have a limited space to represent each number, the higher the precision, the smaller the accepted range for the number. Therefore, the highest precision is not necessarily the best. Fixed-point columns can be converted similar to fixed-point columns of different sizes: .. code:: python cdf = cd.DataFrame({"vals": [1.0,2.0,3.0]}, ctype={"vals": "fp24[precision=10]"}) cdf = cdf.assign(vals=lambda x: x.vals.astype("fp16[precision=10]", validate=True)) cdf = cdf.assign(cube=lambda x: x.vals*x.vals*x.vals) # this works Fixed-point columns can be converted similar to fixed-point columns with higher precision: .. code:: python cdf = cd.DataFrame({"vals": [1.0,2.0,3.0]}, ctype={"vals": "fp24[precision=10]"}) cdf = cdf.assign(vals=lambda x: x.vals.astype("fp24[precision=20]", validate=True)) cdf = cdf.assign(cube=lambda x: x.vals*x.vals*x.vals) # this works but **not** the other way around. That is converting to a lower precision is not supported. Integer columns can be converted to fixed-point columns .. code:: python cdf = cd.DataFrame({"vals": [1,2,3]}, ctype={"vals": "int32"}) cdf = cdf.assign(vals=lambda x: x.vals.astype("fp40[precision=10]", validate=True)) cdf = cdf.assign(cube=lambda x: x.vals*x.vals*x.vals) # this won't work (120 bits required) but fixed-point columns **cannot** be converted to integer types. .. _fp_type_detection: Automatic type detection for fixed-points ----------------------------------------- If no ctype is specified for an fixed-point column, the smallest fixed-point type is selected that is valid for the column, e.g.: .. code:: python cdf = cd.DataFrame({"vals": [1.0,2.0,3.0]}) In this example, the column ``vals`` will be derived to be of type ``fp24[precision=20]`` (24-bit fixed-point using 20 bits for precision) because all values lie in the range from -8 and 8 (exclusive). .. note:: 20 bits are used for the precision, 1 bit is used for the sign, so there are only three bits left for the range bits (the bits left of the decimal point). Consider a second example: .. code:: python cdf = cd.DataFrame({"vals": [1.0,2.0,3.0]}, ctype={"vals": "fp[precision=22"}) In this example, the column ``vals`` will be derived to be of type ``fp32[precision=22]``, because we need more than 24 bits to represent the value ``2.0`` with 22 bits of precision, while reserving 1 sign bit. Boolean columns --------------- As a special case, the ctype ``bool`` will be used for columns that contain the boolean values ``True`` and ``False``, and for columns for which this ctype is explicitly specified: .. code:: python cdf = cd.DataFrame({"col1": [True,True,False], # automatically a bool "col2": [1,0,1]}, # specified using ctype ctype={"col2": "bool"}) Certain operations, such as filtering on a column and using binary operators ``&``, ``|``, ``^``, ``~`` work only on boolean columns. In the example below, crandas implicitly understands that the column is boolean. .. code:: python cdf = cd.DataFrame({"vals": [1,2,3], "filter": [1,0,1]}) cdf = cdf[cdf["filter"]] .. note:: The engine will check whether the column has 0s and 1s, if so it will automatically treat it as a boolean (even if it is not). Inspecting the integer type ---------------------------------- In the current version of crandas, it is not possible to check the integer type directly. The feature is coming soon and will be accessible through ``repr``. .. TODO: Add when issue #552 is fixed Elementwise Operations and Comparisons =========================================== In this section, we will demonstrate how to perform basic arithmetic operations on columns using comparisons in crandas. All of these operations output a :class:`.CSeries` with the result of applying each operation to each row of the columns. .. note:: The columns to compare may be of either or both integer and fixed-point type. .. code:: python df = cd.DataFrame({"a":[1,2,3,4], "b":[5.0,6.0,7.0,8.0]}) # Equal to df_equal_to = df["a"] == df["b"] # Not equal to df_not_equal_to = df["a"] != df["b"] # Less than df_less_than = df["a"] < df["b"] # Less than or equal to df_less_than_equal = df["a"] <= df["b"] # Greater than df_greater_than = df["a"] > df["b"] # Greater than or equal to df_greater_than_equal = df["a"] >= df["b"] # Add df_add = df["a"] + df["b"] # Subtract df_subtract = df["a"] - df["b"] # Multiply df_multiply = df["a"] * df["b"] # Negate df_negate = -df["a"] # Absolute value df_absolute = abs(df["a"]) # Public integer exponentiation df_pow = df["a"] ** 2 # Division df_div = df["a"] / df["b"] # Floor division df_div = df["a"] // df["b"] .. note:: The rounding errors of division depend on the magnitude of the numerator. To limit the error please consider using higher precision to represent the numbers. .. note:: Division will always result in a fixed-point column. The function will raise an exception if it encounters at least one zero in its inputs so be aware that this might leak sensitive information. Beyond these standard common operations, crandas has functions that compares two columns and returns a column with either the maximum or the minimum in each row: .. code:: python # Calculate the minimum between "a" and "b" - returns a series abmin = cd.series_min(df["a"], df["b"]) # Calculate the maximum between "a" and "b" - returns a series abmax = cd.series_max(df["a"], df["b"]) Aggregation operations ========================== In addition to this, crandas provides several aggregation operations, which can be applied to columns of either or both integer and/or fixed-point type. - Sum (:meth:`.CSeriesColRef.sum`) - Mean (:meth:`.CSeriesColRef.mean`) - Variance (:meth:`.CSeriesColRef.var`) - Count (:meth:`.CSeriesColRef.count`) - Maximum (:meth:`.CSeriesColRef.max`) - Minimum (:meth:`.CSeriesColRef.min`) - Sum of squares (:meth:`.CSeriesColRef.sum_squares`) .. code:: python df = cd.DataFrame({"a":[1,2,3,4], "b":[5,6,7,8]}) # Sum of a column a_sum = df["a"].sum() # Mean of a column a_mean = df["a"].mean() # Variance of a column a_var = df["a"].var() # Count number of items of a column (not NULL elements) a_count = df["a"].count() # Maximum of a column a_max = df["a"].max() # Minimum of a column a_min = df["a"].min() # Sum of squares of a column a_sum_of_squares = df["a"].sum_squares() Maximum and minimum can be applied over a whole table, computing the maximum per (numeric) column. .. code:: python # Compute the maximum of each (numeric) column in df df_maximums = df.max() # Compute the minimum of each (numeric) column in df df_minimums = df.min() .. note:: Note the difference between the elementwise functions that compare the minimum/maximum of two columns (:meth:`.CSeriesColRef.max` / :meth:`.CSeriesColRef.min`) and the ones that compute the minimum/maximum of *the elements in one column* (:func:`.crandas.series_min` / :func:`.crandas.series_max`) Working with Vectors ============================ Crandas allows for vector operations on integer vectors. - Element-wise addition - Element-wise product - Inner product (:meth:`.CSeries.inner`) - Vector sum of vector columns (:meth:`.CSeries.vsum`) .. code:: python df = cd.DataFrame({ "col1":[[1,2,3],[3,2,4],[2,3,1]], "col2":[[2,3,3],[4,2,6],[3,3,3]] }) Addition and multiplication work the same way as if we working with scalars: .. code:: python # Addition df["col1"] + df["col2"] # Multiplication df["col1"] * df["col2"] .. TODO: Scalar multiplication #609 Moreover, we also provide vector operations, such as inner product and the sum of the elements of a vector. These two functions (:meth:`.CSeries.inner` and :meth:`.CSeries.vsum`) are implemented as functions of the column, so the notation might be slightly unnatural (in particular for the inner product). .. code:: python # Assign a new column "inner" with the inner product of "col1" and "col2" df = df.assign(inner=df["col1"].inner(df["col2"])) # For each row in col1, add all the elements in that vector df['col1'].vsum() .. hint:: Whenever you want to access the result of a column operation without adding it to a table, use ``as_table()`` like this ``(df["col1"]+df["col2"]).as_table()``. .. warning:: Currently, integer vectors are only supported for integers that fit in ``int32`` or ``uint32``. Now that we know how to work with numeric columns as well as the operations we can do with them, in the next section we will learn how to combine multiple tables. This feature is one of the most important to allow for data sharing, as it will allow us to consolidate data from multiple sources into one.