Working with numeric data
Crandas offers a wide range of tools for working with numeric data. These tools allow you to perform various operations on columns, such as arithmetic operations and aggregations, to analyze and process your data. In this guide, we will provide you with an introduction to working with these various operations. First, we will familiarize you with the particularities of numerical types in the engine.
Integer data types and conversion
Crandas supports signed and unsigned integer numbers with several bit
lengths: 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, and 96. For example,
a 16-bit unsigned integer is referred to as an uint16; a 24-bit signed
integer is referred to as an int24. It is possible to specify the type
of integer as a ctype (see ctypes):
Crandas does not allow computations to overflow. For example, the following does not work:
cdf = cd.DataFrame({"vals": [1,2,3]}, ctype={"vals": "int40"})
cdf["cube"] = cdf["vals"] ** 3 # this overflows
In this case, because the column vals is 40-bit, its cube will be
120-bit, so the above computation fails with the error message
Numeric operation overflow: value does not fit in 96 bits.
Note
In many cases, the cube of a 16-int bit can easily fit in 16 bits; clearly \(3^3 = 9\) can be expressed with 16 bits (it can be expressed with only five). However, as the values in the engine are secret-shared it is very costly to preemptively check whether the result of an operation would fit the current type. Therefore, the engine assumes that it cannot fit, in order to prevent overflow errors.
In this example, it is possible to convert the column vals to another
integer type so that the computation of the cube can succeed:
cdf = cd.DataFrame({"vals": [1,2,3]}, ctype={"vals": "int40"})
cdf["vals"] = cdf["vals"].astype("int8")
cdf["cube"] = cdf["vals"] ** 3 # this works
Note that the above conversion does not actually check that the column values lie in the specified range. If this is not the case, then the result of computing on the column is undefined. To check that the conversion succeeds, the following can be used:
cdf = cd.DataFrame({"vals": [1,2,3]}, ctype={"vals": "int40"})
cdf["vals"] = cdf["vals"].astype("int8", validate=True)
cdf["cube"] = cdf["vals"] ** 3 # this works
Note
Given the bitlength b of an integer, its minimum and maximum values can
be calculated by taking 2^b. For a signed integer, we take one bit for
the sign (positive or negative). For example in a 24-bit integer:
Automatic type detection for integers
If no ctype is specified for an integer column, the smallest integer type is selected that is valid for the column:
In this example, the column vals will be derived to be of type uint8
(unsigned 8-bit integer) because all values lie in the range from \(0\) to
\(255\) (inclusive).
Generally, the server is able to execute operations faster for columns
that have a smaller range, e.g., filtering on the values of an uint8
column is faster than filtering on the values of a uint24 column.
Because the column type is derived from the data, this potentially leaks
information. Crandas warns about this by giving a
ColumnBoundDerivedWarning:
Fixed-point columns and conversions
The engine also supports the existence of fixed-point columns. Fixed-point is a way to represent non-integer numbers by determining a fixed number of digits after the decimal point. This representation is less flexible than floating-point numbers, which is the usual way these numbers are represented. However, they are more efficient when doing calculations in MPC.
When uploading a fixed-point column, we do not only need to name the
bounds explicitly, but we might also want to determine the precision.
The precision parameter determines the number of digits after the
decimal point. Default precision is 20 bits, which roughly corresponds
to six decimal places.
Note
Given the bitlength b of a fixed-point number, its minimum & maximum
values can be calculated using its bitlength and precision.(1) Also,
we take one bit for the sign. For example, for a
32-bit fixed-point number having 20-bit precision:
- Default precision is
20
To specify that we are uploading a fixed-point column, we use the name fp
df = cd.DataFrame(
{"fixedpoints": [1.2, 0.4, 3]},
ctype={"fixedpoints": "fp[precision=10,min=0.4,max=3]"},
)
Warning
When uploading fixed-point data without specifying precision, the
engine will not check the optimal precision for the data. It will simply
truncate the data at the default precision (20).
Note that when we open the table, the values are not exact, as some rounding is needed to represent the value in this form.
Note
As we have a limited space to represent each number, the higher the precision, the smaller the accepted range for the number. Therefore, the highest precision is not necessarily the best.
Fixed-point columns can be converted similar to fixed-point columns of different sizes:
cdf = cd.DataFrame({"vals": [1.0, 2.0, 3.0]}, ctype={"vals": "fp24[precision=10]"})
cdf["vals"] = cdf["vals"].astype("fp16[precision=10]", validate=True)
cdf["cube"] = cdf["vals"] ** 3 # this works
Fixed-point columns can be converted similar to fixed-point columns with higher precision:
cdf = cd.DataFrame({"vals": [1.0, 2.0, 3.0]}, ctype={"vals": "fp24[precision=10]"})
cdf["vals"] = cdf["vals"].astype("fp24[precision=20]", validate=True)
cdf["cube"] = cdf["vals"] ** 3 # this works
but also the other way around. That is converting to a lower precision:
cdf = cd.DataFrame({"vals": [1.0, 2.0, 3.0]}, ctype={"vals": "fp32[precision=20]"})
cdf["vals"] = cdf["vals"].astype("fp24[precision=10]", validate=True)
Integer columns can be converted to fixed-point columns
cdf = cd.DataFrame({"vals": [1, 2, 3]}, ctype={"vals": "int32"})
cdf["vals"] = cdf["vals"].astype("fp40[precision=10]", validate=True)
cdf["cube"] = cdf["vals"] ** 3 # this won't work (120 bits required)
and also fixed-point columns can be converted to integer types:
cdf = cd.DataFrame({"vals": [1.1, 2.2, 3.3]}, ctype={"vals": "fp32"})
cdf["vals"] = cdf["vals"].astype("int32", validate=True)
Automatic type detection for fixed-points
If no ctype is specified for an fixed-point column, the smallest fixed-point type is selected that is valid for the column:
In this example, the column vals will be derived to be of type
fp24[precision=20] (24-bit fixed-point using 20 bits for precision)
because all values lie in the range from \(-8\) and \(8\) (exclusive).
Note
20 bits are used for the precision, 1 bit is used for the sign, so there are only three bits left for the range bits (the bits left of the decimal point).
Consider a second example:
In this example, the column vals will be derived to be of type
fp32[precision=22], because we need more than 24 bits to represent the
value 2.0 with 22 bits of precision, while reserving 1 sign bit.
Boolean columns
As a special case, the ctype bool will be used for columns that
contain the boolean values True and False, and for columns for which
this ctype is explicitly specified:
cdf = cd.DataFrame({"col1": [True,True,False], # automatically a bool
"col2": [1,0,1]}, # specified using ctype
ctype={"col2": "bool"})
Certain operations, such as filtering on a column and using bitwise
operators &, |, ^, ~ work only on boolean columns. In the
example below, crandas implicitly understands that the column is
boolean.
Note
The engine checks whether the column has 0s and 1s. If so it will automatically treat it as a boolean.
In addition to this, crandas provides several aggregation operations for boolean columns.
df = cd.DataFrame({"a":[True, True, False]})
# Is any value in the column true?
a_any = df["a"].any()
# Are all values in the column true?
a_all = df["a"].all()
Elementwise Operations and Comparisons
In this section, we will demonstrate how to perform basic arithmetic
operations on columns using comparisons in crandas. All of these
operations output a CSeries with the
result of applying each operation to each row of the columns.
Note
The columns to compare may be of either or both integer and fixed-point type.
# Equal to
df_equal_to = df["a"] == df["b"]
# Not equal to
df_not_equal_to = df["a"] != df["b"]
# Less than
df_less_than = df["a"] < df["b"]
# Less than or equal to
df_less_than_equal = df["a"] <= df["b"]
# Greater than
df_greater_than = df["a"] > df["b"]
# Greater than or equal to
df_greater_than_equal = df["a"] >= df["b"]
# Add
df_add = df["a"] + df["b"]
# Subtract
df_subtract = df["a"] - df["b"]
# Multiply
df_multiply = df["a"] * df["b"]
# Negate
df_negate = -df["a"]
# Absolute value
df_absolute = abs(df["a"])
# Public integer exponentiation
df_pow = df["a"] ** 2
# Division
df_div = df["a"] / df["b"]
# Floor division
df_div = df["a"] // df["b"]
# Square root
df_sqroot = df["a"].sqrt()
Tip
The rounding errors of division depend on the magnitude of the numerator. To limit the error please consider using higher precision to represent the numbers.
Note
Division will always result in a fixed-point column. The function will raise an exception if it encounters at least one zero in its inputs so be aware that this might leak sensitive information.
Beyond these standard common operations, crandas has functions that compares two columns and returns a column with either the maximum or the minimum in each row:
# Calculate the minimum between "a" and "b" - returns a series
abmin = cd.series_min(df["a"], df["b"])
# Calculate the maximum between "a" and "b" - returns a series
abmax = cd.series_max(df["a"], df["b"])
Aggregation operations
In addition to this, crandas provides several aggregation operations, which can be applied to columns of either or both integer and/or fixed-point type.
- Sum:
.sum() - Mean:
.mean() - Standard deviation:
.std() - Variance:
.var() - Count:
.count() - Maximum:
.max() - Minimum:
.min() - Sum of squares:
.sum_squares()
df = cd.DataFrame({"a":[1,2,3,4], "b":[5,6,7,8]})
# Sum of a column
a_sum = df["a"].sum()
# Mean of a column
a_mean = df["a"].mean()
# Standard deviation of a column
a_std = df["a"].std()
# Variance of a column
a_var = df["a"].var()
# Count number of items of a column (not NULL elements)
a_count = df["a"].count()
# Maximum of a column
a_max = df["a"].max()
# Minimum of a column
a_min = df["a"].min()
# Sum of squares of a column
a_sum_of_squares = df["a"].sum_squares()
Maximum and minimum can be applied over a whole table, computing the maximum per (numeric) column.
# Compute the maximum of each (numeric) column in df
df_maximums = df.max()
# Compute the minimum of each (numeric) column in df
df_minimums = df.min()
Note
Note the difference between the elementwise functions that compare the
minimum/maximum of two columns (.min()
/ .max()) and
the ones that compute the minimum/maximum of the elements in one
column (.series_min()
/ .series_max()).
Operation overflow
When doing certain operations, one might encounter an operation overflow. An overflow happens when a value is bigger than the upper limit of that representation. For example, eight bits can only represent numbers up to \(256\), so attempting to add \(255 + 255\) will result on an overflow as we would need more bits. Normally, we would not have to worry much about this unless we were using especially big numbers. Unfortunately, that is not the case with the cryptographic engine.
Because all values in the engine are secret, the engine must consider
the possibility that they are the largest allowed type. If we are adding
two values that fit in eight bits (int8) then there is a possibility
that their sum requires one more bit to be represented. Therefore, if we
add many values, we might hit the limit of 96 bits in int96. This will
happen if the values actually are very far away from the limit!
Thankfully, the engine has ways to avoid this problem. The
validate() method takes a
DataFrame and validates that its
columns are a within certain bounds and remembers this for use in
operations. If we have an idea of the maximum value of a column, we can
use this function to avoid an overload error. We can use it to let the
engine know how big might a value possibly be.
# This table only has values between -1 and 1, but saved as the biggest possible integers
tab = cd.DataFrame({"col1":[1,-1,0],"col2":[0,1,0]},ctype={"col1":"int96", "col2":"uint96"})
# Attempting to add columns will give us an overflow error
tab["summed"] = tab["col1"]+tab["col2"]
# We validate that our tables values are between -1 and 1
tab = tab.validate(tab["col1"].in_range(-1,1))
# If this were not to be true, we would get an error
tab = tab.validate(tab["col2"].in_range(0,1))
# Now the operation works again
tab["summed"] = tab["col1"]+tab["col2"]
Working with Vectors
Crandas allows for vector operations on numeric vectors.
- Element-wise addition, including
.sum() - Element-wise product
- Inner product (
.inner()) - Vector sum of vector columns (
.vsum())
df = cd.DataFrame(
{
"col1": [[1, 2, 3], [3, 2, 4], [2, 3, 1]],
"col2": [[0.2, 0.3, 0.3], [0.4, 0.2, 0.6], [0.3, 0.3, 0.3]],
}
)
Addition and multiplication work the same way as if we working with scalars:
# Addition
df["col1"] + df["col2"]
# Aggregate element-wise sum
df["col1"].sum()
# Multiplication
df["col1"] * df["col2"]
Moreover, we also provide vector operations, such as inner product and
the sum of the elements of a vector. These two functions
(.inner() and
.vsum()) are implemented as
functions of the column, so the notation might be slightly unnatural (in
particular for the inner product).
# Assign a new column "inner" with the inner product of "col1" and "col2"
df["inner"] = df["col1"].inner(df["col2"])
# For each row in col1, add all the elements in that vector
df['col1'].vsum()
Generating random data
We can generate secret random numbers using the functions in the crandas.random module.
# Generate a table with a single column that contains 10 rows of uniformly random numbers between 1 and 5 (inclusive)
table = cd.random.randint(1, 5, num_rows=10).as_table()
# Add a column to `table` of uniformly random fixed-point numbers between 0 and 1
table["fix"] = cd.random.random(num_rows=10)
There are even more complex operations that we can do in numeric column, especially for data exploration. For more information on basic statistical functions, see the Descriptive Statistics section.
Now that we know how to work with numeric columns as well as the operations we can do with them, in the next section we will learn how to work with strings.