Data types
This section provides an overview of the data types used by the engine and how to convert between them using crandas.
The supported data types include:
| Type | Datatypes |
|---|---|
| Integers | int8, int16, int24, int32, int40, int48, int56, int64, int72, int80, int88, int96 |
| Unsigned integers | uint8, uint16, uint24, uint32, uint40, uint48, uint56, uint64, uint72, uint80, uint88, uint96 |
| Fixed-point | fp8, fp16, fp24, fp32, fp40, fp48, fp56, fp64, fp72, fp80, fp88, fp96 |
| Integer vector | int_vec |
| Text | varchar[unicode], varchar[ascii] |
| Date | date |
| Boolean | bool |
| Binary data | bytes |
For specific details about numeric values, see the next section.
For each data type, a nullable variant is also supported that can
hold missing values. Nullable data types are denoted with a
question mark, e.g., int8?, bytes?. Nullable data types have some
caveats; read on or go to
their own section for details.
DataFrames allow for missing values,
but not by default, so nullable columns must be specified explicitly.
Data Type Conversion
Crandas supports converting data types through the
CSeries.astype() method. The
following example shows how to convert a column of strings
to a column of integers.
Note
Currently, the only cross-type conversions that are supported are from
string to int and from int to fp. It is also possible to convert
an integer column to a different size (e.g. int16 to int8 or
int64) or to convert a fixed-point column to a higher (but not lower)
precision (e.g. fp24[precision=10] to fp24[precision=20]).
import crandas as cd
# Create a crandas DataFrame with a string column
uploaded = cd.DataFrame({"vals": ["1","2","3","4"]})
# Convert the string column to an int column
uploaded["vals2"] = uploaded["vals"].astype(int)
The above example converts the string column vals to a new integer
column called vals2.
For a more in-depth look at specifying integer types, go to the next section.
Tip
It is also possible to specify the desired type while uploading the
data, using ctype={"val": "varchar[9]"}.
ctypes
Because the engine uses highly specialized algorithms to compute on
secret data, it uses a specialized typing system that is more
fine-grained than what pandas uses. Crandas implements this type system,
which we call ctypes (similarly
to dtypes in pandas). In certain situations, it is important to
specify the exact type of our data:
import crandas as cd
import pandas as pd
from crandas import ctypes
# Specify data types for the DataFrame
table = cd.DataFrame(
{"ints": [1, 2, 3], "strings": ["a", "bb", "ccc"]},
ctype={"ints": "int8", "strings": "varchar[5]"},
)
Or alternatively:
(...)
# Specify data types for the DataFrame
table = cd.DataFrame(
{"ints": cd.Series([1, 2, 3], ctype="int8"),
"strings": cd.Series(["a", "bb", "ccc"], ctype="varchar[5]")}
)
In the above example, the ints column is defined with an 8-bit
integer data type that cannot contain missing values,
and the strings column is defined with a varchar[5] data type
(string with at most 5 characters).
Tip
If the column contains missing/null values, this can be specified by
adding ? after the ctype (e.g. int8?, varchar?[5]).
Crandas also supports other data types, such as byte arrays:
from uuid import uuid4
# Create a DataFrame with UUIDs stored as bytes
df = cd.DataFrame({"uuids": [uuid4().bytes for _ in range(5)]}, ctype="bytes")
You are also able to specify types through pandas typing (dtypes).
Note that not all dtypes have an equivalent ctype.
# Create a DataFrame with multiple data types
df = cd.DataFrame(
{
"strings": pd.Series(["test", "hoi", "ok"], dtype="string"),
"int": pd.Series([1, 2, 3], dtype="int"),
"int64": pd.Series([23, 11, 91238], dtype="int64"),
"int32": pd.Series([12831, 1231, -1231], dtype="int32"),
}
)
It is possible to retrieve the ctype of a crandas DataFrame or a column
by using its .ctype attribute (see
DataFrame.ctype,
CSeriesColRef.ctype,
Col.ctype), for
example:
> cdf=cd.DataFrame({"a": [1], "b": "11"})
> print(cdf.ctype)
{'a': 'int[min=0,max=255]', 'b': 'varchar[2,ascii]'}
> print(cdf["a"].ctype)
int[min=0,max=255]
Using ctypes and schemas
As shown above, the ctype argument can be used to specify data types
of individual columns for upload functions such as
upload_pandas_dataframe(). Note that
this ctype includes metadata provided by the user at upload as well as
metadata (such as bounds; see Type detection from data) derived by the engine.
Instead of this, it is also possible to specify the data types of all
columns at the same time, by using the schema argument. The schema
specifies the order, names, and types of all columns to be uploaded, but
it does not contain engine-derived metadata.
Both ctype and schema are specified using a dictionary that maps
column names to a Ctype.
Although they take values of the same type, their use is different. A
schema is a "full blueprint" containing an ordered dictionary of
column names and types. The ctype argument is used when uploading a
table that does not need to conform to a particular layout, but the user
would like to specify some additional type information.
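The difference between the two arguments can be pictured with a small pure-Python model. This is only a conceptual sketch: matches_schema and apply_ctype_hints are hypothetical helpers that are not part of crandas, and plain dictionaries of type names stand in for real tables.

```python
def matches_schema(columns, schema):
    """A schema is a full blueprint: names, order, and types must all agree."""
    return list(columns.items()) == list(schema.items())


def apply_ctype_hints(detected, ctype):
    """A ctype mapping only overrides the columns it names; the rest is kept."""
    return {name: ctype.get(name, typ) for name, typ in detected.items()}


schema = {"Survived": "int", "Pclass": "fp"}

# Same columns in the same order: conforms to the blueprint
print(matches_schema({"Survived": "int", "Pclass": "fp"}, schema))  # True

# Same columns in a different order: does not conform
print(matches_schema({"Pclass": "fp", "Survived": "int"}, schema))  # False

# A partial ctype hint leaves the unnamed column untouched
print(apply_ctype_hints({"a": "uint8", "b": "varchar"}, {"a": "int16"}))
```

The model relies on Python dictionaries preserving insertion order, which is what makes the order check meaningful.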
Both allow mapping to a Ctype that does
not have fully specified bounds: for example, in
both the ctype and schema arguments a particular column can be
specified as either int or int16. However, obtaining the schema for an
existing DataFrame
currently does not specify any bounds, e.g. it will always return int
even if the column was specified as int16.
Note
The ctype and schema arguments are mutually exclusive, i.e. only
one can be used at a time.
For example, the following specifies that the given CSV will be uploaded with the given column names in the specified order and with the specified types:
cd.read_csv(
"titanic_scaled.csv",
auto_bounds=True,
schema={"Survived": "int", "Pclass": "fp", "Sex": "int", "Age": "fp"},
)
When using script recording, schemas can be used to ensure that there is
an exact correspondence between the (dummy) data when recording a
script, and the (production) data used when executing the script. To
this end, in both cases, the same schema needs to be specified as
argument to upload functions (cdf=cd.DataFrame(..., schema=...);
cdf=cd.read_csv(..., schema=...);
cdf=cd.upload_pandas_dataframe(..., schema=...); etc), and/or to
cd.get_table(..., schema=...).
To learn the schema of data available in pandas format without actually
uploading the data, the functions
pandas_dataframe_schema()
and read_csv_schema() can
be used, for example:
>>> cd.read_csv_schema("titanic_scaled.csv", auto_bounds=True, ctype={"Survived": "int"})
{'Survived': 'int', 'Pclass': 'fp', 'Sex': 'int', 'Age': 'fp'}
It is also possible to retrieve the schema of an existing
DataFrame by using its .schema
attribute (see DataFrame.schema,
CSeriesColRef.schema,
Col.schema). A known limitation is that,
currently, when
retrieving a DataFrame from the
engine, the fine-grained information from the original schema (for
example, bit lengths) is lost. Consider for example:
>>> cdf = cd.DataFrame({"a": [1], "b": "11"}, ctype={"a": "int16"})
>>> print(cdf.schema)
{'a': 'int16', 'b': 'varchar'}
>>> cdf = cd.get_table(cdf.handle)
>>> print(cdf.schema) # the fact that `a` is a 16-bit number is lost
{'a': 'int', 'b': 'varchar'}
However, whether a column is nullable is part of the schema. For example:
>>> cdf = cd.DataFrame({"a": [1, pd.NA], "b": "11"}, ctype={"a": "int16?"})
>>> print(cdf.schema)
{'a': 'int16?', 'b': 'varchar'}
>>> cdf = cd.get_table(cdf.handle)
>>> print(cdf.schema)
{'a': 'int?', 'b': 'varchar'}
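When comparing a schema recorded at upload time with one obtained after retrieval, it can help to first reduce both to the coarser form. The following is a minimal sketch of that idea; coarse_ctype is a hypothetical helper (not a crandas function) and only handles the signed integer names shown in the examples above.

```python
import re


def coarse_ctype(ctype):
    """Drop the bit length from an integer ctype name, keeping the '?' marker."""
    # "int16" -> "int", "int16?" -> "int?"; other names are returned unchanged
    return re.sub(r"^int\d+", "int", ctype)


uploaded_schema = {"a": "int16?", "b": "varchar"}
retrieved_schema = {"a": "int?", "b": "varchar"}

# Coarsen the uploaded schema before comparing it with the retrieved one
coarsened = {name: coarse_ctype(typ) for name, typ in uploaded_schema.items()}
print(coarsened == retrieved_schema)  # True
```

Nullability survives the coarsening, which mirrors the behaviour shown above: the bit length is lost, the question mark is not.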
Given a schema, it is also possible to create a zero-row crandas
DataFrame matching the schema by
using cdf = cd.DataFrame(schema=...) (without supplying any data).
It is also possible to manually create dummy data that adheres to a
given schema, by creating columns that have ctypes corresponding to the
respective schema ctypes. For example, given a schema
{'a': 'int', 'b': 'varchar'}, the following creates a table of dummy
data that matches the schema:
dummy_data = cd.DataFrame({
"a": cd.Series([], ctype="int"),
"b": cd.Series([], ctype="varchar"),
})
The dummy data can easily be instantiated by replacing the [] with
actual values.
Finally, to convert an already existing crandas
DataFrame to a given schema (if this
conversion is possible), cdf.astype(schema=...) can be used (see
DataFrame.astype()).
Type detection from data
For some types of column, crandas can derive the exact column type from
the data being uploaded. This may lead to ColumnBoundDerivedWarning
warnings:
ColumnBoundDerivedWarning: Column type for column a (uint8) was automatically
derived from the input data, see User Guide->Data types->Automatic type detection
for more information.
This warning is given if no exact ctype (e.g., uint8,
varchar[ascii]) is given for the column.
For example, for an integer column, if no size is given, the smallest integer type that is valid for the column is selected.
For instance, a column vals will be derived to be of type uint8
(unsigned 8-bit integer) when all its values lie in the range from 0 to
255 (inclusive). See Automatic type detection for integers.
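The selection rule can be illustrated with a minimal pure-Python sketch. The helper smallest_int_ctype is hypothetical and not part of crandas. It assumes that widths come in 8-bit steps up to 96 bits (as in the table of supported types), that an unsigned type is chosen when no value is negative, and that signed types cover the usual two's complement range. The engine's actual detection logic may differ.

```python
def smallest_int_ctype(values):
    """Return the name of the narrowest integer type that holds all values.

    Hypothetical helper for illustration only; not a crandas function.
    """
    lo, hi = min(values), max(values)
    for bits in range(8, 97, 8):  # 8, 16, ..., 96
        if lo >= 0:
            # Unsigned range: 0 .. 2**bits - 1
            if hi <= 2**bits - 1:
                return f"uint{bits}"
        else:
            # Signed range (assumed): -2**(bits-1) .. 2**(bits-1) - 1
            if -(2 ** (bits - 1)) <= lo and hi <= 2 ** (bits - 1) - 1:
                return f"int{bits}"
    raise ValueError("values do not fit in 96 bits")


vals = [1, 2, 3]
print(smallest_int_ctype(vals))  # uint8
```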
Similarly, for varchar (text) columns, it will be derived from the data whether the column is of ASCII or unicode type. See String encoding and efficiency.
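A comparable sketch for text columns is shown below. The helper detect_varchar_ctype is hypothetical (not a crandas function) and simply assumes that a column counts as ASCII when every string consists only of ASCII characters; it uses the built-in str.isascii() for the check.

```python
def detect_varchar_ctype(strings):
    """Classify a list of strings as an ASCII or unicode text column."""
    # str.isascii() is True only if every character is in the ASCII range
    if all(s.isascii() for s in strings):
        return "varchar[ascii]"
    return "varchar[unicode]"


print(detect_varchar_ctype(["test", "hoi", "ok"]))  # varchar[ascii]
print(detect_varchar_ctype(["test", "café"]))       # varchar[unicode]
```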
Note that, because the column type is derived from the data, this
potentially leaks information. For example, if one party uploads a list
of salaries that are all smaller than 65536 and another party uploads
a list of salaries that contains higher salaries, then the first column
will have column type uint16 and the second column will have column
type uint24, so the column types alone reveal which list contains the
higher salaries. The first party can prevent this by explicitly
assigning the ctype uint24 to the uploaded column.
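The salary scenario can be reproduced with plain Python. The helper derived_unsigned_width is hypothetical, the salary figures are made up for illustration, and the sketch assumes non-negative values and widths in 8-bit steps.

```python
def derived_unsigned_width(values):
    """Bits of the narrowest unsigned type (in 8-bit steps) holding all values."""
    bits_needed = max(max(values).bit_length(), 1)
    return -(-bits_needed // 8) * 8  # round up to a multiple of 8


party_a = [42_000, 58_500, 65_535]   # all smaller than 65536
party_b = [42_000, 58_500, 120_000]  # contains a higher salary

print(f"uint{derived_unsigned_width(party_a)}")  # uint16
print(f"uint{derived_unsigned_width(party_b)}")  # uint24
```

The differing widths are exactly the information that leaks. If both parties upload with an explicit uint24 ctype, the column types no longer distinguish the two lists.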
When type information is detected from data, the user gets the ColumnBoundDerivedWarning shown above.
These warnings can be suppressed with the standard Python mechanism:
import warnings
# Suppress warnings about automatically derived column types
warnings.simplefilter("ignore", category=cd.ctypes.ColumnBoundDerivedWarning)
cdf = cd.DataFrame({"vals": [1,2,3]})
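To limit the suppression to a single block of code instead of the whole session, the standard library's warnings.catch_warnings() context manager can be used. The sketch below runs without crandas: the warning class and the upload function are stand-ins (assumptions) for cd.ctypes.ColumnBoundDerivedWarning and for an upload call such as cd.DataFrame(...).

```python
import warnings


class ColumnBoundDerivedWarning(UserWarning):
    """Stand-in for cd.ctypes.ColumnBoundDerivedWarning (assumption)."""


def upload(values):
    # Hypothetical placeholder for an upload call that triggers the warning
    warnings.warn("Column type was automatically derived", ColumnBoundDerivedWarning)
    return list(values)


with warnings.catch_warnings():
    # The filter only applies inside this block
    warnings.simplefilter("ignore", category=ColumnBoundDerivedWarning)
    data = upload([1, 2, 3])

# After the block, the previous warning filters are restored
```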
It is also possible to provide auto_bounds=True as argument to data
uploading functions (see Query Arguments), or
to set the configuration setting (see crandas.config) auto_bounds to True:
import crandas as cd
import crandas.config
# Suppress ColumnBoundDerivedWarning globally...
cd.config.settings.auto_bounds = True
# or for a single upload
cdf = cd.DataFrame({"vals": [1,2,3]}, auto_bounds=True)
Working with missing values
Crandas can work with null values, although this requires extra care. Columns do not allow null values by default, but this can be changed in multiple ways. Whenever a column with missing values is added, the engine determines that the column can hold null values. Additionally, it is possible to specify that a column allows null values when uploading it, even if the column currently does not contain any such values.
Important
When using a column with missing values in combination with script
signing, it is advisable to explicitly specify that the column allows
null values, by defining the ctype. This way, there will not be a
mismatch between the approved analysis and the performed analysis, even
if the dummy or actual data does not contain null values.
The following code allows the column ints to hold missing values, even
if none of the uploaded values are missing.
table = cd.DataFrame(
{"ints": [1, 2, 3], "strings": ["a", "bb", None]},
ctype={"ints": "int32?"},
)
Tip
To specify ctypes for columns with missing values, use a question mark
? at the end of the ctype (e.g. int32?).
Both columns created in this example allow for null values. The first
one because it was explicitly specified and the latter because it
contains a null value. Crandas considers the same values to be null as
pandas; in particular, this includes None, pandas.NA, and
numpy.nan.
To turn a nullable column into a non-nullable one, the
.fillna() function can be used.
For example, calling .fillna("empty") on a string column replaces all
of its missing values by the string empty.
Numeric types have additional particularities that are important to know, both in the typing system and because we can do arithmetic operations over them. The next section deals with these types.