Data types¶
This section provides an overview of which data types the Virtual Data Lake (VDL) supports and how we can convert between data types using crandas.
The supported data types include:
int8, int16, int24, int32, int40, int48, int56, int64, int72, int80, int88, int96
uint8, uint16, uint24, uint32, uint40, uint48, uint56, uint64, uint72, uint80, uint88, uint96
vec_int (vector of integers)
varchar (text)
date
bytes (binary data)
bool (as integer)
fixed-point numbers [1]
Added in version 1.8: Integers of size int40 and larger, and dates.
Added in version 1.7: Fixed-point numbers
For specific details about numeric values, see the next section.
For each data type, a “nullable” variant that can hold missing values is also supported.
Nullable data types are denoted with a question mark, e.g., int8?, bytes?.
Nullable data types have some caveats; read on or go to their own section for details.
Warning
When boolean values are uploaded into the VDL, they are transformed to integers and therefore take the values 0 and 1 instead of False and True respectively.
CDataFrames allow for missing values, but not by default, so they must be specified.
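As a local illustration of the warning above, the following plain-Python sketch shows the 0/1 encoding that boolean values take. It does not connect to the VDL or use crandas; the variable names are made up for this example.

```python
# Plain-Python illustration only; no VDL connection or crandas call is involved
flags = [True, False, True]

# Booleans map to integers: True becomes 1 and False becomes 0
as_integers = [int(flag) for flag in flags]

# With the integer encoding, counting the True values is simply a sum
number_of_true = sum(as_integers)
```

One practical consequence is that results computed on such a column should be interpreted as integers 0 and 1 rather than as False and True.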
Data Type Conversion¶
Crandas converts data types with the CSeries.astype() method. The following example shows how to convert a column of strings to a column of integers.
Note
This is the only type conversion currently supported.
import crandas as cd
# Create a crandas DataFrame with a string column
uploaded = cd.DataFrame({"vals": ["1","2","3","4"]})
# Convert the string column to an int column
uploaded = uploaded.assign(vals2=uploaded["vals"].astype(int))
The above example converts the string column vals to a new integer column called vals2.
For a more in-depth look at specifying integer types, go to the next section.
Note
It is also possible to specify the desired type while uploading the data, using ctype={"val": "varchar[9]"}.
ctypes¶
Because the VDL uses highly specialized algorithms to compute on secret data, it uses a specialized typing system that is more fine-grained than what pandas uses.
Crandas implements this type system, which we call ctypes (similarly to pandas’ dtypes).
In certain situations, it is important to specify the exact type of our data:
import crandas as cd
import pandas as pd
from crandas import ctypes
# Specify data types for the DataFrame
table = cd.DataFrame(
{"ints": [1, 2, 3], "strings": ["a", "bb", "ccc"]},
ctype={"ints": "int8", "strings": "varchar[5]"},
)
In the above example, the ints column is given the int8 data type, a non-nullable integer type (crandas.ctypes.NonNullableInteger()), and the strings column is given the varchar[5] data type (a string with at most 5 characters).
Hint
If the column contains missing/null values, this can be specified by adding ? after the ctype (e.g., int8?, varchar?[5]).
Crandas also supports other data types, such as byte arrays:
from uuid import uuid4
# Create a DataFrame with UUIDs stored as bytes
df = cd.DataFrame({"uuids": [uuid4().bytes for _ in range(5)]}, ctype="bytes")
You can also specify types through pandas’ typing, known as dtypes. Note that not all dtypes have an equivalent ctype.
# Create a DataFrame with multiple data types
df = cd.DataFrame(
{
"strings": pd.Series(["test", "hoi", "ok"], dtype="string"),
"int": pd.Series([1, 2, 3], dtype="int"),
"int64": pd.Series([23, 11, 91238], dtype="int64"),
"int32": pd.Series([12831, 1231, -1231], dtype="int32"),
}
)
Type detection from data¶
For some types of column, crandas can derive the exact column type from the data being uploaded.
This may lead to ColumnBoundDerivedWarning warnings, e.g.:
ColumnBoundDerivedWarning: Column type for column a (uint8) was automatically
derived from the input data, see User Guide->Data types->Automatic type detection
for more information.
This warning is given if no exact ctype (e.g., uint8, varchar[ascii]) is given for the column.
For example, for an integer column, if no size is given, the smallest integer type is selected that is valid for the column, e.g.:
cdf = cd.DataFrame({"vals": [1,2,3]})
In this example, the column vals will be derived to be of type uint8 (unsigned 8-bit integer) because all values lie in the range from 0 to 255 (inclusive). See Automatic type detection for integers.
Similarly, for varchar (text) columns, it will be derived from the data whether the column is of ASCII or unicode type. See String encoding and efficiency.
Note that, because the column type is derived from the data, this potentially leaks information.
For example, if one party uploads a list of salaries that are all smaller than 65536 and another party uploads a list that contains higher salaries, then the first column will have column type uint16 and the second column will have column type uint24. The first party can prevent this by explicitly assigning the ctype uint24 to the uploaded column.
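The derivation rule described above ("the smallest integer type that is valid for the column") can be sketched in plain Python. This is an illustration under assumptions, not the crandas implementation: the helper name smallest_int_ctype is hypothetical, unsigned types are assumed for columns without negative values (as in the uint8 example), and signed types are assumed to follow the usual two's-complement ranges.

```python
def smallest_int_ctype(values):
    """Return the name of the narrowest integer type that can hold all values.

    Hypothetical helper for illustration; it does not call crandas.
    Assumes unsigned types when no value is negative, and two's-complement
    ranges for signed types.
    """
    lowest, highest = min(values), max(values)
    # The supported widths run from 8 to 96 bits in steps of 8
    for bits in range(8, 97, 8):
        if lowest >= 0 and highest < 2**bits:
            return f"uint{bits}"
        if lowest < 0 and -(2 ** (bits - 1)) <= lowest and highest < 2 ** (bits - 1):
            return f"int{bits}"
    raise ValueError("values do not fit in a 96-bit integer type")


small_values = smallest_int_ctype([1, 2, 3])
first_party = smallest_int_ctype([52000, 65535])   # all salaries below 65536
second_party = smallest_int_ctype([52000, 70000])  # contains a higher salary
```

The two salary lists end up with different derived types, which is exactly the difference an observer could learn from. Passing an explicit ctype when uploading avoids relying on the derived type.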
When type information is detected from data, the user gets a warning about this, e.g., ColumnBoundDerivedWarning: Column "vals" was automatically derived to be of type uint8.
These warnings can be suppressed with the standard Python mechanism, e.g.,
import warnings
import crandas as cd
# Suppress warnings about automatically derived column types
warnings.simplefilter("ignore", category=cd.ctypes.ColumnBoundDerivedWarning)
cdf = cd.DataFrame({"vals": [1,2,3]})
It is also possible to provide auto_bounds=True as an argument to data uploading functions, or to set auto_bounds=True globally for a crandas session:
import crandas as cd
# Suppress ColumnBoundDerivedWarning globally...
cd.base.session.auto_bounds = True
# or for a single upload
cdf = cd.DataFrame({"vals": [1,2,3]}, auto_bounds=True)
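Because the suppression relies on the standard Python warnings module, it can also be limited to a single block with warnings.catch_warnings(). The sketch below shows that mechanism without a VDL connection: the warning class and the upload function are hypothetical stand-ins for cd.ctypes.ColumnBoundDerivedWarning and a crandas upload, and record=True is used only so the effect can be inspected.

```python
import warnings


class ColumnBoundDerivedWarning(UserWarning):
    """Hypothetical stand-in for cd.ctypes.ColumnBoundDerivedWarning."""


def upload(values):
    """Hypothetical stand-in for an upload that derives the column type."""
    warnings.warn("Column type was automatically derived", ColumnBoundDerivedWarning)
    return list(values)


# Inside this block the warning category is ignored
with warnings.catch_warnings(record=True) as caught_inside:
    warnings.simplefilter("ignore", category=ColumnBoundDerivedWarning)
    upload([1, 2, 3])

# The filter above was dropped when the block ended, so the warning is reported again
with warnings.catch_warnings(record=True) as caught_outside:
    warnings.simplefilter("always")
    upload([1, 2, 3])
```

Scoping the filter this way keeps the warning visible for other uploads, where an automatically derived type may be unintended.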
Working with missing values¶
Crandas can work with null values, although this requires extra care. Columns do not allow null values by default, but nullable columns can be created in multiple ways. Whenever a column with missing values is added, the VDL will determine that the column can have null values. Additionally, it is possible to specify that a column allows null values when uploading it, even if the column currently does not contain any such values.
Warning
When using a column with missing values in combination with script signing, it is advisable to explicitly specify that the column allows null values, by defining the ctype as discussed below.
This way, there will not be a mismatch between the approved analysis and the performed analysis, even if the dummy or actual data does not contain null values.
For example, the following code designates the column ints as allowing missing values, even if none of the uploaded values are missing.
import crandas as cd
table = cd.DataFrame(
{"ints": [1, 2, 3], "strings": ["a", "bb", None]},
ctype={"ints": "int32?"},
)
Hint
When specifying ctypes for columns with missing values you can use int32? or any other supported data type with ? (this indicates missing values).
Both columns created in this example allow for null values: the first because it was explicitly specified, and the second because it contains a null value.
Crandas considers the same values to be null as pandas; in particular, this includes None, pandas.NA, and numpy.nan.
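Since the null values are stated to match those of pandas, a quick way to check locally how a value will be treated is pandas.isna. The snippet below is a small sketch that uses only pandas and numpy, with no VDL connection; the dictionary labels are made up for this example.

```python
import numpy as np
import pandas as pd

# Candidate values, labelled for readability
candidates = {
    "None": None,
    "pandas.NA": pd.NA,
    "numpy.nan": np.nan,
    "empty string": "",
    "zero": 0,
}

# pandas.isna reports which of these pandas treats as missing
is_null = {label: bool(pd.isna(value)) for label, value in candidates.items()}
```

In pandas, the empty string and zero are ordinary values rather than missing ones, so a column containing only such values is not made nullable by them.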
To turn a nullable column into a non-nullable one, the fillna function can be used. For example, the following code replaces all missing values of a string column with the string empty:
import crandas as cd
cdf = cd.DataFrame({"a": ["test", None]})
cdf = cdf.assign(a=lambda x: x.a.fillna("empty"))
Numeric types have additional particularities that are important to know, both in the typing system and because we can do arithmetic operations over them. The next section deals with these types.