crandas for Data Scientists¶
Crandas takes its name from pandas, a popular Python package that provides data structures and data analysis tools. The goal of crandas is to provide a seamless experience for analysts who are familiar with pandas, as we use similar functions, names and structures. The main difference is that the operations are done in the Roseman Labs engine, which uses Multi-Party Computation (MPC) to do the computations over encrypted data. In this way, you do not need to worry about the cryptographic details and can perform data analysis in a similar way as if you were working with cleartext data.
However, there are some fundamental differences, which require a change in the usual data analysis workflow. In principle, all data is kept in encrypted (or more precisely, secret-shared) form throughout an analysis; only the desired outcome is revealed, and only to the authorized analyst. Here we describe these changes and explain why they are necessary.
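To give an intuition for what "secret-shared" means, here is a minimal sketch of additive secret sharing using only the Python standard library. It is an illustration of the general technique, not a description of the scheme the Roseman Labs engine actually uses; the modulus, the number of parties and the function names `share`, `reconstruct` and `add_shared` are all assumptions made for this example.

```python
import secrets

# Illustrative modulus only; real systems choose parameters to suit their protocol.
MODULUS = 2**61 - 1


def share(value, n_parties=3):
    """Split an integer into n_parties random shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Recombine shares; only someone holding all of them learns the value."""
    return sum(shares) % MODULUS


def add_shared(a_shares, b_shares):
    """Each party adds its own two shares locally; no input value is revealed."""
    return [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]


salary_a = share(52_000)
salary_b = share(61_000)
total = add_shared(salary_a, salary_b)

# Only the final outcome is opened, not the individual inputs.
result = reconstruct(total)
```

Each individual share is a uniformly random number, so a single party learns nothing about the inputs, yet the sum can still be computed and opened at the end. Crandas hides this machinery behind its pandas-like interface.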
Different workflow¶
The goal of Roseman Labs is to provide a tool that allows you to work with data that would otherwise be inaccessible. Often the data should be kept private, even from the analyst doing the computation. In other words, you should not be able to inspect the data. Of course, you should be able to extract some information from the data, otherwise the tool would not be of much use, but this information should be controlled and limited to what the intended purpose of the analysis requires.
This differs from the usual data analysis workflow, which often involves interactively browsing through data. Normally, an analyst will look at the data before starting their analysis (for example, printing the first rows of each table to understand how the data is structured). These actions are not allowed in crandas by design, as even glimpsing a couple of rows might reveal information that should be kept secret. Therefore, data analysis in crandas must be done without revealing the underlying data. Fortunately, crandas provides tools that do not reveal individual records, and others that ensure that the data conforms to a specific structure.
Accessing information about the data¶
While we cannot access the data directly, crandas provides multiple tools to access important information about the shape of the data.
The simplest one is printing the table, which will not output the data in the table but instead will show information about its dimensions and columns.
The column types, column names and number of rows might not give a good enough idea of the data.
This is why crandas provides a number of exploratory methods, including the DataFrame.describe() method, which shows basic statistics of the columns, such as the mean and standard deviation.
More complex descriptive statistics can be found in Descriptive statistics.
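The idea that printing a table shows its structure rather than its contents can be sketched with a small toy class. `OpaqueTable` below is hypothetical and is not part of the crandas API; it only mimics the behaviour described above, where the printed form gives dimensions and column types while `describe()` returns aggregate statistics. The sample values are made up for illustration.

```python
from statistics import mean, stdev


class OpaqueTable:
    """Hypothetical stand-in for a table whose rows stay hidden from the analyst."""

    def __init__(self, columns):
        # columns: dict mapping column name -> list of values (all the same length)
        self._columns = columns

    def __repr__(self):
        # Printing shows structure only: row count, column names and types.
        n_rows = len(next(iter(self._columns.values()), []))
        schema = ", ".join(
            f"{name}: {type(values[0]).__name__ if values else 'unknown'}"
            for name, values in self._columns.items()
        )
        return f"OpaqueTable(rows={n_rows}, columns=[{schema}])"

    def describe(self):
        # Aggregate statistics per numeric column; no individual record is returned.
        return {
            name: {"mean": mean(values), "std": stdev(values)}
            for name, values in self._columns.items()
            if all(isinstance(v, (int, float)) for v in values)
        }


table = OpaqueTable(
    {"age": [31, 45, 27, 52], "city": ["Breda", "Delft", "Breda", "Gouda"]}
)
printed = repr(table)
summary = table.describe()
```

In this toy version the printed form contains no cell values, and the summary only covers the numeric column. In crandas the same separation is enforced by the engine rather than by a Python class.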
It is important to note, however, that these operations still require approval before they can be executed, and therefore cannot be done interactively. Crandas only reveals data when you explicitly ask for it [1] and the engine is authorized to fulfill the request. In certain contexts, the information revealed by these operations may already be sensitive, so the use of these functions must also be approved. Therefore, a fully interactive workflow is not compatible with crandas. On the other hand, it is possible to open intermediate results when designing scripts on dummy data. To get used to analyzing encrypted data, it is a good exercise to reveal data only in the final output step of your analysis.
Table schemas¶
When uploading data to the engine through the Roseman Labs platform, it is possible to define and enforce certain characteristics that the data must follow. This is especially important because it is hard to clean data that has already been uploaded to the system (cleaning operations would also need to be pre-approved, and we cannot inspect the data manually to see whether it was properly cleaned). However, this rigidity when uploading data means that we already know what to expect with regard to the structure of the data, potentially including additional information such as minimum and maximum bounds. It also forces the data to be (relatively) clean, as it must be cleaned up and reshaped before it is uploaded into the engine.
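As a rough sketch of what checking data against a schema before upload could look like, the following standard-library example validates rows against expected types and bounds. The `SCHEMA` layout, the column names and the `validate` function are assumptions made for illustration; they are not the schema format used by the Roseman Labs platform.

```python
# Hypothetical schema: column name -> (expected type, minimum, maximum).
SCHEMA = {
    "age": (int, 0, 120),
    "income": (float, 0.0, 1_000_000.0),
}


def validate(rows, schema):
    """Return a list of error messages; an empty list means the rows conform."""
    errors = []
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            errors.append(f"row {i}: columns {sorted(row)} do not match schema")
            continue
        for name, (expected_type, low, high) in schema.items():
            value = row[name]
            if not isinstance(value, expected_type):
                errors.append(f"row {i}: {name} is not {expected_type.__name__}")
            elif not low <= value <= high:
                errors.append(f"row {i}: {name}={value} outside [{low}, {high}]")
    return errors


clean = [{"age": 34, "income": 41_500.0}]
dirty = [{"age": -3, "income": 41_500.0}, {"age": 50, "income": "n/a"}]

clean_errors = validate(clean, SCHEMA)  # no problems found
dirty_errors = validate(dirty, SCHEMA)  # one bound violation, one type error
```

Running such a check on cleartext data before uploading catches problems while they can still be inspected and fixed, which is the point of enforcing the structure at upload time.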
Domain separation¶
Crandas works with tabular data, using the same model as pandas.
It also works with Series and DataFrame objects.
The way you interact with them is essentially the same.
You can select a specific column or take a slice of the table in the same way.
However, a pandas DataFrame and a crandas DataFrame cannot interact directly.
For example, we cannot merge a DataFrame in the engine and a pd.DataFrame with data in the clear.
However, we can upload that pd.DataFrame using upload_pandas_dataframe() and then merge it in the engine.
Of course, this means that we can no longer access the data of our original pd.DataFrame in the merged table, as it also has been encrypted.
Approval¶
The main difference between using pandas and crandas in practice is the need for approvals.
Because the data found in the engine is highly sensitive, every disclosure of it must be approved beforehand.
The Roseman Labs engine has two modes: design and authorized.
Design mode allows for every query to be executed with no prior approval but should not be used with any real data, only dummy data that has the same structure.
This mode is used to generate scripts that will be run in authorized mode after being approved.
Authorized mode, on the other hand, contains the real data and only queries that are approved can be executed there.
This includes the data exploration queries mentioned in the previous sections.
See The Approval Workflow section for more information about how the approval process works.
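The difference between the two modes can be illustrated with a toy model. `ToyEngine` and `fingerprint` below are hypothetical names, and matching queries by a SHA-256 hash is an assumption chosen to keep the example short; the engine's actual approval mechanism is described in the approval workflow documentation and may work quite differently.

```python
import hashlib


def fingerprint(query):
    """Stable identifier for a query, used to match it against approvals."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


class ToyEngine:
    """Hypothetical engine illustrating design versus authorized mode."""

    def __init__(self, mode, approved=()):
        if mode not in ("design", "authorized"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self._approved = {fingerprint(q) for q in approved}

    def run(self, query):
        # Design mode (dummy data only) runs anything; authorized mode
        # runs only queries that were approved beforehand.
        if self.mode == "authorized" and fingerprint(query) not in self._approved:
            raise PermissionError("query has not been approved")
        return f"executed: {query}"


script = "mean(age) grouped by city"

design = ToyEngine("design")
design_result = design.run(script)  # no approval needed on dummy data

production = ToyEngine("authorized", approved=[script])
production_result = production.run(script)  # approved beforehand, so it runs

try:
    production.run("select * from patients")  # never approved
    blocked = False
except PermissionError:
    blocked = True
```

The same script that was developed freely in design mode runs unchanged in authorized mode once approved, while any query outside the approved set is refused.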
Similarities between crandas and pandas¶
After all of these differences, you might wonder what is similar about pandas and crandas.
Crandas uses a tabular model similar to that of pandas.
It also works with DataFrames and Series (although we call ours CSeries).
The function names, signatures and use are almost the same as in pandas.
You can select a specific column or take a slice of the table in the same way that you would before.
Many functionalities available in pandas function the same way in crandas.
However, some small differences are necessary to maintain the privacy guarantees [2].
Most of the time, if you are able to do something in pandas, the way to do it in crandas will look almost the same.
Similarities with other packages¶
Beyond being a DataFrame library, crandas can also run machine learning algorithms, such as regressions and random forests. Instead of creating an API from scratch, we mimic other popular Python packages, such as scikit-learn (for regressions) and SciPy (for descriptive statistics). All of this is done with the goal of minimizing the friction of switching from cleartext data to encrypted data.