Crandas for Data Scientists
Crandas takes its name from pandas, a popular Python package that provides data structures and data analysis tools. The goal of crandas is to provide a seamless experience for analysts who are familiar with pandas, as we use similar functions, names and structures. The main difference is that the operations are done in the Roseman Labs engine, which uses Multi-Party Computation (MPC) to do the computations over encrypted data. In this way, you do not need to worry about the cryptographic details and can perform data analysis in a similar way as if you were working with cleartext data.
However, there are some fundamental differences, which require a change in the usual data analysis workflow. In principle, all data is kept in encrypted (or more precisely, secret-shared) form throughout the analysis. Only the desired outcome is revealed, and only to the authorized analyst. Here we describe these changes and explain why they are necessary.
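To give some intuition for what "secret-shared" means, here is a minimal sketch of additive secret sharing in plain Python. It is a toy illustration only: the modulus, the number of parties and the function names are assumptions made for this example, and it does not represent the protocol the Roseman Labs engine actually uses.

```python
import secrets

# Toy additive secret sharing over a prime field. Illustration only,
# not the protocol used by the Roseman Labs engine.
PRIME = 2**61 - 1  # a Mersenne prime, used here as the modulus


def share(value, n_parties=3):
    """Split `value` into n_parties random shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares):
    """Recombine all shares; any strict subset looks uniformly random."""
    return sum(shares) % PRIME


def add_shared(a_shares, b_shares):
    """Each party adds its own two shares locally, without communication."""
    return [(a + b) % PRIME for a, b in zip(a_shares, b_shares)]


salary_a = share(52_000)
salary_b = share(61_000)
total = add_shared(salary_a, salary_b)
print(reconstruct(total))  # 113000
```

No single party ever holds either salary, yet the sum can be computed and opened. Only the final result is reconstructed, which mirrors the workflow described above.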
Different workflow
The goal of Roseman Labs is to provide a tool that allows you to work with data that would otherwise be inaccessible. Often the data should be kept private, even from the analyst doing the computation. In other words, you should not be able to inspect the data. Of course, you should be able to extract some information from the data, otherwise the tool would not have much use, but this information should be controlled and limited to what the intended purpose of the analysis requires.
This differs from the usual data analysis workflow, which often involves interactively browsing through data. Normally, an analyst will access the data before starting their analysis (for example, printing the first rows of each table to understand how the data is structured). These actions are not allowed in crandas by design, as even glimpsing a couple of rows might reveal information that should be kept secret. Therefore, data analysis in crandas can only be done without revealing data. Fortunately, crandas provides tools that do not reveal individual records, and others that ensure that the data conforms to a specific structure.
Accessing information about the data
While we cannot access the data directly, crandas provides multiple
tools to access important information about the shape of the data. The
simplest one is printing the table, which will not output the data in
the table but instead will show information about its dimensions and
columns. The column types, column names and number of rows might
not be enough to form a clear picture of the data. This is why
crandas provides a number of exploratory methods, including the
DataFrame.describe() method that will
show basic statistics of the columns, such as the mean and standard
deviation. More complex descriptive statistics can be found here.
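The idea that printing a table shows its shape rather than its contents can be sketched with a small stand-in class. SecretTable and its methods are hypothetical names invented for this illustration and are not part of crandas. The sketch also leaves out the approval step that the real engine requires.

```python
class SecretTable:
    """Toy stand-in for a table held by the engine (hypothetical class)."""

    def __init__(self, columns):
        # columns: dict mapping column name -> list of values
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self._columns = columns  # never exposed directly

    def __repr__(self):
        # Show only dimensions, column names and column types.
        n_rows = len(next(iter(self._columns.values()), []))
        cols = ", ".join(
            f"{name}: {type(values[0]).__name__ if values else 'unknown'}"
            for name, values in self._columns.items()
        )
        return f"SecretTable(rows={n_rows}, columns=[{cols}])"

    def mean(self, name):
        """An aggregate: reveals a single value, not individual records."""
        values = self._columns[name]
        return sum(values) / len(values)


table = SecretTable({"age": [31, 45, 27], "city": ["Breda", "Delft", "Breda"]})
print(table)  # SecretTable(rows=3, columns=[age: int, city: str])
print(table.mean("age"))
```

The printed representation contains no cell values, while the aggregate still gives a sense of the column's distribution.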
It is important to note, however, that these operations still require approval to be executed and therefore cannot be done interactively. Crandas only reveals data when you explicitly ask for it[1] and the engine is authorized to fulfill the request. In certain contexts, the information revealed by these operations may already be sensitive, so the use of these functions must also be approved. Therefore, a fully interactive workflow is not compatible with crandas. On the other hand, it is possible to open intermediate results when designing scripts on dummy data. To get used to analyzing encrypted data, it is a good exercise to reveal data only in the final output step of your analysis.
Table schemas
When uploading data to the engine through the Roseman Labs platform, it is possible to define and enforce certain characteristics that the data must follow. This is especially important because it is hard to clean data that has already been uploaded to the system: cleaning operations would also need to be pre-approved, and we cannot inspect the data manually to check that it was cleaned properly. On the other hand, this rigidity at upload time means we already know what to expect with regard to the structure of the data, potentially including additional information such as minimum and maximum bounds. It also forces the data to be (relatively) clean, as it must be cleaned up and reshaped before it is uploaded into the engine.
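As a rough illustration of what enforcing a schema before upload involves, the sketch below checks rows against column names, types and optional bounds in plain Python. ColumnSpec and validate_rows are hypothetical helpers written for this example. They are not the schema mechanism of the Roseman Labs platform, whose actual format is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical description of one column in a table schema."""
    name: str
    dtype: type
    minimum: Optional[float] = None
    maximum: Optional[float] = None


def validate_rows(rows, schema):
    """Return a list of human-readable problems; empty means the data fits."""
    problems = []
    expected = {spec.name for spec in schema}
    for i, row in enumerate(rows):
        if set(row) != expected:
            problems.append(f"row {i}: columns {sorted(row)} do not match schema")
            continue
        for spec in schema:
            value = row[spec.name]
            if not isinstance(value, spec.dtype):
                problems.append(f"row {i}: {spec.name} is not {spec.dtype.__name__}")
            elif spec.minimum is not None and value < spec.minimum:
                problems.append(f"row {i}: {spec.name} below {spec.minimum}")
            elif spec.maximum is not None and value > spec.maximum:
                problems.append(f"row {i}: {spec.name} above {spec.maximum}")
    return problems


schema = [ColumnSpec("age", int, minimum=0, maximum=120), ColumnSpec("city", str)]
rows = [
    {"age": 34, "city": "Utrecht"},
    {"age": -2, "city": "Leiden"},     # violates the lower bound
    {"age": "41", "city": "Zwolle"},   # wrong type: string instead of int
]
for problem in validate_rows(rows, schema):
    print(problem)
```

Running such a check on the cleartext side, before anything is uploaded, is the last point at which problems like these can be inspected and fixed by hand.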
Domain separation
Crandas works with tabular data, using the same model as pandas. It
also works with Series and
DataFrame. The way you
interact with them is essentially the same, with one important
distinction: you cannot access the data.
Moreover, a pandas DataFrame and a crandas
DataFrame cannot interact
directly. For example, we cannot merge a
DataFrame in the engine
with a pd.DataFrame holding data in the clear. We can, however, upload that
pd.DataFrame using upload_pandas_dataframe() and then merge it in the engine. Of course, this means that
the data from our original pd.DataFrame is no longer accessible in the
merged table, as it has also been encrypted.
Approval
The main difference between using pandas and crandas in practice is the
need for approvals. Because the data found in the engine is highly
sensitive, every disclosure of it must be approved in advance. The
Roseman Labs engine has two modes: design and authorized. Design
mode allows for every query to be executed with no prior approval but
should not be used with any real data, only dummy data that has the
same structure. This mode is used to generate scripts that will be run
in authorized mode after being approved. Authorized mode, on the other
hand, contains the real data and only queries that are approved can be
executed there. This includes the data exploration queries mentioned in
the previous sections. See the Approvals
section for more information about how the approval process works.
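The difference between the two modes can be sketched with a toy model. ToyEngine, its methods and the idea of identifying a script by a SHA-256 fingerprint are all assumptions made for this illustration. The Approvals section describes how the real process works.

```python
import hashlib


class ToyEngine:
    """Hypothetical model of an engine with a design and an authorized mode."""

    def __init__(self, mode):
        if mode not in ("design", "authorized"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self._approved = set()  # fingerprints of approved scripts

    @staticmethod
    def fingerprint(script):
        """Identify a script by the SHA-256 hash of its text."""
        return hashlib.sha256(script.encode("utf-8")).hexdigest()

    def approve(self, script):
        """Stands in for an approver signing off on a recorded script."""
        self._approved.add(self.fingerprint(script))

    def run(self, script):
        # Design mode runs anything; authorized mode checks approval first.
        if self.mode == "authorized" and self.fingerprint(script) not in self._approved:
            raise PermissionError("script has not been approved")
        return f"executed: {script}"


script = "mean(age)"
design = ToyEngine("design")
print(design.run(script))          # dummy data: runs without approval

production = ToyEngine("authorized")
production.approve(script)
print(production.run(script))      # approved, so it runs
try:
    production.run("open(table)")  # never approved
except PermissionError as err:
    print("refused:", err)
```

In this model, changing a single character of an approved script changes its fingerprint, so the modified script would be refused in authorized mode.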
Similarities between crandas and pandas
After all of these differences, you might wonder what is similar about
pandas and crandas. Crandas uses a tabular model similar to that of pandas. It
also works with DataFrames
and Series (although we call ours CSeries).
The function names, signatures and usage are almost the
same as in pandas. You can select a specific column or take a slice of
the table in the same way that you would before.
Many features available in
pandas work the same way in crandas. However, to preserve
privacy, some small differences are
necessary[2]. Most of the time, if you are able to do something in
pandas, the way to do it in crandas will look almost the same.
Similarities with other packages
Beyond being a DataFrame library, crandas can run machine learning algorithms, such as regressions and random forests. Instead of creating an API from scratch, we also mimic other popular Python packages such as scikit-learn (for regressions) or SciPy (for descriptive statistics). All of this is done with the goal of minimizing the friction of switching from cleartext data to encrypted data.
[1] By default, crandas reveals the following values: the number of rows of a table, the name and type of every column and any computation that reveals a single value, such as sum or mean. Anything else requires an explicit opening.
[2] If you want to see an example of this, check out the many-to-one join.