crandas and pandas
pandas is a popular Python package that provides data structures and data analysis tools. To make analysis of encrypted data accessible, we created crandas, a Python package that uses roughly the same syntax as pandas but delegates the computations to the Virtual Data Lake, where the servers perform them obliviously using secure multiparty computation (MPC). As a result, you do not need to worry about the cryptographic details and can analyze data in (almost) the same way as you would with cleartext data.
crandas is a tool for performing computations on encrypted data that resides in the Virtual Data Lake. In principle, all data stays in encrypted (or more precisely, secret-shared) form throughout an analysis; only the desired outcome is revealed, and only to the authorized analyst. The tool allows combining data from multiple sources, and it also supports basic machine learning models such as logistic regression.
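To give some intuition for what "secret-shared" means, here is a minimal sketch of additive secret sharing in plain Python. It is a generic textbook illustration, not the protocol the Virtual Data Lake actually uses; the modulus, the number of servers, and the function names `share`, `reconstruct`, and `add_shared` are all assumptions made up for this example.

```python
import secrets

# Toy modulus chosen for illustration only (an assumption, not a crandas parameter).
MODULUS = 2**61 - 1


def share(value, n_servers=3):
    """Split an integer into n additive shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_servers - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Recombine all shares; any strict subset looks like random numbers."""
    return sum(shares) % MODULUS


def add_shared(a, b):
    """Each server adds its own two shares locally, never seeing a or b."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]


salary_a = share(52000)
salary_b = share(61000)

# Only the final result is reconstructed, not the individual inputs.
total = reconstruct(add_shared(salary_a, salary_b))
print(total)  # 113000
```

Each server holds one share per value, so no single server learns an input, yet the servers can jointly compute on the data and reveal only the outcome.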
If you are already familiar with pandas, working with crandas should feel natural. If not, we will explain everything from scratch in the first steps.
However, because computations are performed under encryption, there are some fundamental differences. Here we give a brief overview of the similarities and differences with pandas.
Similarities between crandas and pandas
crandas works with tabular data, using the same model as pandas. It also works with Series and DataFrames (we call ours CSeries and CDataFrames, though). The way you interact with them is essentially the same: you can select a specific column or take a slice of the table just as you would in pandas. You can also turn any pandas DataFrame into a CDataFrame and combine it with data you are accessing from other sources. Many pandas functionalities work the same way in crandas. However, to preserve privacy, some small differences are necessary [1].
crandas allows for filtering tables, performing operations, data aggregation and even more complex data analyses, like logistic regressions. It is also possible to combine different tables, using standard operations like an inner join. When performing analyses, using crandas feels very similar to using pandas, even if what goes on behind the scenes is completely different.
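As a reference point, these are the pandas idioms for the operations mentioned above: column selection, slicing, filtering, an inner join, and an aggregation. This is plain pandas on made-up cleartext data. The crandas counterparts are not shown because they require a connection to a Virtual Data Lake, and their exact signatures should be checked against the crandas documentation rather than assumed from this sketch.

```python
import pandas as pd

# Made-up example tables (hypothetical column names and values).
patients = pd.DataFrame({"id": [1, 2, 3, 4], "age": [34, 51, 29, 62]})
visits = pd.DataFrame({"id": [2, 3, 3, 5], "cost": [120.0, 80.0, 45.5, 200.0]})

ages = patients["age"]                     # select a column (a Series)
first_two = patients[:2]                   # take a slice of the table
over_30 = patients[patients["age"] > 30]   # filter rows on a condition

# Inner join: keep only ids that appear in both tables.
joined = pd.merge(patients, visits, on="id", how="inner")

# Aggregate over the joined table.
total_cost = joined["cost"].sum()
print(len(over_30), len(joined), total_cost)  # 3 3 245.5
```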
Differences
The first difference is perhaps obvious: as you will be working with private data, you should not be able to inspect the data. Of course, you should be able to extract some information from the data, otherwise the tool would not have much use, but this information should be controlled and minimized for the intended purpose of the analysis.
This is why crandas only reveals data when you explicitly ask for it [2] and the Virtual Data Lake agrees to fulfill the request. In production environments, the servers will only fulfill the request once all data owners agree that the information may be revealed and approve the request. By contrast, in demo environments you are expected to work with dummy data and generally have a blanket authorization for any query, including revealing all input data. Even so, crandas tries to spare you the need to inspect data. To get used to analyzing encrypted data, it is a good exercise to reveal data only in the final output step of your analysis.
The result is that printing a table in crandas only gives you information about its dimensions and columns, not the data itself. Because it is not possible to access the data directly, you must make sure to clean up your data before doing any analysis. This is especially important because only the simplest cleaning operations are possible on data that has already been uploaded to the system, and the data cannot be inspected manually to check whether it was properly cleaned.
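A minimal sketch of what such cleaning might look like in plain pandas, while the data owner can still see the data. The column names, the messy values, and the specific cleaning rules are invented for illustration; the upload step is left out because it needs a live Virtual Data Lake connection.

```python
import pandas as pd

# Hypothetical raw data with stray whitespace, inconsistent case,
# a non-numeric placeholder, and a duplicated row.
raw = pd.DataFrame({
    "name": [" Alice", "bob ", "Carol", "bob "],
    "age": ["34", "n/a", "29", "n/a"],
})

clean = raw.copy()
clean["name"] = clean["name"].str.strip().str.title()        # normalize text
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")  # "n/a" becomes NaN
clean = clean.drop_duplicates()                              # remove repeated rows
clean = clean.dropna(subset=["age"])                         # drop rows without a usable age
clean["age"] = clean["age"].astype("int64")                  # enforce an integer column
clean = clean.reset_index(drop=True)

# Sanity checks are cheap now and impossible to do by eye after upload.
assert clean["age"].between(0, 120).all()
print(clean)
```

Checks like the final assertion are worth running at this stage, since after upload the values can no longer be inspected by eye.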