Tips and tricks
Here you will find some helpful points to make the most out of crandas while ensuring the privacy of your data.
- Use integers instead of strings, as they are more efficient
- Crandas commands should be outside conditional branches
- Avoid uploading the same table multiple times
- Do not upload unnecessary columns to the engine
- Implement proper thresholds on any aggregations and merge operations
- Restarting the kernel often fixes problems
- Do not use placeholders unless the collaboration trust model assumes honest analysis
- Script steps can be omitted or changed in authorized mode
- Computed objects might be removed from cache
- Validate if the engine version matches your crandas installation
- Drop unnecessary columns when they are not being used
- Split up large datasets before joining to prevent out-of-memory issues
Use integers instead of strings, as they are more efficient
Crandas ensures data privacy by processing data through multi-party computation (MPC), a family of mathematical techniques and protocols mostly based on discrete mathematics. As a result, of all the types used in crandas, integers are the closest to the building blocks needed in the backend, which makes them considerably more efficient than any other data type. For example, it takes only one share to represent any integer, but one share per character to represent a string.
Therefore, if you are working with a lot of data, you might want to convert some string columns into integers before uploading them to the engine. This is especially useful if some of your string columns contain categorical data or just a few values that repeat constantly. To do this, take your string column and assign each distinct value a number, starting from zero. Then replace the column in your table with a column of the corresponding integers.
Of course, you are now left with the question of how to recover the information in the strings afterward. The first option is to keep a local copy of the assignment of strings to integers. If you want to share the data with other parties, you can give them that assignment directly. This works for categorical data, but what if you also need to keep those strings private? In that case, upload the integer-string assignment table to the engine as well. You can then do all the analysis on the table with integers, which is more efficient. Once you are done and want to retrieve the data, do a many-to-one join of the two tables and you will have your strings again.
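The encoding step described above can be sketched in plain Python. This is a minimal illustration, not a crandas feature: the helper name encode_categories and the column names are made up for the example, and the same idea is available to pandas users as pandas.factorize.

```python
def encode_categories(values):
    """Give each distinct string an integer, numbering from zero in order of first appearance."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)  # next unused integer
        codes.append(mapping[value])
    return codes, mapping


colors = ["red", "green", "red", "blue", "green"]
codes, mapping = encode_categories(colors)

# The integer column replaces the string column in the table to be uploaded.
# The assignment can be kept locally, or laid out as a two-column table
# (hypothetical column names) and uploaded too if the strings must stay private.
lookup_columns = {"color_id": list(mapping.values()), "color": list(mapping.keys())}

# Recovering the strings locally from the integer codes
reverse = {code: value for value, code in mapping.items()}
decoded = [reverse[code] for code in codes]
```

Numbering in order of first appearance keeps the assignment reproducible as long as the input order is the same; if several parties encode the same categories independently, they would need to agree on one shared assignment first.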
Crandas commands should be outside conditional branches
In design mode, the queries sent to the engine are recorded, and in
authorized mode they must be executed in exactly the same order.
Therefore, no crandas variables or commands should be defined inside a
conditional statement, such as an if. Since the data used in the two
modes is not the same, a conditional statement can follow one path when
recorded in design mode and a different one in authorized mode. If the
recorded branch contains a crandas command that is not executed in
authorized mode, or vice versa, the queries no longer match and an
error is thrown. The following script fails in authorized mode whenever
the branch taken differs from the one taken in design mode:
import crandas as cd
import random

cd.script.record()

merged_numbers = cd.DataFrame(
    {"positive": ["yes", "no"],
     "integer": ["no", "no"],
     "values": [1.1, 2.5]}, auto_bounds=True)

random_number = random.randint(0,100)

if random_number == 0:
    print(f"Random number is {random_number}")
else:
    total_amount = ((merged_numbers['positive'] == 'yes') & (merged_numbers['integer'] == 'yes')).sum()
    if total_amount <= 10:
        print(f"{total_amount}: is less than 10.")
    elif total_amount >= 20:
        print(f"{total_amount}: is over 20.")
To help you avoid such problems, during script recording, crandas will
print a warning when a crandas function is called in a conditional
branch. See check_recording for more
information. For example, the above code produces the following warning:
code.py:16: ConditionalCallDetected: Possible conditional crandas call detected during
script recording. Crandas is called from a conditional 'if' statement at code.py:13.
Add comment #crandas-dontwarn to the conditional statement to disable this warning, or
see 'Crandas commands should be outside conditional branches' in the crandas User Guide
for more information.
This should work fine:
total_amount = ((merged_numbers['positive'] == 'yes') & (merged_numbers['integer'] == 'yes')).sum()

if random_number == 0:
    print(f"Random number is {random_number}")
else:
    if total_amount <= 10:
        print(f"{total_amount}: is less than 10.")
    elif total_amount >= 20:
        print(f"{total_amount}: is over 20.")
Note that in the second case total_amount was defined outside of the
conditional branches.
Avoid uploading the same table multiple times
Each time you execute cd.upload_pandas_dataframe(), a new table is
uploaded with a new handle. Even if you had already uploaded the same
table before, crandas treats it as a new table, so you end up with
several copies of the same table filling up the engine. To avoid
running out of memory, upload each table once and reuse the resulting
table object.
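One way to guard against accidental re-uploads within a session is to remember the table object returned by the first upload and hand it back on later requests. The sketch below is an assumption-laden illustration: upload_once is a hypothetical helper, not part of crandas, and fake_upload merely stands in for the real upload call so that the example runs without an engine.

```python
_uploaded = {}  # name -> table object returned by the first upload


def upload_once(name, data, upload):
    """Call `upload(data)` only the first time `name` is seen; reuse the result afterwards."""
    if name not in _uploaded:
        _uploaded[name] = upload(data)
    return _uploaded[name]


# Stand-in for the real upload function, used here only to count calls.
calls = []


def fake_upload(data):
    calls.append(data)
    return {"handle": len(calls), "data": data}


first = upload_once("sales", {"a": [1, 2, 3]}, fake_upload)
second = upload_once("sales", {"a": [1, 2, 3]}, fake_upload)  # no second upload happens
```

In a real session, the third argument would be the upload function you normally call. The dictionary lives only in the memory of the running kernel, so it does not survive a kernel restart.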
Do not upload unnecessary columns to the engine
Unnecessary columns only increase processing time, since they must be encrypted and decrypted every time a function is applied to the data; the effect is most noticeable on large datasets. Leaving out columns that are not needed improves computation efficiency and also saves storage space.
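As a small sketch of this advice, the columns can be trimmed locally before anything is sent to the engine. The example uses the dictionary-of-lists layout that cd.DataFrame accepts elsewhere in this guide; keep_columns and the column names are hypothetical and only serve the illustration.

```python
def keep_columns(columns, needed):
    """Return a new dict containing only the columns listed in `needed`, in that order."""
    missing = [name for name in needed if name not in columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return {name: columns[name] for name in needed}


raw = {
    "patient_id": [1, 2, 3],
    "age": [34, 51, 29],
    "notes": ["a", "b", "c"],         # free text, not needed for the analysis
    "postcode": ["1011", "3511", "9711"],
}

lean = keep_columns(raw, ["patient_id", "age"])

# Only the lean version would then be uploaded, for example:
# table = cd.DataFrame(lean, auto_bounds=True)
```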
Implement proper thresholds on any aggregations and merge operations
To prevent disclosure of information about individual records being
processed, make sure to implement thresholds on any filter and
aggregation operation (e.g. sum or mean). For more information,
please check the guide for approvers.
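The idea behind a threshold can be illustrated in plain Python: an aggregate is released only when it is computed over at least a minimum number of records, so that a result cannot be traced back to one or two individuals. This is a conceptual sketch with a hypothetical helper name; it does not show how crandas itself enforces thresholds.

```python
def thresholded_mean(values, threshold):
    """Return the mean of `values`, but only if at least `threshold` records contribute."""
    if len(values) < threshold:
        raise ValueError(
            f"only {len(values)} records, at least {threshold} required to release a result"
        )
    return sum(values) / len(values)


large_group = [10, 20, 30, 40]
small_group = [10, 20]

released = thresholded_mean(large_group, threshold=3)  # enough records, result is released

try:
    thresholded_mean(small_group, threshold=3)
    refused = False
except ValueError:
    refused = True  # too few records, nothing is disclosed
```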
Restarting the kernel often fixes problems
Sometimes errors are cached in the kernel, so that even making the necessary changes and re-running the cell does not fix them. Moreover, when saving a script in design mode or running it in authorized mode, it is best to start with a fresh kernel. In these situations, we recommend restarting the kernel and then running the cells again.
Using Jupyter
In Jupyter, this can be done easily using the restart button.
Do not use placeholders unless the collaboration trust model assumes honest analysis
Placeholders like Any() are very useful for automating tasks, but they
should be used with extreme care. They are convenient when an analysis
needs to be run multiple times with a slight variation, or when the
same analysis is performed on tables with different handles. For
this very reason, they should be used (and approved) with discretion. We
recommend not using or approving this functionality unless the trust
level between the involved parties is high enough to allow it, as it
could lead to disclosure risks and data leakage. For more information on
disclosure risks in an analysis, please check the
Guide for Approvers.
Script steps can be omitted or changed in authorized mode
Crandas is designed to work with encrypted data, and its functions are designed to let you use this data while maintaining privacy. Even so, good script design and careful review before approval are needed to ensure that data privacy is actually maintained. Crandas saves scripts in such a way that, in authorized mode, a mismatch error between the approved query and the executed one is raised if a step whose output is the input of a later step is missing. However, if a step is not essential for the continuity of the script, execution continues even if that step fails or is replaced. For example:
y = x.filter(threshold=10)
x.open() # even though x.filter() might fail, this command can still be executed
If the first line fails to execute in authorized mode, or if it is
replaced, the second line will still be executed. In this example, this
will allow the analyst to open x even if the threshold is not met.
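Understanding this concept is important for designing and reviewing scripts correctly, so that they contain no "gaps" that can later be misused. In the example above, opening y (the output of the thresholded step) instead of x would tie the open to the filter: if the filter step is missing or fails, there is no y to open.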
Computed objects might be removed from cache
Objects that are computed (as opposed to uploaded using e.g.
DataFrame) are kept in a cache at the engine server. Typically, these
objects are removed from the cache after a certain amount of time to
reclaim memory.
For example:
>>> import crandas as cd
>>> uploaded = cd.DataFrame({"a": [1,2,3]}, auto_bounds=True) # Will not be purged
>>> computed = uploaded.assign(b = lambda x: x.a + 1)
>>> # ... one week later ...
>>> computed.open()
crandas.errors.ServerError: Object with handle FCBB3266F1914AEFAC7574D25708A57FE396E37115E47E8B9B959AA265DB0AD3 does not exist (HTTP 400 Bad Request, error code: ServerErrorCode.CLIENT_ERROR_HANDLE_NOT_FOUND)
>>> computed = uploaded.assign(b = lambda x: x.a + 1) # Re-compute object
>>> computed.open()
   a  b
0  1  2
1  2  3
2  3  4
>>> computed.save() # Prevent purging
>>> computed.open()
   a  b
0  1  2
1  2  3
2  3  4
In this example, uploaded is an uploaded object. Such an object will
not be purged from the server cache, and typically also remains
available after server restarts.
However, computed is a computed object, computed from uploaded. Such
an object may be purged from the server cache at some point, depending
on how the server is configured. For example, if computed.open() is
run a week after computed has been computed, the object may well have
been removed from the server cache. This can lead to an error message as
shown above. In this case, please re-compute the object by running the
original script.
To prevent purging of an object, it can be saved by using .save().
Calling this method will stop the object from being purged, treating it
in the same way as an upload.
If purging-related errors occur regularly, please contact the server administrator to change the server caching settings.
Verifying your crandas version
If you're experiencing issues with crandas or just want to confirm that you're using the latest version, it's useful to know how to check which version of crandas you're running. This is done with the following:
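One way to check this, assuming crandas was installed as a regular Python distribution named crandas (for example with pip), is to query the package metadata through the standard library. The helper name installed_version is illustrative and not part of crandas.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="crandas"):
    """Return the installed version string of `distribution`, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when the distribution is not found
print(installed_version())
```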
This version should match the version that is running on the engine, which can be checked via the platform.
Drop unnecessary columns when they are not being used
When operating on tables with many columns, it is possible that not all of them will be used for the analysis. In that case, it is often better to work with only the columns that will be used in the analysis, as this uses less memory and may speed up certain operations.
>>> tab
Handle: BDB64F349496E43789ADF6F46FAD89282B99227D48C50AAB360B68C320B62134 (design mode)
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13 col14 ... col37 col38 col39 col40 col41 col42 col43 col44 col45 col46 col47 col48 col49 col50
ctype int int int int int int int int int int int int int int ... int int int int int int int int int int int int int int
Size: 10000 rows x 50 columns (contents secret-shared; data size: 3.8 MiB)
>>> tab_lean = tab[["col1", "col2", "col3", "col7"]]
>>> tab_lean
Handle: 8FCD80DDB842AB5F5737880AE8B5FCFA2022F9AF48D5DA41EAB875AC02B810DF (design mode)
col1 col2 col3 col7
ctype int int int int
Size: 10000 rows x 4 columns (contents secret-shared; data size: 312.5 KiB)
As we can see, the new table occupies less memory.
Split up large datasets before joining to prevent out-of-memory issues
When merging large datasets causes out-of-memory issues in your environment, you might consider splitting up one of the datasets into subsets. You can then merge those individual sub-datasets with the other one and concatenate the results:
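The pattern can be sketched independently of any particular library. In the example below, plain Python lists of dictionaries stand in for tables, and inner_join_on_id and concat_rows stand in for the merge and concatenation functions you would use in practice. All names are hypothetical, and the crandas calls themselves are deliberately not shown.

```python
def split_rows(rows, chunk_size):
    """Split a list of rows into consecutive chunks of at most `chunk_size` rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def merge_in_chunks(left_chunks, right, merge, concat):
    """Merge every chunk with `right` separately, then concatenate the partial results."""
    partial_results = [merge(chunk, right) for chunk in left_chunks]
    return concat(partial_results)


# Stand-ins for real merge and concatenation functions
def inner_join_on_id(left, right):
    right_by_id = {}
    for row in right:
        right_by_id.setdefault(row["id"], []).append(row)
    return [{**l, **r} for l in left for r in right_by_id.get(l["id"], [])]


def concat_rows(parts):
    return [row for part in parts for row in part]


orders = [{"id": i % 3, "amount": 10 * i} for i in range(6)]
customers = [{"id": 0, "name": "Ann"}, {"id": 1, "name": "Bob"}]

chunked = merge_in_chunks(split_rows(orders, 2), customers, inner_join_on_id, concat_rows)
full = inner_join_on_id(orders, customers)  # same join done in one go, for comparison
```

Splitting one side by rows gives the same rows as the unsplit merge for inner joins, and for left joins when the split table is the left one. Other join types (for example outer joins) would need extra care, because unmatched rows of the unsplit table would appear once per chunk.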