Tips and tricks

Here you will find some helpful points to make the most out of crandas while ensuring the privacy of your data.

Use integers instead of strings, as they are more efficient
Crandas commands should be outside conditional branches
Avoid uploading the same table multiple times
Do not upload unnecessary columns to the engine
Implement proper thresholds on any aggregations and merge operations
Restarting the kernel often fixes problems
Do not use placeholders unless the collaboration trust model assumes honest analysis
Script steps can be omitted or changed in authorized mode
Computed objects might be removed from cache
Validate if the engine version matches your crandas installation
Drop unnecessary columns when they are not being used
Split up large datasets before joining to prevent out-of-memory issues

Use integers instead of strings, as they are more efficient

Crandas ensures data privacy by processing data through multi-party computation (MPC). MPC is a family of mathematical techniques and protocols that are mostly based on discrete mathematics. This means that, of all the types used in crandas, integers are the closest to the building blocks needed in the backend. The main consequence is that integers are considerably more efficient than any other data type. For example, it takes only one share to represent any integer, but it takes one share per character of a string.

Therefore, if you are working with a lot of data, you might want to convert some string columns into integers before uploading them to the engine. This is especially useful if some of your string columns contain categorical data or just a few values that repeat often. To do this, take your string column and assign a number, starting from zero, to each distinct value. Then replace the column in your table with a column of the corresponding integers.

Of course, now you are left with the question of how to recover the information in the strings afterward. The first option is to keep a local copy of the assignment of strings to integers. If you want to share the data with other parties, you can give them that assignment directly. This works for categorical data, but what if you also need to keep those strings private? Easy. Just upload the integer-string assignment table to the engine! Now you can do all the analysis on the table with integers, making it more efficient. Once you are done and want to retrieve the data, just do a many-to-one join of the two tables and you will have your strings again!
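The encoding step described above can be sketched locally with plain pandas, before anything is uploaded. This is only an illustration: the column names (city, amount, city_code) and the data are made up, and the final join is done in pandas here merely to show that the assignment table is enough to get the strings back.

```python
import pandas as pd

# Hypothetical local data with a repetitive string column.
df = pd.DataFrame({
    "city": ["Utrecht", "Delft", "Utrecht", "Leiden", "Delft"],
    "amount": [10, 20, 30, 40, 50],
})

# factorize assigns 0, 1, 2, ... to each distinct string,
# in order of first appearance.
codes, uniques = pd.factorize(df["city"])

# Table with the string column replaced by its integer codes.
encoded = df.drop(columns="city").assign(city_code=codes)

# Integer-string assignment table (one row per distinct string).
mapping = pd.DataFrame({"city_code": range(len(uniques)), "city": uniques})

# A many-to-one join of the two tables recovers the original strings.
recovered = encoded.merge(
    mapping, on="city_code", how="left", validate="many_to_one"
)
```

Here encoded is the table you would upload for the analysis, and mapping is the table you would either keep locally, share directly, or upload as well if the strings must stay private.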

Crandas commands should be outside conditional branches

In design mode, the queries sent to the engine are recorded, and in authorized mode these queries must be executed in exactly the same order. Therefore, no crandas variables or commands should be defined inside a conditional statement, such as an if. Since the data used in the two modes is not the same, a conditional statement can follow one path when recorded in design mode and a different one in authorized mode. If the recorded branch contains a crandas command that is not executed in authorized mode, or vice versa, the queries no longer match and an error is thrown. The following script has this problem:

import crandas as cd
import random

cd.script.record()

merged_numbers = cd.DataFrame(
    {"positive": ["yes", "no"],
    "integer": ["no", "no"],
    "values": [1.1, 2.5]}, auto_bounds=True)

random_number = random.randint(0,100)

if random_number == 0:
    print(f"Random number is {random_number}")
else:
    total_amount = ((merged_numbers['positive'] == 'yes') & (merged_numbers['integer'] == 'yes')).sum()
    if total_amount <= 10:
        print(f"{total_amount}: is less than 10.")
    elif total_amount >= 20:
        print(f"{total_amount}: is over 20.")

To help you avoid such problems, during script recording, crandas will print a warning when a crandas function is called in a conditional branch. See check_recording for more information. For example, the above code produces the following warning:

code.py:16: ConditionalCallDetected: Possible conditional crandas call detected during
script recording. Crandas is called from a conditional 'if' statement at code.py:13.
Add comment #crandas-dontwarn to the conditional statement to disable this warning, or
see 'Crandas commands should be outside conditional branches' in the crandas User Guide
for more information.

This should work fine:

total_amount = ((merged_numbers['positive'] == 'yes') & (merged_numbers['integer'] == 'yes')).sum()
if random_number == 0:
    print(f"Random number is {random_number}")
else:
    if total_amount <= 10:
        print(f"{total_amount}: is less than 10.")
    elif total_amount >= 20:
        print(f"{total_amount}: is over 20.")

Note that in the second case total_amount was defined outside of the conditional branches.

Avoid uploading the same table multiple times

Each time you execute cd.upload_pandas_dataframe(), a new table is uploaded with a new handle. Even if you have already uploaded the same table before, crandas treats it as a new table, so you end up with several copies of the same table filling up the engine. Keep this in mind to avoid running out of memory.
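One way to guard against accidental re-uploads within a single Python session is to remember what has already been uploaded. The helper below is a hypothetical sketch, not part of crandas: upload_once and frame_fingerprint are names invented here, and the upload function is passed in as an argument so that the idea can be tried without an engine.

```python
import hashlib

import pandas as pd

# Content fingerprint -> object returned by the upload function.
# This cache only lives as long as the Python session (or kernel).
_uploaded = {}


def frame_fingerprint(df: pd.DataFrame) -> str:
    """Hash the column names, dtypes, index, and values of a DataFrame."""
    h = hashlib.sha256()
    h.update(repr(list(df.columns)).encode())
    h.update(repr([str(t) for t in df.dtypes]).encode())
    h.update(pd.util.hash_pandas_object(df, index=True).values.tobytes())
    return h.hexdigest()


def upload_once(df: pd.DataFrame, upload):
    """Call upload(df) only if an identical DataFrame was not uploaded before."""
    key = frame_fingerprint(df)
    if key not in _uploaded:
        _uploaded[key] = upload(df)
    return _uploaded[key]
```

Assuming an active connection, this could be used as upload_once(df, cd.upload_pandas_dataframe); any extra upload arguments would have to be bound first, for example with a lambda.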

Do not upload unnecessary columns to the engine

Unnecessary columns only increase the processing time, as they must be encrypted and decrypted every time a function is performed on the data. The effect is most noticeable when the datasets are large. Not uploading columns that are not needed improves the efficiency of computations on the data and also saves storage space.
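In practice this means selecting the columns locally, in pandas, before the upload call. A minimal sketch with made-up column names:

```python
import pandas as pd

# Hypothetical local table; only two of its columns are needed for the analysis.
raw = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [34, 51, 29],
    "notes": ["a", "b", "c"],
    "internal_ref": ["x1", "x2", "x3"],
})

needed = ["patient_id", "age"]

# Keep only the needed columns; this smaller frame is what gets uploaded.
to_upload = raw[needed].copy()
```

The resulting to_upload frame is then the one to pass to the upload function, instead of raw.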

Implement proper thresholds on any aggregations and merge operations

To prevent disclosure of information about individual records being processed, make sure to implement thresholds on all filter and aggregation operations (e.g. sum or mean). For more information, please check the Guide for Approvers.

Restarting the kernel often fixes problems

Sometimes errors are cached in the kernel, and even making the necessary changes and re-running the cell won't fix them. Moreover, when saving a script in design mode or running it in authorized mode, it's best to start with a fresh kernel. For these reasons, we recommend restarting the kernel in these situations and then running the cells again.

Using Jupyter

In Jupyter, this can be done easily using the restart button.

Do not use placeholders unless the collaboration trust model assumes honest analysis

Placeholders like Any() are very useful for automating tasks, but they should be used with extreme care. In some cases, it might be necessary to run an analysis multiple times with a slight variation, or the same analysis might be performed on tables with different handles. But for this very reason placeholders should be used (and approved) with discretion. We recommend not using or approving this functionality unless the trust level between the involved parties is high enough to allow it, as it could lead to disclosure risks and data leakage. For more information on disclosure risks in an analysis, please check the Guide for Approvers.

Script steps can be omitted or changed in authorized mode

Crandas is designed to work with encrypted data, and its functions are designed to facilitate the use of this data while maintaining privacy. Even so, good script design and appropriate supervision before approval are needed to ensure that data privacy is maintained correctly. Crandas saves scripts in such a way that, in authorized mode, a mismatch error between the approved query and the one being executed is generated if a step whose output is the input of another method is missing. However, if a step is not essential for the continuity of the script, the execution continues even if that step fails or is replaced. For example:

y = x.filter(threshold=10)
x.open()  # even though x.filter() might fail, this command can still be executed

If the first line fails to execute in authorized mode, or if it is replaced, the second line will still be executed. In this example, this will allow the analyst to open x even if the threshold is not met.

Understanding this concept is important to ensure the correct design and supervision of scripts, thus preventing "gaps" in them that can later be misused.

Computed objects might be removed from cache

Objects that are computed (as opposed to uploaded, e.g. using DataFrame) are kept in a cache on the engine server. Typically, these objects are removed from the cache after a certain amount of time to reclaim memory.

For example:

>>> import crandas as cd
>>> uploaded = cd.DataFrame({"a": [1,2,3]}, auto_bounds=True) # Will not be purged
>>> computed = uploaded.assign(b = lambda x: x.a + 1)
>>> # ... one week later ...
>>> computed.open()
crandas.errors.ServerError: Object with handle FCBB3266F1914AEFAC7574D25708A57FE396E37115E47E8B9B959AA265DB0AD3 does not exist (HTTP 400 Bad Request, error code: ServerErrorCode.CLIENT_ERROR_HANDLE_NOT_FOUND)
>>> computed = uploaded.assign(b = lambda x: x.a + 1)        # Re-compute object
>>> computed.open()
   a  b
0  1  2
1  2  3
2  3  4
>>> computed.save()                                          # Prevent purging
>>> computed.open()
   a  b
0  1  2
1  2  3
2  3  4

In this example, uploaded is an uploaded object. Such an object will not be purged from the server cache, and typically also remains available after server restarts.

However, computed is a computed object, computed from uploaded. Such an object may be purged from the server cache at some point, depending on how the server is configured. For example, if computed.open() is run a week after computed has been computed, the object may well have been removed from the server cache. This can lead to an error message as shown above. In this case, please re-compute the object by running the original script.

To prevent purging of an object, it can be saved by using .save(). Calling this method will stop the object from being purged, treating it in the same way as an upload.

If purging-related errors occur regularly, please contact the server administrator to change the server caching settings.

Verifying your crandas version

If you're experiencing issues with crandas or just want to confirm that you're using the latest version, it's useful to know how to check which version of crandas you're running. This is done with the following:

from crandas.base import session

session.version

This version should match the version that is running on the engine, which can be checked via the platform.

Drop unnecessary columns when they are not being used

When operating on tables with many columns, it is possible that not all of them will be used for the analysis. In that case, it is often better to work with only the columns that will be used in the analysis, as this uses less memory and may speed up certain operations.

>>> tab
Handle: BDB64F349496E43789ADF6F46FAD89282B99227D48C50AAB360B68C320B62134 (design mode)
    col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13 col14  ... col37 col38 col39 col40 col41 col42 col43 col44 col45 col46 col47 col48 col49 col50
ctype  int  int  int  int  int  int  int  int  int   int   int   int   int   int  ...   int   int   int   int   int   int   int   int   int   int   int   int   int   int
Size: 10000 rows x 50 columns (contents secret-shared; data size: 3.8 MiB)

>>> tab_lean = tab[["col1", "col2", "col3", "col7"]]
>>> tab_lean
Handle: 8FCD80DDB842AB5F5737880AE8B5FCFA2022F9AF48D5DA41EAB875AC02B810DF (design mode)
    col1 col2 col3 col7
ctype  int  int  int  int
Size: 10000 rows x 4 columns (contents secret-shared; data size: 312.5 KiB)

As we can see, the new table occupies less memory.

Split up large datasets before joining to prevent out-of-memory issues

When merging large datasets causes out-of-memory issues in your environment, you might consider splitting up one of the datasets into subsets. You can then merge each subset with the other dataset and concatenate the results:

# Merging the full tables at once causes OutOfMemoryError
merge = cd.merge(cdf1, cdf2, on=['col1'], how='inner')

# Split up cdf1 into two slices
cdf1_slice1 = cdf1[:50000]
cdf1_slice2 = cdf1[50000:]

# Merge each slice with cdf2 separately, then concatenate the results
m1 = cd.merge(cdf1_slice1, cdf2, on=['col1'], how='inner')
m2 = cd.merge(cdf1_slice2, cdf2, on=['col1'], how='inner')
merge = cd.concat([m1, m2])
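The reason this works for an inner join is that every row of the first table ends up in exactly one slice, so each matching pair of rows is produced exactly once. The same property can be checked locally with plain pandas. The sketch below uses small made-up tables and a hypothetical chunk_size; it illustrates the principle and is not crandas code.

```python
import pandas as pd

left = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6], "x": list("abcdef")})
right = pd.DataFrame({"col1": [2, 4, 6, 8], "y": [20, 40, 60, 80]})

# Reference result: merge the full tables at once.
full = pd.merge(left, right, on=["col1"], how="inner")

# Split the left table into chunks, merge each chunk, concatenate the results.
chunk_size = 2
parts = [
    pd.merge(left[i:i + chunk_size], right, on=["col1"], how="inner")
    for i in range(0, len(left), chunk_size)
]
chunked = pd.concat(parts, ignore_index=True)


def normalized(df: pd.DataFrame) -> pd.DataFrame:
    """Sort rows and reset the index so that row order does not matter."""
    return df.sort_values(list(df.columns)).reset_index(drop=True)


same = normalized(full).equals(normalized(chunked))
```

Note that this reasoning applies to inner joins (and to left joins when the left table is the one being split). For right or outer joins, unmatched rows of the other table would appear once per slice, so the concatenated result would differ.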