Tips and tricks

Here you will find some helpful points to make the most out of crandas while ensuring the privacy of your data.

  1. Use integers instead of strings, as they are more efficient.

  2. Crandas commands should be outside conditional branches.

  3. Avoid uploading the same table multiple times.

  4. Do not upload unnecessary columns to the Virtual Data Lake.

  5. Implement proper thresholds on any aggregations and merge operations.

  6. Verify which version of crandas you are using in design & production, to make sure they are the same.

  7. Restarting the kernel often fixes problems.

  8. Do not use placeholders unless the collaboration trust model assumes honest analysis.

  9. Script steps can be omitted or changed in production.

  10. Be careful when referencing outcomes of previous steps in a script.

  11. Computed objects might be removed from cache

1. Use integers instead of strings, as they are more efficient.

Crandas ensures data privacy by processing data through multi-party computation (MPC). MPC is a family of mathematical techniques and protocols that are mostly based on discrete mathematics. As a result, of all the types used in crandas, integers are the closest to the building blocks needed in the backend, which makes them considerably more efficient than any other data type. For example, it takes only one share to represent any integer, but one share per character to represent a string.

Therefore, if you are working with a lot of data, you might want to convert some string columns into integers before uploading them to the Virtual Data Lake. This is especially useful if some of your string columns contain categorical data or just a few values that repeat constantly. To do this, take your string column and assign each distinct value a number, starting from zero. Then replace the column in your table with a column of the corresponding integers.

Of course, this leaves the question of how to recover the information in the strings afterward. The first option is to keep a local copy of the assignment of strings to integers. If you want to share the data with other parties, you can give them that assignment directly. This works for categorical data, but what if you also need to keep those strings private? In that case, upload the integer-string assignment table to the Virtual Data Lake as well. You can then do all the analysis on the table with integers, which is more efficient. Once you are done and want to retrieve the data, do a left join of the two tables and you will have your strings again.
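The encoding step described above can be sketched locally in plain Python, before anything is uploaded. The helper name `encode_categories` and the sample values are illustrative assumptions and not part of crandas; if you work with pandas, `pandas.factorize` produces the same kind of integer codes.

```python
def encode_categories(values):
    """Map each distinct string to an integer, numbering from zero
    in order of first appearance."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


# Illustrative categorical column (made-up values).
colours = ["red", "blue", "red", "green", "blue"]
codes, mapping = encode_categories(colours)

# `codes` replaces the string column in the table that gets uploaded.
# `mapping` is the string-to-integer assignment: keep it locally, or
# turn it into a second two-column table if the strings must stay private.
lookup_rows = [{"code": code, "colour": name} for name, code in mapping.items()]

# Recovering the strings afterwards is a lookup on the code, which is
# what the left join on the assignment table achieves.
inverse = {code: name for name, code in mapping.items()}
decoded = [inverse[code] for code in codes]
```

Numbering in order of first appearance keeps the codes dense (0, 1, 2, ...), so the assignment table has exactly one row per distinct string.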

2. Crandas commands should be outside conditional branches.

Queries to the Virtual Data Lake are recorded in a fixed order in design and must be performed in exactly the same order in production. For this reason, no crandas commands or variables should be declared inside a conditional statement (e.g. an if). Because the data used in design and production differs, a conditional can follow one path when recorded in design and a different one in production. If the recorded branch contains a command that is not executed in production, or vice versa, the queries no longer match. The following will throw an error if the branch followed in production is different from the one followed in design:

if random_number == 0:
    print(f"Random number is {random_number}")
else:
    total_amount = ((merged_numbers['positive'] == 'yes') & (merged_numbers['integer'] == 'yes')).sum()
    if total_amount <= 10:
        print(f"{total_amount}: is less than 10.")
    elif total_amount >= 20:
        print(f"{total_amount}: is over 20.")

This should work fine:

total_amount = ((merged_numbers['positive'] == 'yes') & (merged_numbers['integer'] == 'yes')).sum()
if random_number == 0:
    print(f"Random number is {random_number}")
else:
    if total_amount <= 10:
        print(f"{total_amount}: is less than 10.")
    elif total_amount >= 20:
        print(f"{total_amount}: is over 20.")

Note that in the second case total_amount was defined outside of the conditional branches.

3. Avoid uploading the same table multiple times.

Each time you execute cd.upload_pandas_dataframe, a new table is uploaded with a new handle. Even if you have already uploaded the same table before, crandas treats it as a new table, so you end up with several copies of the same table filling up the Virtual Data Lake. Keep this in mind to avoid running out of memory.
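One way to avoid accidental re-uploads is to remember the handle of each table you have uploaded and reuse it. The sketch below is a hypothetical local helper and not a crandas feature: `upload` stands for whatever call performs the upload and returns a handle, `UploadRegistry` and `fingerprint` are made-up names, and the demonstration uses a stand-in uploader that only counts calls.

```python
import hashlib
import json


def fingerprint(rows):
    """Stable hash of a table given as a list of dicts (illustrative only)."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class UploadRegistry:
    """Remembers which tables were uploaded so each is uploaded only once."""

    def __init__(self, upload):
        self._upload = upload   # callable: rows -> handle (assumed interface)
        self._handles = {}      # fingerprint -> handle

    def upload_once(self, rows):
        key = fingerprint(rows)
        if key not in self._handles:
            self._handles[key] = self._upload(rows)
        return self._handles[key]


# Stand-in uploader for demonstration; it records each call it receives.
calls = []


def fake_upload(rows):
    calls.append(rows)
    return f"HANDLE-{len(calls)}"


registry = UploadRegistry(fake_upload)
table = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
first = registry.upload_once(table)
second = registry.upload_once(table)  # same content, so no second upload
```

In a real session, a stored handle can be passed to cd.get_table to retrieve the table that is already in the Virtual Data Lake instead of uploading it again.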

4. Do not upload unnecessary columns to the Virtual Data Lake.

Unnecessary columns only increase processing time, especially for large datasets, as they must be encrypted and decrypted every time a function is performed on the data. Leaving out columns that are not needed improves the efficiency of computations on the data and also saves storage space.

5. Implement proper thresholds on any aggregations and merge operations.

To prevent disclosure of information about individual records being processed, make sure to implement thresholds on any filter and aggregation operations (e.g. sum or mean). For more information, please check the guide for approvers.
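To illustrate the idea independently of crandas, the sketch below releases a mean only when enough records contribute to it. The function name, the exception class, the default threshold of 10, and the sample values are all assumptions made for illustration; in crandas the threshold is passed to the operation itself, as in the x.filter(threshold=10) example in tip 9.

```python
class ThresholdNotMet(Exception):
    """Raised when too few records contribute to an aggregate."""


def thresholded_mean(values, threshold=10):
    """Return the mean only if at least `threshold` records are present."""
    if len(values) < threshold:
        raise ThresholdNotMet(
            f"only {len(values)} records, at least {threshold} required"
        )
    return sum(values) / len(values)


# Made-up sample data: 11 records.
salaries = [3100, 2900, 4050, 3800, 3300, 2950, 3600, 4100, 3700, 3500, 3250]
print(thresholded_mean(salaries))       # enough records: the mean is released

try:
    thresholded_mean(salaries[:2])      # 2 records: the mean would reveal too much
except ThresholdNotMet as error:
    print(f"refused: {error}")
```

With only two records, anyone who knows one of the values could derive the other from the mean, which is the kind of disclosure a threshold is meant to prevent.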

6. Verify if crandas versions match in design/production

It is good practice to confirm that you are running the same version of crandas in design and production. If they are misaligned, this may cause problems as some versions might not be compatible with each other.

To check the version of crandas, type the following in the terminal:

pip show crandas

The version of crandas should also match the version of the Virtual Data Lake session that you are connected to. You can check the version of the Virtual Data Lake from your local Python editor with the command below, which can be executed in design and production outside of a recorded script.

from crandas.base import session
session.version
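The locally installed package version can also be read from Python using the standard library, which makes it easy to compare against the version agreed for design and production. This is a minimal sketch: the helper names `installed_version` and `check_version` are assumptions, and the expected version string is something you supply yourself.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_version(package, expected):
    """Raise if the installed version differs from the agreed one."""
    found = installed_version(package)
    if found != expected:
        raise RuntimeError(
            f"{package}: expected version {expected!r}, found {found!r}"
        )
    return found


# Report what is installed locally (None if crandas is not installed).
print("crandas:", installed_version("crandas"))
```

Calling `check_version("crandas", ...)` with the agreed version at the top of a notebook makes a mismatch fail early, before any query is recorded or executed.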

7. Restarting the kernel often fixes problems.

Sometimes errors are cached in the kernel, and even making the necessary changes and re-running the cell won't fix them. Moreover, when saving a script in design or running it in production, it's best to start with a fresh kernel. For these reasons, we recommend restarting the kernel in these situations and then running the cells again. In Jupyter, this can be done easily using the restart button.

8. Do not use placeholders unless the collaboration trust model assumes honest analysis.

Placeholders like Any() are very useful for automating tasks, but they should be used with extreme care. In some cases, you may need to run an analysis multiple times with a slight variation, or perform the same analysis on tables with different handles. Placeholders make this possible, and for this very reason they should be used (and approved) with discretion. We recommend not using or approving this functionality unless the trust level between the involved parties is high enough to allow it, as it could lead to disclosure risks and data leakage. For more information on disclosure risks in an analysis, please check the guide for approvers.

9. Script steps can be omitted or changed in production.

Crandas is designed to work with encrypted data, and its functions are designed to let you use this data while maintaining privacy. Even so, good script design and appropriate review before approval are needed to ensure that data privacy is actually maintained. Crandas saves scripts in such a way that, in production, if a step whose output is the input of another method is missing, a mismatch error is raised between the approved query and the one being executed. However, if a step is not essential for the continuity of the script, execution continues even if that step fails or is replaced. For example:

y = x.filter(threshold=10)
x.open()  # even though x.filter() might fail, this command can still be executed

If the first line fails to execute in production, or if it is replaced, the second line will still be executed. In this example, this will allow the analyst to open x even if the threshold is not met.

Understanding this concept is important to ensure the correct design and supervision of scripts, thus preventing “gaps” in them that can later be misused.

10. Be careful when referencing outcomes of previous steps in a script.

In a crandas script, you can reference either objects in the Virtual Data Lake or specific values, but not results of operations or arbitrary values. If you refer to the result of an earlier operation in a script, or to an input from the analyst, the actual value of that result on the dummy data will be recorded.

Suppose we want to perform the same filter on two different datasets. The value it is filtered on does not matter, as long as the filter is identical for both tables. As an example, take the code snippet below:

tab1 = cd.get_table("39EF406EFB07023D06A4D0051AC42F4FFDAD5CF7506560E74CDA12117C4D2433")
tab2 = cd.get_table("4FB1F73C7EE610521F480A767E28B4E867D7F71D24BDA550C1C6AE0FD796233A")

val = 10 ## this could also be the result of a previous operation, e.g. tab2["col1"].mean()

# these filters will be recorded as `filter1 = tab1[tab1["col"] < 10]`
filter1 = tab1[tab1["col"] < val]
filter2 = tab2[tab2["col"] < val]

If we record this script as is, the filter is only allowed for val == 10. To overcome this, we can use a placeholder. When this script is recorded, the filter will be allowed on any value that is assigned to val in production:

tab1 = cd.get_table("39EF406EFB07023D06A4D0051AC42F4FFDAD5CF7506560E74CDA12117C4D2433")
tab2 = cd.get_table("4FB1F73C7EE610521F480A767E28B4E867D7F71D24BDA550C1C6AE0FD796233A")

# When this script is recorded, any value can be assigned to `val` in production
val = cd.placeholders.Any(10)

filter1 = tab1[tab1["col"] < val]
filter2 = tab2[tab2["col"] < val]

Still, this will not guarantee that the value being filtered on is the same for both filters. The production script could be manipulated as follows:

tab1 = cd.get_table("39EF406EFB07023D06A4D0051AC42F4FFDAD5CF7506560E74CDA12117C4D2433")
tab2 = cd.get_table("4FB1F73C7EE610521F480A767E28B4E867D7F71D24BDA550C1C6AE0FD796233A")

# When this script is recorded, any value can be assigned to `val` in production
val = cd.placeholders.Any(10)

filter1 = tab1[tab1["col"] < val]

# hack
val.value = 20
filter2 = tab2[tab2["col"] < val]

There are ways in which these types of scripts can be properly implemented. Please reach out to us when you need help on this!

11. Computed objects might be removed from cache

Objects that are computed (as opposed to uploaded, e.g. using crandas.crandas.DataFrame()) are kept in a cache at the VDL server. Depending on how the VDL server is configured, such objects may be removed from the cache by the server(s) to reclaim memory.

For example:

>>> import crandas as cd
>>> uploaded = cd.DataFrame({"a": [1,2,3]}, auto_bounds=True) # Will not be purged
>>> computed = uploaded.assign(b = lambda x: x.a + 1)
>>> computed.open()                                           # May be purged
RuntimeError: ("The PRSS keys that we tried to use were not present at the server (maybe they have been purged from the server cache?). We've reset the keys and they will be re-initialized, please retry the last command.", 'The PRSS keys used for table uploads were not found (handle 2E6BA04C39AA38C660E24DC395FDB5EBDAF169F3324D79D2F64C27C38775B260). Try re-initializing them.', 8)
>>> computed.open()
crandas.errors.ServerError: Object with handle FCBB3266F1914AEFAC7574D25708A57FE396E37115E47E8B9B959AA265DB0AD3 does not exist (HTTP 400 Bad Request, error code: ServerErrorCode.CLIENT_ERROR_HANDLE_NOT_FOUND)
>>> computed = uploaded.assign(b = lambda x: x.a + 1)        # Re-compute object
>>> computed.open()
   a  b
0  1  2
1  2  3
2  3  4
>>> computed.save()                                          # Prevent purging
>>> computed.open()
   a  b
0  1  2
1  2  3
2  3  4

In this example, uploaded is an uploaded object. Such an object will not be purged from the server cache, and typically also remains available after server restarts.

However, computed is a computed object, computed from uploaded. Such an object may be purged from the server cache at some point, depending on how the server is configured. For example, if computed.open() is run a week after computed has been computed, the object may well have been removed from the server cache. This can lead to two types of error message:

  • In the first call to computed.open() in the example above, the upload/download keys were purged from the server cache. As the error indicates, crandas recovers from this error automatically by re-initializing the PRSS system used for uploading/downloading.

  • In the second call to computed.open() in the example above, the computation result itself was removed from the server cache. In this case, please re-compute the object by running the original script.

To prevent purging of an object, it can be saved by using stateobject.StateObject.save(). Calling this method will stop the object from being purged, treating it in the same way as an upload.

If purging-related errors occur regularly, please contact the server administrator to change the server caching settings.
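For scripts that may run long after an object was computed, the retry and re-compute steps described above can be wrapped in a small helper. This is a sketch under stated assumptions, not a crandas feature: `open_with_recompute` is a hypothetical name, `compute` is any zero-argument callable that rebuilds the object, and the exception types treated as purge-related are passed in by the caller. The demonstration uses a stand-in object rather than a real VDL connection.

```python
def open_with_recompute(compute, retry_on=(RuntimeError,), attempts=3):
    """Open a computed object, recomputing it when opening fails.

    compute  -- zero-argument callable that rebuilds the object
    retry_on -- exception types treated as purge-related (an assumption)
    attempts -- maximum number of compute-and-open attempts
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return compute().open()
        except retry_on as error:
            last_error = error  # purged keys or purged result: try again
    raise last_error


class FakeComputed:
    """Stand-in for a computed object; fails twice, then opens normally."""

    failures_left = 2

    def open(self):
        if FakeComputed.failures_left > 0:
            FakeComputed.failures_left -= 1
            raise RuntimeError("object was purged from the server cache")
        return {"a": [1, 2, 3], "b": [2, 3, 4]}


result = open_with_recompute(lambda: FakeComputed())
```

In a crandas session, `compute` could be something like `lambda: uploaded.assign(b=lambda x: x.a + 1)`, and `retry_on` could include crandas.errors.ServerError, the exception shown in the example output above. Saving the object with save() remains the direct way to prevent purging in the first place.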