Tips and tricks¶
Here you will find some helpful points to make the most out of crandas while ensuring the privacy of your data.
Implement proper thresholds on any aggregations and merge operations
Do not use placeholders unless the collaboration trust model assumes honest analysis
Be careful when referencing outcomes of previous steps in a script
Validate if the engine version matches your crandas installation
Use integers instead of strings, as they are more efficient¶
Crandas protects data privacy by processing data through multi-party computation (MPC). MPC is a collection of mathematical techniques and protocols that are mostly based on discrete mathematics. As a result, of all the types used in crandas, integers are the closest to the building blocks needed in the backend, which makes them considerably more efficient than any other data type. For example, it takes only one share to represent any integer, but one share per character to represent a string.
Therefore, if you are working with a lot of data, you might want to convert some string columns into integers before uploading them to the engine. This is especially useful if some of your string columns contain categorical data or just a few entries that repeat constantly. To do this, simply take your string column and assign a number starting from zero to each new entry. Now replace the column in your table with a column with the respective integers.
Of course, this leaves the question of how to recover the strings afterward. The first option is to keep a local copy of the assignment of strings to integers. If you want to share the data with other parties, you can give them that assignment directly. This works for categorical data, but what if the strings themselves also need to stay private? In that case, upload the integer-string assignment table to the engine as well. You can then run the whole analysis on the table with integers, which is more efficient. Once you are done and want to retrieve the strings, do a many_to_one join of the two tables.
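The encoding step described above can be sketched in plain Python, before anything is uploaded. This is a minimal illustration: the helper name encode_categories and the sample city names are assumptions made for the example, not part of crandas.

```python
def encode_categories(values):
    """Map each distinct string to an integer, numbering from zero
    in order of first appearance. Returns (codes, mapping)."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)  # next unused integer
        codes.append(mapping[value])
    return codes, mapping


# Hypothetical string column with a few repeating entries
cities = ["Delft", "Utrecht", "Delft", "Breda", "Utrecht"]
codes, mapping = encode_categories(cities)

# The assignment table: keep it locally, share it, or upload it as a second table
lookup = [{"code": code, "city": city} for city, code in mapping.items()]

# Recovering the strings locally from the integer codes
labels = {code: city for city, code in mapping.items()}
decoded = [labels[code] for code in codes]
```

The list codes would replace the string column in the table that is uploaded, while lookup plays the role of the integer-string assignment table. With pandas, pandas.factorize produces the same kind of codes for a column.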
Crandas commands should be outside conditional branches¶
In design mode, the queries sent to the engine are recorded, and these queries must be executed in exactly the same order in authorized mode. Therefore, no crandas variables or commands should be defined inside a conditional statement, such as an if.
Since the data used in the two modes is not the same, a conditional statement can follow one path when recorded in design mode and a different path in authorized mode. If the recorded branch contains a crandas command that is not executed in authorized mode, or vice versa, the executed queries no longer match the recorded ones.
This will throw an error if the branch followed in authorized mode is different from the one followed in design mode:
import crandas as cd
import random

cd.script.record()

merged_numbers = cd.DataFrame(
    {"positive": ["yes", "no"],
     "integer": ["no", "no"],
     "values": [1.1, 2.5]}, auto_bounds=True)

random_number = random.randint(0,100)

if random_number == 0:
    print(f"Random number is {random_number}")
else:
    total_amount = ((merged_numbers['positive'] == 'yes') & (merged_numbers['integer'] == 'yes')).sum()
    if total_amount <= 10:
        print(f"{total_amount}: is less than 10.")
    elif total_amount >= 20:
        print(f"{total_amount}: is over 20.")
To help you avoid such problems, during script recording, crandas will print a warning when a crandas function is called in a conditional branch. See crandas.check_recording for more information. For example, the above code produces the following warning:
code.py:16: ConditionalCallDetected: Possible conditional crandas call detected during
script recording. Crandas is called from a conditional 'if' statement at code.py:13.
Add comment #crandas-dontwarn to the conditional statement to disable this warning, or
see 'Crandas commands should be outside conditional branches' in the crandas User Guide
for more information.
This should work fine:
total_amount = ((merged_numbers['positive'] == 'yes') & (merged_numbers['integer'] == 'yes')).sum()
if random_number == 0:
    print(f"Random number is {random_number}")
else:
    if total_amount <= 10:
        print(f"{total_amount}: is less than 10.")
    elif total_amount >= 20:
        print(f"{total_amount}: is over 20.")
Note that in the second case total_amount was defined outside of the conditional branches.
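To see why a branch-dependent query breaks the replay, the following toy model may help. It is a sketch under stated assumptions: ToyRecorder, analysis and the query names are hypothetical stand-ins and do not reflect how crandas implements recording internally. Queries are recorded in order in "design mode", and every query in "authorized mode" must match the next recorded one.

```python
class ToyRecorder:
    """Hypothetical record-then-replay checker (not crandas code)."""

    def __init__(self):
        self.recorded = []      # queries seen while recording
        self.position = 0       # index of the next query expected during replay
        self.replaying = False

    def start_replay(self):
        self.replaying = True
        self.position = 0

    def query(self, name):
        if not self.replaying:
            self.recorded.append(name)
            return
        expected = self.recorded[self.position] if self.position < len(self.recorded) else None
        if name != expected:
            raise RuntimeError(f"query {name!r} does not match recorded query {expected!r}")
        self.position += 1


def analysis(recorder, take_branch):
    recorder.query("upload")
    if take_branch:
        recorder.query("sum")   # query hidden inside a conditional branch
    recorder.query("open")


recorder = ToyRecorder()
analysis(recorder, take_branch=False)      # recording: the branch is skipped
recorder.start_replay()
try:
    analysis(recorder, take_branch=True)   # replay: the branch is taken
    mismatch = None
except RuntimeError as error:
    mismatch = str(error)
print(mismatch)
```

In this toy model the replay fails at the first query that was not part of the recording. Moving the "sum" query above the if statement, so that it always runs, makes both runs issue the same sequence of queries.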
Avoid uploading the same table multiple times¶
Each time you execute cd.upload_pandas_dataframe, a new table is uploaded with a new handle. Even if you have already uploaded the same table before, crandas treats it as a new table, so you end up with several copies of the same table filling up the engine.
Keep this in mind to avoid running out of memory on the engine.
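One way to avoid repeated uploads is to remember the handle returned by the first upload and reuse it afterwards. The sketch below is only an illustration of that idea: upload_once, the handles dictionary and fake_upload are hypothetical names, and the upload function is passed in so the example runs without an engine.

```python
def upload_once(name, data, upload, handles):
    """Upload `data` only if no handle is stored under `name` yet.

    `upload` is any function that uploads the data and returns a handle;
    `handles` is a dict that maps a chosen name to the stored handle.
    """
    if name not in handles:
        handles[name] = upload(data)
    return handles[name]


# Stand-in for a real upload, used here only to count how often it is called
calls = []

def fake_upload(data):
    calls.append(data)
    return f"HANDLE-{len(calls)}"


handles = {}
first = upload_once("patients", [1, 2, 3], fake_upload, handles)
second = upload_once("patients", [1, 2, 3], fake_upload, handles)
print(first, second, len(calls))
```

In practice, upload could be a small wrapper around cd.upload_pandas_dataframe that returns the handle of the new table, and the stored handle could later be passed to cd.get_table. Persisting handles between sessions (for example in a local file) is left out of this sketch.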
Do not upload unnecessary columns to the engine¶
Unnecessary columns only increase processing time, as they must be encrypted and decrypted every time a function is performed on the data. The effect is most noticeable when the datasets are large. Leaving out columns that are not needed improves the efficiency of computations on the data and also saves storage space.
Implement proper thresholds on any aggregations and merge operations¶
To prevent disclosure of information about individual records being processed, make sure to implement thresholds on every filter and aggregation operation (e.g. sum or mean). For more information, please check the guide for approvers.
Restarting the kernel often fixes problems¶
Sometimes errors are cached in the kernel, and even making the necessary changes and re-running the cell won’t fix the error. Moreover, when saving a script in design mode or running it in authorized mode, it’s best to start with a fresh kernel. For these reasons, we recommend restarting the kernel in these situations and then running the cells again. In Jupyter, this can be done easily using the restart button.
Do not use placeholders unless the collaboration trust model assumes honest analysis¶
Placeholders like Any() are very useful for automating tasks, but they should be used with extreme care.
In some cases, an analysis needs to be run multiple times with a slight variation, or the same analysis needs to be performed on tables with different handles.
A placeholder makes this possible, but for this very reason it should be used (and approved) with discretion.
We recommend not using or approving this functionality unless the trust level between the involved parties is high enough to allow it, as it could lead to disclosure risks and data leakage.
For more information on disclosure risks in an analysis, please check the guide for approvers.
Be careful when referencing outcomes of previous steps in a script¶
In a crandas script, you can either reference objects in the engine or specific values, but not results of operations or arbitrary values. In case you refer to the result of an earlier operation in a script or to an input of the analyst, the actual value of that result on the dummy data will be recorded.
Suppose we want to perform the same filter on two different datasets. The value it is filtered on does not matter, as long as the filter is identical for both tables. As an example, take the code snippet below:
tab1 = cd.get_table("39EF406EFB07023D06A4D0051AC42F4FFDAD5CF7506560E74CDA12117C4D2433")
tab2 = cd.get_table("4FB1F73C7EE610521F480A767E28B4E867D7F71D24BDA550C1C6AE0FD796233A")
val = 10  # this could also be the result of a previous operation, e.g. tab2["col1"].mean()
# these filters will be recorded as `filter1 = tab1[tab1["col"] < 10]`
filter1 = tab1[tab1["col"] < val]
filter2 = tab2[tab2["col"] < val]
When we record this script, the filter will only be allowed for val == 10. To overcome this, we can use a placeholder. When the script below is recorded, the filter will be allowed on any value that is assigned to val in authorized mode:
tab1 = cd.get_table("39EF406EFB07023D06A4D0051AC42F4FFDAD5CF7506560E74CDA12117C4D2433")
tab2 = cd.get_table("4FB1F73C7EE610521F480A767E28B4E867D7F71D24BDA550C1C6AE0FD796233A")
# When this script is recorded, any value can be assigned to `val` in authorized mode
val = cd.placeholders.Any(10)
filter1 = tab1[tab1["col"] < val]
filter2 = tab2[tab2["col"] < val]
Still, this will not guarantee that the value being filtered on is the same for both filters. The script for authorized mode could be manipulated as follows:
tab1 = cd.get_table("39EF406EFB07023D06A4D0051AC42F4FFDAD5CF7506560E74CDA12117C4D2433")
tab2 = cd.get_table("4FB1F73C7EE610521F480A767E28B4E867D7F71D24BDA550C1C6AE0FD796233A")
# When this script is recorded, any value can be assigned to `val` in authorized mode
val = cd.placeholders.Any(10)
filter1 = tab1[tab1["col"] < val]
# hack
val.value = 20
filter2 = tab2[tab2["col"] < val]
There are ways in which these types of scripts can be properly implemented. Please reach out to us when you need help on this!
Computed objects might be removed from cache¶
Objects that are computed (as opposed to uploaded using e.g. crandas.crandas.DataFrame()) are kept in a cache at the engine server.
Depending on how the engine server is configured, such objects may be removed from the cache by the server(s) to reclaim memory.
For example:
>>> import crandas as cd
>>> uploaded = cd.DataFrame({"a": [1,2,3]}, auto_bounds=True) # Will not be purged
>>> computed = uploaded.assign(b = lambda x: x.a + 1)
>>> computed.open() # May be purged
RuntimeError: ("The PRSS keys that we tried to use were not present at the server (maybe they have been purged from the server cache?). We've reset the keys and they will be re-initialized, please retry the last command.", 'The PRSS keys used for table uploads were not found (handle 2E6BA04C39AA38C660E24DC395FDB5EBDAF169F3324D79D2F64C27C38775B260). Try re-initializing them.', 8)
>>> computed.open()
crandas.errors.ServerError: Object with handle FCBB3266F1914AEFAC7574D25708A57FE396E37115E47E8B9B959AA265DB0AD3 does not exist (HTTP 400 Bad Request, error code: ServerErrorCode.CLIENT_ERROR_HANDLE_NOT_FOUND)
>>> computed = uploaded.assign(b = lambda x: x.a + 1) # Re-compute object
>>> computed.open()
   a  b
0  1  2
1  2  3
2  3  4
>>> computed.save() # Prevent purging
>>> computed.open()
   a  b
0  1  2
1  2  3
2  3  4
In this example, uploaded is an uploaded object. Such an object will not be purged from the server cache, and typically also remains available after server restarts.
However, computed is a computed object, computed from uploaded. Such an object may be purged from the server cache at some point, depending on how the server is configured.
For example, if computed.open() is run a week after computed has been computed, the object may well have been removed from the server cache.
This can lead to two types of error message:
In the first call to computed.open() in the example above, the upload/download keys were purged from the server cache. As the error indicates, crandas recovers from this error automatically by re-initializing the PRSS system used for uploading/downloading.
In the second call to computed.open() in the example above, the computation result itself was removed from the server cache. In this case, please re-compute the object by running the original script.
To prevent purging of an object, it can be saved by using stateobject.StateObject.save().
Calling this method will stop the object from being purged, treating it in the same way as an upload.
If purging-related errors occur regularly, please contact the server administrator to change the server caching settings.
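For scripts that run unattended, the retry-and-recompute advice above can be wrapped in a small helper. This is a hedged sketch rather than a crandas feature: open_with_recompute, its parameters and the FakeObject class are hypothetical, and the fake object only simulates two purge-related failures so that the example runs without an engine.

```python
def open_with_recompute(compute, retry_on, attempts=3):
    """Call compute().open(), recomputing the object when a listed error occurs.

    `compute` rebuilds the computed object from scratch, `retry_on` is a tuple
    of exception types treated as purge-related, and `attempts` (at least 1)
    limits how often the computation is tried.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return compute().open()
        except retry_on as error:
            last_error = error   # remember the error and recompute on the next pass
    raise last_error


# Simulated failures: first the keys are purged, then the object itself
failures = [RuntimeError("keys purged"), LookupError("handle not found")]

class FakeObject:
    """Stand-in for a computed object whose open() fails until recomputed twice."""

    def open(self):
        if failures:
            raise failures.pop(0)
        return {"a": [1, 2, 3], "b": [2, 3, 4]}


result = open_with_recompute(FakeObject, retry_on=(RuntimeError, LookupError))
print(result)
```

With crandas, compute would be a function that repeats the original computation (such as the assign call in the example above), and retry_on would list the two exception types shown in the error messages, RuntimeError and crandas.errors.ServerError. Saving the object with save() remains the more direct way to prevent purging.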
Verifying your crandas version¶
If you’re experiencing issues with crandas or just want to confirm that you’re using the latest version, it’s useful to know how to check which version of crandas you’re running. This is done with the following:
from crandas.base import session
session.version
This version should match the version that is running on the engine, which can be checked via the platform.