Recording a script

Processing data in an authorized environment requires recording the script that needs to be executed. The recording is done in a design environment that contains dummy data, tables that have the same¹ structure as the real data but with fake entries. The script is executed in the design environment, creating a recording file that must be uploaded to the platform for approval. After a script is recorded and an approval is created, there are a couple of simple changes that need to be done in the authorized environment.

The first step when recording a script is to make sure that we are connected to a design environment. In such environment there are no restrictions on the computations that can be executed. In general, this requires connecting to the engine with the connection file that ends in -design.

import crandas as cd
cd.connect("environment-design")

The purpose of this environment is to design the script that will be run over the real data. This script will be recorded into a .recording file that must then be cryptographically signed by the approvers, typically through the platform.

Tip

You should only attempt to record the script after you have fully run and designed the analysis to be run.

To start recording a script, call cd.script.record():

# Adding a name for the analysis is not obligatory, but it is recommended.
# After recording, the recording file will be stored at the specified path.
cd.script.record("Analysis name", path="filename.recording")

After running the script that needs to be recorded, end the recording by calling the [.end()][crandas.script.Script.end] method on the script.

# After the script is finished, end the recording
script.end()

Note that if there is an issue in the recording of the script, you should restart the script recording from scratch instead of trying halfway. Otherwise certain steps might be recorded multiple times or not at all, meaning that the authorized script will not work. Call cd.script.record() to start the recording again.

Using Jupyter

When you are using a Jupyter notebook, can use Cells -> Run all (or Kernel -> Restart & Run all). This will start recording a script, run the commands in order against in the design mode, and save the script to the specified filename.

There are multiple technical details to take into account when recording a script. Follow along

Uploading dummy data
Linear execution and errors
Values and placeholders
Conditionals and loops

Part of the goal of recording a script is to see that you effectively get the data that you want to get out. For this to happen, there must be tables that look like the real data over which the script is run. Of course, these tables cannot contain real data, yet they must be connected to the tables in the authorized environment that do.

Uploading dummy data

In order to record a script, we need to have a table with the same structure as the table of real data over which the analysis will ultimately be performed. By the same structure, we mean that it has the same columns in the same order and each of these columns is of the appropriate type, including whether it contains missing values. In order to ensure this structure is correct, you can use table schemas. When uploading a table, use the schema argument to ensure that the table has the right properties. For example, if a column could have missing values, the schema can make this distinction even if the dummy data does not contain those values.

SCHEMA = {"string": "varchar", "nullable_string": "varchar?", "ranking": "int8"}

tab = cd.DataFrame(
    {
        "string": ["these", "are", "strings"],
        "nullable_string": ["so", "are", "these"],
        "ranking": [1, 2, 3],
    },
    schema=SCHEMA,
)

When checking the structure of tab we see that the nullable_string column is of type varchar? even if it has no missing values.

>>> tab
Handle: 52773671E3EE4B863460E6E991084D743B0CBDEC30B45C6D9489102EF1959931 (design mode)
        string nullable_string
ctype  varchar        varchar?
Size: 3 rows x 2 columns (contents secret-shared; data size: 360 bytes)

Having uploaded the table with dummy data, we need to associate it with the table with the real data that is found in the authorized environment. For that, we specify both the prod_handle and the dummy_handle in get_table(). Assuming that we saved the handle for the table with the real data in the PROD_HANDLE variable and the handle for the dummy data in the DUMMY_HANDLE variable, our analysis is as follows.

# Given a schema `SCHEMA`, the real handle `PROD_HANDLE` and the fake handle `DUMMY_HANDLE`
# The analysis would look like this

cd.script.record("Analysis name")

tab =  cd.get_table(prod_handle=PROD_HANDLE, dummy_handle=DUMMY_HANDLE, schema=SCHEMA)
# We get with `schema` to ensure that the table fulfills the schema we want

After uploading all the tables that will be part of the analysis, we can continue to design the script.

Note

Typically, table uploads are done outside of script recordings with the sensitive data uploaded through the platform.

Linear execution

Whenever a script is recorded and approved, it must be executed in the same way that it was recorded. In particular, the order of operations must be maintained. The engine enforces this linear execution of operations i.e. every operation must be computed in the same order.

As we will see later, this introduces difficulties when dealing with conditionals or loops.

Linear execution also means that execution of a script will not continue after an error. Because all of the steps must be successfully computed in order, when a step fails then the rest of the script cannot be computed.

cd.script.record("Analysis name")

tab =  cd.get_table(PROD_HANDLE, schema=SCHEMA)

# It is possible there is an error here if there are not enough instances where the value equals 5
tab.filter(tab["col1"]==5, threshold=3)

# If there is a threshold error, this following line will not run
grouping = tab.groupby("col1")

In most cases, this is the desired behaviour, as an error might suggest a problem in the analysis design and therefore it is better to err on the side of caution. However certain errors, like threshold violations, are unpredictable and you might want to continue your script after such an error appears. For that purpose, we provide the cd.script.allow_errors context.

cd.script.record("Analysis name")

tab =  cd.get_table(PROD_HANDLE, schema=SCHEMA)

# The part that might fail is put in a context that allows errors
with cd.script.allow_errors():
    tab.filter(tab["col1"]==5, threshold=3)

# This line will now run regardless of whether there is an error or not
grouping = tab.groupby("col1")

Important

Whenever you're using the cd.script.allow_errors context, it must appear in both the recording and the execution. Using it in one and not the other will result in an error.

Values and placeholders

Beyond having dummy data uploaded to the platform, a script might also use certain data that is either computed or comes from external sources. For example, given a column with a ranking from 1 to 10 we might want to filter out the responses with a value that is bigger than 5. In that case we would write the following:

new_tab = tab[tab["ranking"]>5]

This will work without an issue, but what if we want to compare it not to a number that is already known but to something that is computed from the data, like a mean?

# Here we compute the mean, which *opens* it
mean = tab["ranking"].mean()

# Now we filter based on the mean
new_tab = tab[tab["ranking"]>mean]
# This will not work, because the mean of the dummy data
# will be different than the real data

This example, where we open the mean of a column and then try to filter the table will not work when executing a recorded script. This is because when mean is opened it becomes a value that becomes hardcoded in the recording. For this particular case, the solution is to simply not open it. Every value that is unopened (as well as any table, grouping or model) that is created during a recording will correspond to the analogous object in the authorized environment.

# mode="regular" ensures the mean is not opened
mean = tab["ranking"].mean(mode="regular")

# This will now work
new_tab = tab[tab["ranking"]>mean]

# If we want to know the mean, we must open it
mean.open()

While this is the preferred way to deal with values, there could be a situation where the value is not known in advanced and not computed from the data. In this situation you can use placeholders. The Any placeholder does what its names suggests, it allows us to use any value in that place.

val = cd.placeholders.Any(5)

# In the authorized environment, `val` can be substituted by any value 
values_above = tab[tab["ranking"]>val]
values_below = tab[tab["ranking"]<val]

Placeholders allow for any possible value to be added in the execution in the authorized environment. Note that placeholders do not force a type, so in the previous example we could have val=5.5 which will be treated as a fixed-point number even when the original placeholder was an integer. However, a placeholder must maintain the same value every time it shows up. For example, if we recorded the previous code block, we could execute the following:

new_tab = tab[tab["ranking"]>4.2]
new_tab = tab[tab["ranking"]<4.2]

But we could not execute this, as the placeholder will have two different values:

new_tab = tab[tab["ranking"]>4.2]
new_tab = tab[tab["ranking"]<5]

Warning

Given an approved script with a placeholder, every distinct run of the script will allow a different value for a placeholder. You must try to minimize the use of placeholders in order to prevent this possibility.

Conditionals and loops

if statements and for loops are the usual tools of a programmer, however, you must be careful when using them in a recorded script. A script recording only records the calls that are made to the engine. Therefore, in cases where a conditional determines which operation or values will be used, the recording and the actual execution will not match. Here is a simple example:

if tab["ranking"].max(mode="open") > 6:
    tab["succeeded"] = 0
else:
    tab["succeeded"] = 1

Only one of the two branches will be executed when recording this script and this is not necessarily what will be executed when working with real data. In the given example table at the beginning of this article, this conditional will go into the else branch, but the real data could realistically go into the first branch. The execution would then fail because the recording and the execution will never coincide. When recording, this will output a ConditionalCallDetected warning. Depending of what is needed, a conditional call can be avoided by using .if_else(), when(),then(), thresholds or placeholders. For example, the previous example could be converted to:

tab["succeeded"] = (tab["ranking"].max(mode="regular")>6).if_else(1,0)

Note

Unlike the previous section, we cannot simply not open the value to use it in a conditional. This is because something like tab["ranking"].max(mode="regular") > 6 cannot be evaluated in an if block.

A similar issue can happen when we use a for loop. For example, see the following example:

# We group the table to get all distinct values of `ranking`
grouping = tab.groupby("ranking")
values = grouping.as_table()["ranking"]

for i in range(0,len(values)):
    # This iterates through rows and prints them
    print(values[i:i+1].open())

This code will make as many calls to the slicing and opening functions as there are values in the dummy data table. If this differs from the production table, the scripts will no longer coincide. This behavior also leads to a ConditionalCallDetected warning. In general, iteration over rows is not recommended because of this issue. Most operations are rowwise and do not require iterating.

There are sometimes where for loops are not problematic, like when we are iterating over columns instead of rows. While a ConditionalCallDetected warning will still be issued, because we know the column structure of the table and we know the dummy data table has the same schema, we can do the following:

# We iterate over columns
for col in tab.columns.cols:
    # Check if the column is an integer
    if (col.type=="i"):
        # We print the mean
        print(tab[col.name].mean(mode="open"))

This code prints the means of integer columns. While this is a for statement, it will always work because the structure of the dummy and real tables are the same. Note that we even use a conditional here, which will work for the same reason: it refers to the schema of the table (which is the same in design and in authorized mode).

Except for number of rows. ↩