Recording a script
Processing data in an authorized environment requires recording the
script that needs to be executed. The recording is done in a design
environment that contains dummy data, tables that have the same1
structure as the real data but with fake entries. The script is executed
in the design environment, creating a recording file that must be
uploaded to the platform for
approval.
After a script is recorded and an approval is created, there are a
couple of simple changes that need to be done in the authorized
environment.
The first step when recording a script is to make sure that we are
connected to a design environment. In such environment there are no
restrictions on the computations that can be executed. In general, this
requires connecting to the engine with the connection file that ends in
-design.
The purpose of this environment is to design the script that will be run
over the real data. This script will be recorded into a .recording
file that must then be cryptographically signed by the approvers,
typically through the platform.
Tip
You should only attempt to record the script after you have fully run and designed the analysis to be run.
To start recording a script, call cd.script.record():
# Adding a name for the analysis is not obligatory, but it is recommended.
# After recording, the recording file will be stored at the specified path.
cd.script.record("Analysis name", path="filename.recording")
After running the script that needs to be recorded, end the recording by calling the .end() method on the script.
Note that if there is an issue in the recording of the script, you should restart the script recording from scratch instead of trying halfway. Otherwise certain steps might be recorded multiple times or not at all, meaning that the authorized script will not work. Call cd.script.record() to start the recording again.
Using Jupyter
When you are using a Jupyter notebook, can use Cells -> Run all (or
Kernel -> Restart & Run all). This will start recording a script, run
the commands in order against in the design mode, and save the script to
the specified filename.
There are multiple technical details to take into account when recording a script. Follow along
Part of the goal of recording a script is to see that you effectively get the data that you want to get out. For this to happen, there must be tables that look like the real data over which the script is run. Of course, these tables cannot contain real data, yet they must be connected to the tables in the authorized environment that do.
Uploading dummy data
In order to record a script, we need to have a table with the same
structure as the table of real data over which the analysis will
ultimately be performed. By the same structure, we mean that it has
the same columns in the same order and each of these columns is of the
appropriate type, including whether it contains missing values. In order
to ensure this structure is correct, you can use
table schemas. When
uploading a table, use the schema argument to ensure that the table
has the right properties. For example, if a column could have missing
values, the schema can make this distinction even if the dummy data does
not contain those values.
SCHEMA = {"string": "varchar", "nullable_string": "varchar?", "ranking": "int8"}
tab = cd.DataFrame(
{
"string": ["these", "are", "strings"],
"nullable_string": ["so", "are", "these"],
"ranking": [1, 2, 3],
},
schema=SCHEMA,
)
When checking the structure of tab we see that the nullable_string
column is of type varchar? even if it has no missing values.
>>> tab
Handle: 52773671E3EE4B863460E6E991084D743B0CBDEC30B45C6D9489102EF1959931 (design mode)
string nullable_string
ctype varchar varchar?
Size: 3 rows x 2 columns (contents secret-shared; data size: 360 bytes)
Having uploaded the table with dummy data, we need to associate it with
the table with the real data that is found in the authorized
environment. For that, we specify both the prod_handle and the
dummy_handle in get_table().
Assuming that we saved the handle for the table with the real data in
the PROD_HANDLE variable and the handle for the dummy data in the
DUMMY_HANDLE variable, our analysis is as follows.
# Given a schema `SCHEMA`, the real handle `PROD_HANDLE` and the fake handle `DUMMY_HANDLE`
# The analysis would look like this
cd.script.record("Analysis name")
tab = cd.get_table(prod_handle=PROD_HANDLE, dummy_handle=DUMMY_HANDLE, schema=SCHEMA)
# We get with `schema` to ensure that the table fulfills the schema we want
After uploading all the tables that will be part of the analysis, we can continue to design the script.
Note
Typically, table uploads are done outside of script recordings with the sensitive data uploaded through the platform.
Linear execution
Whenever a script is recorded and approved, it must be executed in the same way that it was recorded. In particular, the order of operations must be maintained. The engine enforces this linear execution of operations i.e. every operation must be computed in the same order.
As we will see later, this introduces difficulties when dealing with conditionals or loops.
Linear execution also means that execution of a script will not continue after an error. Because all of the steps must be successfully computed in order, when a step fails then the rest of the script cannot be computed.
cd.script.record("Analysis name")
tab = cd.get_table(PROD_HANDLE, schema=SCHEMA)
# It is possible there is an error here if there are not enough instances where the value equals 5
tab.filter(tab["col1"]==5, threshold=3)
# If there is a threshold error, this following line will not run
grouping = tab.groupby("col1")
In most cases, this is the desired behaviour, as an error might suggest
a problem in the analysis design and therefore it is better to err on
the side of caution. However certain errors, like
threshold violations, are
unpredictable and you might want to continue your script after such an
error appears. For that purpose, we provide the
cd.script.allow_errors context.
cd.script.record("Analysis name")
tab = cd.get_table(PROD_HANDLE, schema=SCHEMA)
# The part that might fail is put in a context that allows errors
with cd.script.allow_errors():
tab.filter(tab["col1"]==5, threshold=3)
# This line will now run regardless of whether there is an error or not
grouping = tab.groupby("col1")
Important
Whenever you're using the cd.script.allow_errors
context, it must appear in both the recording and the
execution. Using it in one and not the other will result in an error.
Values and placeholders
Beyond having dummy data uploaded to the platform, a script might also use certain data that is either computed or comes from external sources. For example, given a column with a ranking from 1 to 10 we might want to filter out the responses with a value that is bigger than 5. In that case we would write the following:
This will work without an issue, but what if we want to compare it not to a number that is already known but to something that is computed from the data, like a mean?
# Here we compute the mean, which *opens* it
mean = tab["ranking"].mean()
# Now we filter based on the mean
new_tab = tab[tab["ranking"]>mean]
# This will not work, because the mean of the dummy data
# will be different than the real data
This example, where we open the mean of a column and then try to filter
the table will not work when executing a recorded script. This is
because when mean is opened it becomes a value that becomes hardcoded
in the recording. For this particular case, the solution is to simply
not open it. Every value that is unopened (as well as any table,
grouping or model) that is created during a recording will correspond to
the analogous object in the authorized environment.
# mode="regular" ensures the mean is not opened
mean = tab["ranking"].mean(mode="regular")
# This will now work
new_tab = tab[tab["ranking"]>mean]
# If we want to know the mean, we must open it
mean.open()
While this is the preferred way to deal with values, there could be a
situation where the value is not known in advanced and not computed from
the data. In this situation you can use placeholders. The
Any placeholder
does what its names suggests, it allows us to use any value in that
place.
val = cd.placeholders.Any(5)
# In the authorized environment, `val` can be substituted by any value
values_above = tab[tab["ranking"]>val]
values_below = tab[tab["ranking"]<val]
Placeholders allow for any possible value to be added in the execution
in the authorized environment. Note that placeholders do not force a
type, so in the previous example we could have val=5.5 which will be
treated as a fixed-point number even when the original placeholder was
an integer. However, a placeholder must maintain the same value every
time it shows up. For example, if we recorded the previous code block,
we could execute the following:
But we could not execute this, as the placeholder will have two different values:
Warning
Given an approved script with a placeholder, every distinct run of the script will allow a different value for a placeholder. You must try to minimize the use of placeholders in order to prevent this possibility.
Conditionals and loops
if statements and for loops are the usual tools of a programmer,
however, you must be careful when using them in a recorded script. A
script recording only records the calls that are made to the engine.
Therefore, in cases where a conditional determines which operation or
values will be used, the recording and the actual execution will not
match. Here is a simple example:
Only one of the two branches will be executed when recording this script
and this is not necessarily what will be executed when working with real
data. In the given example table at the beginning of this article, this
conditional will go into the else branch, but the real data could
realistically go into the first branch. The execution would then fail
because the recording and the execution will never coincide. When
recording, this will output a
ConditionalCallDetected
warning. Depending of what is needed, a conditional call can
be avoided by using .if_else(),
when(),then(), thresholds or
placeholders.
For example, the previous example could be converted to:
Note
Unlike the previous section, we cannot simply not open the value to use
it in a conditional. This is because something like
tab["ranking"].max(mode="regular") > 6 cannot be evaluated in an if
block.
A similar issue can happen when we use a for loop. For example, see
the following example:
# We group the table to get all distinct values of `ranking`
grouping = tab.groupby("ranking")
values = grouping.as_table()["ranking"]
for i in range(0,len(values)):
# This iterates through rows and prints them
print(values[i:i+1].open())
This code will make as many calls to the slicing and opening functions
as there are values in the dummy data table. If this differs from the
production table, the scripts will no longer coincide. This behavior
also leads to a
ConditionalCallDetected
warning. In general, iteration over rows is not recommended
because of this issue. Most operations are rowwise and do not require
iterating.
There are sometimes where for loops are not problematic, like when we
are iterating over columns instead of rows. While a
ConditionalCallDetected
warning will still be issued, because we know the column
structure of the table and we know the dummy data table has the same
schema, we can do the following:
# We iterate over columns
for col in tab.columns.cols:
# Check if the column is an integer
if (col.type=="i"):
# We print the mean
print(tab[col.name].mean(mode="open"))
This code prints the means of integer columns. While this is a for
statement, it will always work because the structure of the dummy and
real tables are the same. Note that we even use a conditional here,
which will work for the same reason: it refers to the schema of the
table (which is the same in design and in authorized mode).
-
Except for number of rows. ↩