Recording a Script¶
Processing data in an authorized environment requires recording the script that needs to be executed.
The recording is done in a design environment that contains dummy data: tables that have the same structure as the real data but with fake entries.
The script is executed in the design environment, creating a recording
file that must be uploaded to the platform for approval.
After a script is recorded and an approval is created, only a couple of simple changes need to be made to run it in the authorized environment.
The first step when recording a script is to make sure that we are connected to a design environment.
In such an environment there are no restrictions on the computations that can be executed.
In general, this requires connecting to the engine with the connection file that ends in -design.
import crandas as cd
cd.connect("environment-design")
The purpose of this environment is to design the script that will be run over the real data.
This script will be recorded into a .recording
file that must then be cryptographically signed by the approvers, typically through the platform.
Tip
You should only record the script after you have fully designed and run the analysis.
To start recording a script, run the following:
# Adding a name for the analysis is not obligatory, but it is recommended
script = cd.script.record("Analysis name")
After running the script that needs to be recorded, create the recording file by calling the save
method and specifying a filename for the recorded script.
# After the script is finished, create the recording file
script.save("filename.recording")
Note that if there is an issue during the recording of the script, you should restart the recording from scratch instead of resuming halfway.
Otherwise, certain steps might be recorded multiple times or not at all, meaning that the authorized script will not work.
Call cd.script.record()
to start the recording again.
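Put together, a restart looks like this (a sketch reusing only the calls shown above):
# Start a fresh recording; the previous, partial recording is simply discarded
script = cd.script.record("Analysis name")
# ... re-run the full analysis ...
# Save the recording only once everything has run cleanly
script.save("filename.recording")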
Using Jupyter
When you are using a Jupyter notebook, you can use Cell -> Run All
(or Kernel -> Restart & Run All
).
This will start recording a script, run the commands in order in design mode, and save the script to the specified filename.
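For example, a notebook could be laid out as follows (a sketch; the analysis cells contain whatever commands you designed earlier):
# Cell 1: connect and start the recording
import crandas as cd
cd.connect("environment-design")
script = cd.script.record("Analysis name")

# Cells 2..n: the analysis commands

# Final cell: save the recording file
script.save("filename.recording")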
Part of the goal of recording a script is to verify that the analysis actually produces the output you want. For this to happen, there must be tables that look like the real data over which the script is run. Of course, these tables cannot contain real data, yet they must be linked to the tables in the authorized environment that do.
Uploading dummy data¶
In order to record a script, we need a table with the same structure as the table of real data over which the analysis will ultimately be performed.
By the same structure, we mean that it has the same columns in the same order, and that each of these columns is of the appropriate type, including whether it may contain missing values.
To ensure this structure is correct, you can use table schemas.
When uploading a table, use the schema
argument to guarantee that the table has the right properties.
For example, if a column could have missing values, the schema can record this distinction even if the dummy data contains no missing values.
SCHEMA = {"string": "varchar", "nullable_string": "varchar?", "ranking": "int8"}
tab = cd.DataFrame(
    {
        "string": ["these", "are", "strings"],
        "nullable_string": ["so", "are", "these"],
        "ranking": [1, 2, 3],
    },
    schema=SCHEMA,
)
When checking the structure of tab
we see that the nullable_string
column is of type varchar?
even though it has no missing values.
>>> tab
Handle: 52773671E3EE4B863460E6E991084D743B0CBDEC30B45C6D9489102EF1959931 (design mode)
        string nullable_string ranking
ctype  varchar        varchar?    int8
Size: 3 rows x 3 columns (contents secret-shared; data size: 360 bytes)
Having uploaded the table with dummy data, we need to associate it with the table of real data found in the authorized environment.
For that, we use the crandas.DataFrame.save()
method with the dummy_for
keyword.
Assuming that we saved the handle of the real data table in the PROD_HANDLE
variable, we do the following:
>>> tab.save(dummy_for=PROD_HANDLE)
This will associate the table that we uploaded with the real data table in the authorized environment.
Note that the process of uploading dummy data must be done outside of the recording.
Once we are inside the recording, we must access the table with crandas.get_table().
# Given a pandas DataFrame `df` with the dummy data, a schema `SCHEMA`
# and the real data handle `PROD_HANDLE`, the process looks like this:
dummy_tab = cd.upload_pandas_dataframe(df, schema=SCHEMA)
dummy_tab.save(dummy_for=PROD_HANDLE)
script = cd.script.record("Analysis name")
# Note that we use the authorized handle and not the dummy one;
# passing `schema` ensures that the table fulfills the schema we expect
tab = cd.get_table(PROD_HANDLE, schema=SCHEMA)
After uploading all the tables that will be part of the analysis, we can continue to design the script.
Note
Typically, table uploads are done outside of script recordings with the sensitive data uploaded through the platform.
Values and placeholders¶
Beyond having dummy data uploaded to the platform, a script might also use certain data that is either computed or comes from external sources. For example, given a column with a ranking from 1 to 10, we might want to keep only the responses with a value bigger than 5. In that case we would write the following:
new_tab = tab[tab["ranking"]>5]
This will work without an issue, but what if we want to compare it not to a number that is already known but to something that is computed from the data, like a mean?
# Here we compute the mean, which *opens* it
mean = tab["ranking"].mean()
# Now we filter based on the mean
new_tab = tab[tab["ranking"]>mean]
# This will not work, because the mean of the dummy data
# will differ from the mean of the real data
This example, where we open the mean of a column and then try to filter the table, will not work when executing a recorded script.
This is because once mean
is opened, its value becomes hardcoded in the recording.
For this particular case, the solution is simply not to open it.
Every unopened value (as well as any table, grouping or model) created during a recording will correspond to the analogous object in the authorized environment.
# mode="regular" ensures the mean is not opened
mean = tab["ranking"].mean(mode="regular")
# This will now work
new_tab = tab[tab["ranking"]>mean]
# If we want to know the mean, we must open it
mean.open()
While this is the preferred way to deal with values, there could be a situation where the value is not known in advance and not computed from the data.
In this situation you can use placeholders.
The Any
placeholder does what its name suggests: it allows us to use any value in that place.
val = cd.placeholders.Any(5)
# In the authorized environment, `val` can be substituted by any value
values_above = tab[tab["ranking"]>val]
values_below = tab[tab["ranking"]<val]
Placeholders allow any possible value to be supplied during execution in the authorized environment.
Note that placeholders do not force a type, so in the previous example we could have val=5.5
which will be treated as a fixed-point number even though the original placeholder was an integer.
However, a placeholder must maintain the same value every time it shows up.
For example, if we recorded the previous code block, we could execute the following:
new_tab = tab[tab["ranking"]>4.2]
new_tab = tab[tab["ranking"]<4.2]
But we could not execute this, as the placeholder will have two different values:
new_tab = tab[tab["ranking"]>4.2]
new_tab = tab[tab["ranking"]<5]
Warning
Given an approved script with a placeholder, every distinct run of the script will allow a different value for the placeholder. Minimize the use of placeholders in order to limit this possibility.
Conditionals and loops¶
if
statements and for
loops are the usual tools of a programmer; however, you must be careful when using them in a recorded script.
A script recording only records the calls that are made to the engine.
Therefore, in cases where a conditional determines which operation or values will be used, the recording and the actual execution will not match.
Here is a simple example:
if tab["ranking"].max(mode="open") > 6:
    tab["succeeded"] = 0
else:
    tab["succeeded"] = 1
Only one of the two branches will be executed when recording this script, and this is not necessarily the branch that will be taken when working with real data.
With the example table from the beginning of this article, this conditional goes into the else
branch, but the real data could realistically go into the first branch.
The execution would then fail because the recording and the execution do not coincide.
When recording, this will output a ConditionalCallDetected
warning.
Depending on what is needed, a conditional call can be avoided by using CSeries.if_else()
, when().then()
, thresholds or placeholders.
For example, the previous conditional can be converted to:
tab["succeeded"] = (tab["ranking"].max(mode="regular") > 6).if_else(1, 0)
Note
Unlike in the previous section, we cannot simply leave the value unopened in order to use it in a conditional.
This is because something like tab["ranking"].max(mode="regular") > 6
is a secret result that cannot be evaluated in an if
block.
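For illustration, this is roughly what goes wrong (a sketch; the exact error raised may differ):
# The comparison yields a secret-shared result, not a Python boolean
secret = tab["ranking"].max(mode="regular") > 6
if secret:  # fails: the truth value of a secret result cannot be determined
    tab["succeeded"] = 0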
A similar issue can happen when we use a for
loop.
For example, consider the following:
# We group the table to get all distinct values of `ranking`
grouping = tab.groupby("ranking")
values = grouping.as_table()["ranking"]
for i in range(0, len(values)):
    # This iterates through the rows and prints them one by one
    print(values[i:i+1].open())
This code will make as many calls to the slicing and opening functions as there are values in the dummy data table.
If this number differs for the production table, the recorded script and the actual execution will no longer coincide.
This behavior also leads to a ConditionalCallDetected
warning.
In general, iteration over rows is not recommended because of this issue.
Most operations are rowwise and do not require explicit iteration.
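For instance, the loop above can often be replaced by opening the column in a single call (a sketch, assuming the whole column may be opened at all):
# One call instead of one call per row, so the row count no longer matters
grouping = tab.groupby("ranking")
values = grouping.as_table()["ranking"]
print(values.open())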
There are times when for loops are not problematic, such as when we iterate over columns instead of rows.
While a ConditionalCallDetected
warning will still be issued, we know the column structure of the table, and we know the dummy data table has the same schema, so we can do the following:
# We iterate over the columns of the table
for col in tab.columns.cols:
    # Check whether the column is an integer column
    if col.type == "i":
        # Print the mean of the column
        print(tab[col.name].mean(mode="open"))
This code prints the means of the integer columns.
While this is a for
statement, it will always work because the structure of the dummy and real tables is the same.
Note that we even use a conditional here, which works for the same reason: it refers to the schema of the table (which is the same in design and in authorized mode).