Approvals¶
Most of the material in this guide applies when the VDL servers run in unauthorized mode.
In this mode, any query is allowed and is executed immediately by the VDL.
This includes revealing all source data with the table.open()
command, so this mode is intended mainly for environments that contain only dummy data.
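To illustrate, the following sketch shows how directly data can be revealed in unauthorized mode; the column values are made up for the example:

import crandas as cd

# Upload a small table; in unauthorized mode no approval is needed
tab = cd.DataFrame({"x": [1, 2, 3]}, auto_bounds=True)

# open() reveals the full source data, which is why unauthorized mode
# should only be used with dummy data
print(tab.open())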
In production environments, we recommend running the VDL in authorized mode. Here, all queries must be approved by a fixed set of approvers before they can be executed.
Here, a query is a request for a single computation that crandas sends to the VDL servers. Every command in crandas, such as a join or a concatenation, works by sending one or more queries to the VDL. The VDL then distributes the query among all three servers, and if all servers agree that the query should be executed, they perform it and send back the result (usually a table) to the client.
In authorized mode, the VDL servers only accept a query if it is accompanied by a list of digital signatures: one signature for each pre-configured approver. Approvers hold a secret signing key that they use to generate the digital signatures. For each signing key there is a unique associated verification key, which may be distributed publicly. The VDL servers are configured with the verification key of each approver, and they use these keys to check whether the signatures on the queries they receive are valid.
The actual situation is slightly more involved, because besides the signatures of the approvers, the analyst that uses crandas also signs the query. This is because the approvers only approve a query to be executed by a single designated analyst. To accomplish this, the user of crandas generates an analyst key, which, like the approver keys, consists of two parts: a secret signing key and a public verification key. The approvers include the verification key of the analyst in their approval. When crandas wants to execute a query, it first signs the query using the secret analyst key. It then sends the query, together with its own signature and the signatures from the approvers, to the VDL servers.
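The sign-and-verify pattern itself is standard public-key cryptography. The following sketch illustrates it with Ed25519 signatures from the cryptography package; the key format and signing scheme that the VDL actually uses may differ.

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# An approver generates a key pair: the signing key stays secret,
# the verification key can be distributed publicly (e.g. to the VDL servers)
signing_key = Ed25519PrivateKey.generate()
verification_key = signing_key.public_key()

# The approver signs the (serialized) query
query = b"serialized query bytes"
signature = signing_key.sign(query)

# A VDL server checks the signature against the configured verification key;
# verify() raises InvalidSignature if the query or signature does not match
try:
    verification_key.verify(signature, query)
    print("signature valid: query may be executed")
except InvalidSignature:
    print("signature invalid: query rejected")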
Recording an analysis¶
To record an analysis, the user calls crandas.script.record()
before executing any crandas commands.
Then, they may perform any number of crandas commands.
Since recording happens when the user is connected to the design environment, the commands will return normal output (on the dummy data).
After the analysis is complete, the user calls crandas.script.save()
or Script.save()
, which saves the recorded analysis as a JSON-formatted script file.
This file can then be submitted for approval.
An example:
import crandas as cd
import crandas.script

# The name may be omitted, but it helps the approvers
script = cd.script.record(name="Filter tall shapes")

shapes = cd.DataFrame({
    "id": [1, 2, 3],
    "length": [32, 86, 21],
    "height": [41, 66, 11],
}, auto_bounds=True)

tall_shape_filter = shapes["height"] > shapes["length"]
tall_shapes = shapes[tall_shape_filter]
print(tall_shapes.open())

script.save("tall-shapes.json")
The result is that the recorded script will be placed in a JSON-formatted
script file named tall-shapes.json
. The user can now upload this file using
the Portal to obtain approval.
Scripts are always approved to be executed by a designated analyst, who
holds a secret analyst key. By default, the current user's
Session.analyst_key
is included in the recorded script.
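For illustration, the snippet below inspects the key that would be embedded in a recording. The attribute path is an assumption based on the Session.analyst_key attribute mentioned above and may differ per crandas version:

import crandas as cd

# Hypothetical: assumes the default session object is exposed as
# cd.base.session; analyst_key is the attribute named in the text above
print(cd.base.session.analyst_key)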
Executing an approved analysis¶
Once the script has been approved, the analyst will obtain a digitally signed version of the script file. The analyst can then perform the analysis:
import crandas as cd
import crandas.script

# The following line was changed to load the script
script = cd.script.load("tall-shapes.approved")

shapes = cd.DataFrame({
    "id": [1, 2, 3],
    "length": [32, 86, 21],
    "height": [41, 66, 11],
}, auto_bounds=True)

tall_shape_filter = shapes["height"] > shapes["length"]
tall_shapes = shapes[tall_shape_filter]
print(tall_shapes.open())

script.close()
Note that the commands of the analysis are exactly the same as when the script was recorded, and they occur in exactly the same order. This is necessary for the script to work correctly.
It is therefore recommended, when recording scripts from a Jupyter notebook, to save the notebook and then use the “Run all” command to execute all cells in order.
Once the approved script has been obtained, the only commands that need to be modified are the cd.script.record
and script.save
commands, which become cd.script.load and script.close respectively, along with the configuration to point crandas to the correct server.
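In other words, only the first and last script-related lines change between the two versions of the examples above:

# When recording, in the design environment:
script = cd.script.record(name="Filter tall shapes")
# ... the analysis commands, identical in both versions ...
script.save("tall-shapes.json")

# When executing, in the production environment:
script = cd.script.load("tall-shapes.approved")
# ... the analysis commands, identical in both versions ...
script.close()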
Dummy data¶
In the above example analysis, the computation consisted of an upload together with a filter command. Typically, uploads will not happen inside the analysis; rather, an analysis will refer to data that was uploaded earlier. In the production environment, the analysis refers to the real data, whereas in the design environment we use dummy data to explore.
As we have seen, we may refer to tables using their handle, which gets randomly generated on upload. An analysis may look like this:
import crandas as cd
import crandas.script

# The name may be omitted, but it helps the approvers
script = cd.script.record(name="Calculate the mean stroke age")

# 180A66... and EFC8A9... are the handles of tables that were uploaded to the production environment
survey = cd.get_table("180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")
medical = cd.get_table("EFC8A915781AE712856CA7F88AAED560A92D154BA6F78B90DCE36B287B9BC2AA")

merged = cd.merge(survey, medical, how="inner", left_on="patient_id", right_on="patientnr")
stroke_filter = merged["Stroke"] == 1
mean_stroke_age = merged[stroke_filter]["Age"].mean()
print(mean_stroke_age)

script.save("mean-stroke-age.json")
The problem is that if the handles 180A66...
and EFC8A9...
refer to tables in the production environment, then when recording the script in the design environment, the get_table
calls will return an error, because those tables do not exist there: if dummy data was uploaded for the survey
and medical
tables, it will have different, randomly generated handles.
We could change the handles in the analysis to the handles of the dummy data, but then the recorded script would contain the wrong handles.
One solution is to use named tables, as we have seen before, where the table name can be the same in both the design and production environments.
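With named tables, the analysis might look as follows; the exact keyword arguments shown here are an assumption and may differ per crandas version:

# Assumed usage: upload the table under the same fixed name
# in both the design and production environments...
cd.upload_pandas_dataframe(survey, name="survey")

# ...and refer to it by that name in the recorded analysis
survey = cd.get_table("survey")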
A different approach using handles is to use the dummy_for
construct.
When uploading dummy data, for the analysis above, we use the following:
# This happens before the call to cd.script.record()
survey_clear = pd.read_csv("gs_data/health_data_survey.csv")
their_survey = pd.read_csv("gs_data/health_their_survey.csv")
survey = pd.concat([our_survey,their_survey])
cd.upload_pandas_dataframe(survey, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")
Note that we uploaded the dummy data specifying the table handle used in the production environment.
We do the same for the medical
table.
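For completeness, that upload could look as follows; the CSV file name is hypothetical:

# Hypothetical file name for the dummy medical data
medical = pd.read_csv("gs_data/health_medical.csv")
cd.upload_pandas_dataframe(medical, dummy_for="EFC8A915781AE712856CA7F88AAED560A92D154BA6F78B90DCE36B287B9BC2AA")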
Now, if we record the above analysis in the design environment, the script will record correctly!
While in recording mode [1], any call to crandas.get_table()
has its handle replaced by the corresponding dummy handle when computing the result; however, the real handle will still be recorded in the script.
This means that when the script is later saved and approved, the same sequence of commands can be executed in the production environment.