Authorization¶
Most of the material in this guide applies when the VDL servers run in unauthorized mode. In this mode, any query [1] is allowed and can be executed immediately by the VDL, which makes it very useful for interactive exploration. However, queries that reveal all source data can also be executed, so this mode is mostly intended for environments that contain only dummy data.
In production environments, we recommend running the VDL in authorized mode. In this mode, all queries require prior approval by a fixed set of approvers before they can be executed.
Signatures¶
In authorized mode, the VDL servers only accept a query if it is accompanied by a list of digital signatures: one signature for each pre-configured approver. Approvers hold a secret signing key, which they use to generate the digital signatures. For each signing key there is a unique associated verification key, which may be distributed publicly. The VDL servers are configured with the verification key of each approver, and they use these keys to check whether the signatures on the queries they receive are valid.
Besides the signatures of the approvers, there is a second level of signatures, because the analyst using crandas also signs the query. This ensures that only the intended analyst can execute the approved query and obtain its results. To accomplish this, the user of crandas generates an analyst key, which, like the keys above, consists of two parts: a secret signing key and a public verification key. The approvers include the verification key of the analyst in their approval. When crandas wants to execute a query, it first signs the query using the secret analyst key. It then sends the query, together with the signatures from the approvers, to the VDL server.
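The following sketch illustrates the two-level scheme using Ed25519 signatures via the PyNaCl library. This is purely illustrative: the actual signature scheme, query serialization, and approval format used by the VDL are assumptions here.
# Illustrative sketch only; the VDL's actual scheme and formats may differ
from nacl.signing import SigningKey

# Each approver and the analyst holds a secret signing key
approver_sk = SigningKey.generate()
analyst_sk = SigningKey.generate()

# The corresponding verification keys may be distributed publicly
approver_vk = approver_sk.verify_key
analyst_vk = analyst_sk.verify_key

query = b"mean of Age where Stroke == 1"  # stand-in for a serialized query

# The approver's signature covers the analyst's verification key,
# so only that analyst can execute the approved query
approval_sig = approver_sk.sign(query + analyst_vk.encode()).signature

# The analyst signs the query before sending it to the VDL servers
analyst_sig = analyst_sk.sign(query).signature

# The VDL servers, configured with the approvers' verification keys,
# check both signatures; verify() raises BadSignatureError on failure
approver_vk.verify(query + analyst_vk.encode(), approval_sig)
analyst_vk.verify(query, analyst_sig)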
Generating keys¶
To execute queries in authorized mode, queries are signed twice: by the approvers who approve the query, and by the analyst who executes it. Both approvers and analysts sign the query using their secret signing key. To each signing key corresponds a verification key, which may be distributed publicly and which can be used to verify the digital signature.
To generate a signing key and corresponding verification key, the user [2] may use the generate_key.py script that comes with crandas.
To use it from the command line, run:
generate_key.py analyst.sk analyst.pk
This will generate two files, analyst.sk (signing key) and analyst.pk (public verification key), in the current working directory.
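Conceptually, the script does something like the following minimal sketch (using Ed25519 via PyNaCl; the signature algorithm and on-disk key format used by generate_key.py are assumptions):
# Minimal sketch; the real generate_key.py may use a different format
from nacl.signing import SigningKey

sk = SigningKey.generate()           # secret signing key
vk = sk.verify_key                   # public verification key

with open("analyst.sk", "wb") as f:  # keep this file secret
    f.write(sk.encode())
with open("analyst.pk", "wb") as f:  # may be distributed publicly
    f.write(vk.encode())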
For the signing keys and verification keys, there is no distinction between analysts and approvers. If a user takes on both roles of analyst and approver, they may use the same signing key for both purposes.
Dummy data¶
By dummy data we mean non-sensitive example data that has a structure (column names and types) identical to the real data. This data needs to be created manually by the analyst. Since data within the VDL cannot be inspected directly, it is useful to have dummy data that behaves somewhat similarly to the real data in terms of the distribution of values, so that the analyst can check whether their analysis produces the expected result before requesting approval and performing it on the real data.
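For instance, a dummy version of the survey table used in the example below could be generated with pandas as follows; the column names match that example, but the value ranges and record count are hypothetical:
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n = 500  # hypothetical number of dummy records

# Same column names and types as the real table, with roughly
# plausible value distributions
dummy_survey = pd.DataFrame({
    "patient_id": np.arange(n),
    "Age": rng.integers(18, 90, size=n),
    "Stroke": rng.integers(0, 2, size=n),  # binary indicator
})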
We typically refer to tables using their handles. However, since handles are randomly generated upon upload, they differ between the production and design environments. The script submitted for approval needs to refer to the production handles, but while recording, crandas should use the corresponding handles from the design environment.
To link data in the production environment to the design environment, we use the dummy_for construct.
Changed in version 1.5: dummy_for is necessary for dummy data.
Note
This only works if the production data is already available before designing the analysis. If the data is not yet available, the user may use different handles and swap them out later.
For example, consider the following analysis:
import crandas as cd
import crandas.script
# The name may be omitted, but it helps the approvers
script = cd.script.record(name="Calculate the mean stroke age")
# 180A66... and EFC8A9... are the handles of tables that were uploaded to the production environment
survey = cd.get_table("180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")
medical = cd.get_table("EFC8A915781AE712856CA7F88AAED560A92D154BA6F78B90DCE36B287B9BC2AA")
merged = cd.merge(survey, medical, how="inner", left_on="patient_id", right_on="patientnr")
stroke_filter = merged["Stroke"] == 1
mean_stroke_age = merged[stroke_filter]["Age"].mean()
print(mean_stroke_age)
script.save("mean-stroke-age.json")
When uploading dummy data, we use the following:
# This happens before the call to cd.script.record()
import pandas as pd

our_survey = pd.read_csv("gs_data/health_data_survey.csv")
their_survey = pd.read_csv("gs_data/health_their_survey.csv")
survey = pd.concat([our_survey, their_survey])
cd.upload_pandas_dataframe(survey, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")
Note that we uploaded the dummy data specifying the table handle used in the production environment.
We do the same for the medical table.
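For example (the CSV file name here is hypothetical):
medical = pd.read_csv("gs_data/health_data_medical.csv")
cd.upload_pandas_dataframe(medical, dummy_for="EFC8A915781AE712856CA7F88AAED560A92D154BA6F78B90DCE36B287B9BC2AA")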
Now, if we record the above analysis in the design environment, the script will record correctly!
While in recording mode [3], any call to crandas.get_table() will have its handle replaced by the dummy handle for the command result; however, the real handle will still be present in the recorded script. This means that when the script is later saved and approved, the same sequence of commands can be executed in the production environment.
This behavior is the default, and can be turned off by setting map_dummy_handles=False in the call to crandas.script.record().
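For example:
# Record without mapping production handles to dummy handles
script = cd.script.record(map_dummy_handles=False)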
Uploading data from crandas within authorized mode¶
To upload data to the production environment, we also need approval. A typical user can use the Portal, which is already authorized to perform uploads. To upload data from crandas, the user needs a similar authorization.
Roughly, there are two possibilities:
Option 1: Recording an upload script¶
The user records a script for a single upload. They can use the methods described above to record a script that uploads a table. Of course, they may upload several tables in one script as well. They submit the script for approval, and then load the approved script for use in the production environment, where they upload the real data.
>>> cd.script.record()
>>> table = cd.read_csv("dummy_data.csv")
>>> print(table.handle)
>>> cd.script.save("upload-dummy-data.json")
Note that we do not use dummy_for here, because there is no production handle available yet.
And then:
>>> cd.script.load("upload-dummy-data.approved")
>>> table = cd.read_csv("real_data.csv")
>>> print(table.handle)
>>> cd.script.close()
We print the handle and should save it somewhere, because we will need it to refer to the table later in the analysis.
The downside of this approach is that the query contains structural information about the uploaded table. This means that once approval is obtained, the user may only upload a table whose structure is identical to the one that was authorized. In practice, whenever a user wants to upload a table with a different structure, they have to obtain approval again.
One advantage of this approach is that named tables may be used, e.g.
>>> cd.read_csv("dummy_data.csv", name="survey")
This is useful when designing scripts for which the production handles are not yet available. Also, the name is easier to memorize than a randomly generated handle.
Warning
Approving a named upload enables a potential attack that inadvertently leaks data: the data provider uploads several different data sets under the same name and compares the results of an approved analysis that uses the table. It is safer to upload tables without a name and to reference the resulting handles directly.
Option 2: Authorizing queries outside of scripts¶
Approvers can also authorize queries that are not part of an exact script that must be executed in order. Instead, they may approve individual queries on their own, possibly with certain constraints. An analyst may freely use these authorizations within their analysis.
Note
Creating such authorizations requires manually editing the JSON scripts and replacing concrete values with ##variable## tokens; this is outside the scope of this guide.
As an example, the approvers may authorize arbitrary uploads. The analyst obtains a signed query file with the approval, and can use it as follows:
cd.base.session.load_authorization("allow-any-upload.approved")
cd.read_csv("real_data.csv")
Several authorizations may be loaded into a single session; they are all searched for a match against the query that the user wants to execute.
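For example, assuming a second (hypothetical) authorization file alongside the upload approval:
cd.base.session.load_authorization("allow-any-upload.approved")
cd.base.session.load_authorization("allow-mean-queries.approved")

# Subsequent queries are matched against all loaded authorizations
cd.read_csv("real_data.csv")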