Approvals

Most of the material in this guide applies when the VDL servers run in unauthorized mode. In this mode, any query is allowed and is executed by the VDL immediately. This includes revealing all source data via the table.open() command, so this mode is mostly intended for environments that only contain dummy data.

In production environments, we recommend running the VDL in authorized mode. Here, every query requires prior approval by a fixed set of approvers before it can be executed.

Here, a query is a request for a single computation that crandas sends to the VDL servers. Every crandas command, such as a join or a concatenation, works by sending one or more queries to the VDL. The VDL distributes each query among all three servers, and if all servers agree that the query should be executed, they perform it and send the result (usually a table) back to the client.

In authorized mode, the VDL servers only accept a query if it is accompanied by a list of digital signatures: one signature for each pre-configured approver. Approvers hold a secret signing key that they use to generate these signatures. Each signing key has a unique associated verification key that may be distributed publicly. The VDL servers are configured with the verification key of each approver, and use these verification keys to check whether the signatures on the queries they receive are valid.

The actual situation is a little more involved: besides the signatures of the approvers, the analyst using crandas also signs the query. This is because the approvers only approve a query to be executed by a single designated analyst. To accomplish this, the crandas user generates an analyst key, which, like the approver keys above, consists of two parts: a secret signing key and a public verification key. The approvers include the analyst's verification key in their approval. When crandas wants to execute a query, it first signs the query using the secret analyst key, and then sends the query, together with the signatures from the approvers, to the VDL server.
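
To make the roles of the different keys concrete, here is a purely conceptual sketch of this kind of signature scheme, using Ed25519 signatures from the cryptography package. It is not the actual VDL protocol, crandas API, or query encoding, only an illustration of how signing keys and verification keys relate:

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ed25519

# The analyst generates a key pair; the approval is bound to the analyst's
# verification key.
analyst_sk = ed25519.Ed25519PrivateKey.generate()
analyst_vk = analyst_sk.public_key()
analyst_vk_bytes = analyst_vk.public_bytes(
    serialization.Encoding.Raw, serialization.PublicFormat.Raw
)

# Placeholder encoding of a query; the real VDL query format is not shown here.
query = b"merge(survey, medical, on='patient_id')"

# Each approver holds a secret signing key; their verification keys are
# configured on the VDL servers.
approver_sks = [ed25519.Ed25519PrivateKey.generate() for _ in range(2)]
approver_vks = [sk.public_key() for sk in approver_sks]

# The approvers sign the query together with the analyst's verification key,
# so the approval is only valid for that designated analyst.
approval = [sk.sign(query + analyst_vk_bytes) for sk in approver_sks]

# Before sending the query, the analyst signs it as well.
analyst_sig = analyst_sk.sign(query)

def server_accepts(query, approval, analyst_sig):
    """Accept only if every approver signature and the analyst signature
    verify against the configured verification keys."""
    try:
        for sig, vk in zip(approval, approver_vks):
            vk.verify(sig, query + analyst_vk_bytes)
        analyst_vk.verify(analyst_sig, query)
        return True
    except InvalidSignature:
        return False

assert server_accepts(query, approval, analyst_sig)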

Working within authorized mode

The workflow within authorized mode is roughly as follows:

  1. Data providers upload their data to the production environment.

  2. An analyst wants to use this data to perform some analysis. They use a design environment of the VDL to design their queries. For this, they upload dummy data in the same format as the real data in the production environment.

    In the design environment, the analyst can freely explore and execute all queries.

  3. Once the analyst is satisfied with their analysis, they record their analysis (e.g. a Jupyter notebook) using crandas.script. The result is a JSON-formatted script file.

  4. The analyst submits their script for approval, e.g. using the Portal.

  5. The approvers examine the script and determine whether they will allow the analysis.

  6. Once the analysis is approved, the analyst downloads the approved script (the JSON-formatted script file that has been digitally signed), e.g. from the Portal.

  7. The analyst loads the approved script, and re-runs their analysis notebook in the production environment.

Recording an analysis

To record an analysis, the user calls crandas.script.record() before executing any crandas commands. Then, they may perform any number of crandas commands. Since recording happens when the user is connected to the design environment, the commands will return normal output (on the dummy data).

After the analysis is complete, the user calls crandas.script.save() or Script.save(), which saves the recorded analysis as a JSON-formatted script file. This file can then be submitted for approval.

An example:

import crandas as cd
import crandas.script

# The name may be omitted, but it helps the approvers
script = cd.script.record(name="Filter tall shapes")

shapes = cd.DataFrame({
    "id": [1, 2, 3],
    "length": [32, 86, 21],
    "height": [41, 66, 11],
}, auto_bounds=True)

tall_shape_filter = shapes["height"] > shapes["length"]
tall_shapes = shapes[tall_shape_filter]

print(tall_shapes.open())

script.save("tall-shapes.json")

The result is that the recorded script will be placed in a JSON-formatted script file named tall-shapes.json. The user can now upload this file using the Portal to obtain approval.

Scripts are always approved to be executed by a designated analyst, who holds a secret analyst key. By default, the current user’s Session.analyst_key is included in the recorded script.
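
As an illustration, the key of the current session can be inspected before recording; the exact attribute path below is an assumption based on the Session.analyst_key attribute mentioned above:

import crandas as cd

# Assumed attribute path; by default this is the key that is embedded
# in scripts recorded within this session
print(cd.base.session.analyst_key)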

Executing an approved analysis

Once the script has been approved, the analyst will obtain a digitally signed version of the script file. The analyst can then perform the analysis:

import crandas as cd
import crandas.script

# The following line was changed to load the script
script = cd.script.load("tall-shapes.approved")

shapes = cd.DataFrame({
    "id": [1, 2, 3],
    "length": [32, 86, 21],
    "height": [41, 66, 11],
}, auto_bounds=True)

tall_shape_filter = shapes["height"] > shapes["length"]
tall_shapes = shapes[tall_shape_filter]

print(tall_shapes.open())

script.close()

Note that the commands of the analysis are exactly the same as when the script was recorded, and they occur in exactly the same order. This is necessary for the script to work correctly. When recording scripts from a Jupyter notebook, it is therefore recommended to save the notebook and then use the “Run all” command to execute all cells in order. Once the approved script has been obtained, the only changes needed are replacing cd.script.record with cd.script.load and script.save with script.close, along with the configuration that points crandas to the correct server.

Dummy data

In the above example analysis, the computation consisted of an upload together with a filter command. Typically, uploads will not happen inside of the analysis, but rather an analysis will refer to data that was uploaded earlier. In the production environment, the analysis will refer to the real data, whereas in the design environment we use dummy data to explore.

As we have seen, we may refer to tables using their handle, which gets randomly generated on upload. An analysis may look like this:

import crandas as cd
import crandas.script

# The name may be omitted, but it helps the approvers
script = cd.script.record(name="Calculate the mean stroke age")

# 180A66... and EFC8A9... are the handles of tables that were uploaded to the production environment
survey = cd.get_table("180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")
medical = cd.get_table("EFC8A915781AE712856CA7F88AAED560A92D154BA6F78B90DCE36B287B9BC2AA")
merged = cd.merge(survey, medical, how="inner", left_on="patient_id", right_on="patientnr")

stroke_filter = merged["Stroke"] == 1
mean_stroke_age = merged[stroke_filter]["Age"].mean()
print(mean_stroke_age)

script.save("mean-stroke-age.json")

The problem is that if the handles 180A66... and EFC8A9... refer to tables in the production environment, then when recording the script in the design environment, the get_table functions will return an error, because the tables do not exist: if dummy data was uploaded for the survey and medical tables, they will have different random handles. We could change the handles in the analysis to use the handles for the dummy data, but then the recorded script would have the wrong handles.

One solution is to use named tables, as we have seen before, where the table name can be the same in both the design and production environments.
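
For instance, assuming the data providers uploaded the tables under the names "survey" and "medical" in both environments, and assuming that (as in the earlier section on named tables) such a name can be passed to cd.get_table() in place of a handle, the analysis could start as follows:

# Sketch: retrieve the tables by name instead of by handle (the names are
# assumed to have been chosen by the data providers in both environments)
survey = cd.get_table("survey")
medical = cd.get_table("medical")
merged = cd.merge(survey, medical, how="inner", left_on="patient_id", right_on="patientnr")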

A different approach, using handles, is the dummy_for construct. When uploading dummy data for the analysis above, we use the following:

# This happens before the call to cd.script.record()
import pandas as pd

our_survey = pd.read_csv("gs_data/health_data_survey.csv")
their_survey = pd.read_csv("gs_data/health_their_survey.csv")
survey = pd.concat([our_survey, their_survey])
cd.upload_pandas_dataframe(survey, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")

Note that we uploaded the dummy data while specifying the table handle used in the production environment. We do the same for the medical table, as shown below.
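
For instance, the dummy data for the medical table might be uploaded like this (the CSV file name is illustrative):

# Illustrative: upload dummy medical data under the production handle of the medical table
medical_dummy = pd.read_csv("gs_data/health_data_medical.csv")
cd.upload_pandas_dataframe(medical_dummy, dummy_for="EFC8A915781AE712856CA7F88AAED560A92D154BA6F78B90DCE36B287B9BC2AA")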

Now, if we record the above analysis in the design environment, the script will record correctly! While in recording mode [1], any call to crandas.get_table() will have its handle replaced by the dummy handle to produce the command result; however, the real handle will still be present in the recorded script. This means that when the script is later saved and approved, the same sequence of commands can be executed in the production environment.

Uploading data from crandas within authorized mode

To upload data to the production environment, we also need approval. A typical user can use the Portal, which is already authorized to perform uploads. To upload data from crandas, the user needs a similar authorization.

Roughly, there are two possibilities:

Option 1: Recording an upload script

The user records a script for a single upload. They can use the methods described above to record a script that uploads a table. Of course, they may upload several tables in one script as well. They submit the script for approval, and then load the approved script for use in the production environment, where they upload the real data.

>>> cd.script.record()
>>> table = cd.read_csv("dummy_data.csv")
>>> print(table.handle)
>>> cd.script.save("upload-dummy-data.json")

Note that we do not use dummy_for here, because there is no production handle available yet.

And then:

>>> cd.script.load("upload-dummy-data.approved")
>>> table = cd.read_csv("real_data.csv")
>>> print(table.handle)
>>> cd.script.close()

We print the handle and should store it somewhere, because we will likely need it to refer to the table in a later analysis.
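
For example, the handle could be written to a small text file right after the upload and read back in a later analysis (the file name is just an example):

# Store the production handle for later reference (file name is illustrative)
with open("real_data_handle.txt", "w") as f:
    f.write(str(table.handle))

# In a later analysis, the table can then be retrieved by this handle
table = cd.get_table(open("real_data_handle.txt").read().strip())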

The downside of this approach is that the query contains structural information about the uploaded table. This means that once approval is obtained, the user may only upload a table with a structure identical to the one that was authorized. In practice, whenever the user wants to upload a different table, they have to obtain approval again.

One advantage of this approach is that named tables may be used, e.g.

>>> cd.read_csv("dummy_data.csv", name="survey")

This is useful when designing scripts for which the production handles are not yet available. Also, the name is easier to memorize than a randomly generated handle.

Option 2: Authorizing queries outside of scripts

Approvers can also authorize queries outside of an exact script that needs to be executed in order. Instead, they may approve individual queries on their own, possibly subject to certain constraints. An analyst may freely use these authorizations within their analysis.

As an example, the approvers may authorize arbitrary uploads. The analyst obtains a signed query file containing the approval. They can use it as follows:

cd.base.session.load_authorization("allow-any-upload.approved")
cd.read_csv("real_data.csv")

Several authorizations may be loaded into a single session, and they are all searched to match the query that the user wants to execute.
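
For example, if the approvers have issued several separate approvals, they can all be loaded at the start of the session (the second file name here is hypothetical):

cd.base.session.load_authorization("allow-any-upload.approved")
cd.base.session.load_authorization("allow-mean-age.approved")  # hypothetical second approval

# Each query sent in this session is matched against the loaded authorizations
table = cd.read_csv("real_data.csv")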