Approvals#

Most of the material in this guide applies when executing crandas in a design environment, which generally runs in unauthorized mode. In this mode, any query is allowed and is executed immediately by the VDL, which makes it very useful for interactive exploration. However, queries that reveal all source data can also be executed, so this mode is intended primarily for environments that contain only dummy data.

In production environments, we recommend running the VDL in authorized mode. In this mode, every query requires prior approval by a fixed set of approvers before it can be executed. Please check the help center for more information on design and production environments.

Working within authorized mode#

The workflow within authorized mode is roughly as follows:

  1. Data providers upload their data to the production environment.

  2. An analyst wants to use this data to perform some analysis. They use a design environment of the VDL to design their queries. For this, they upload dummy data in a format identical to the real data in the production environment.

    In the design environment, the analyst can freely explore and execute all queries.

  3. Once the analyst is satisfied with their analysis, they record their analysis (e.g. a Jupyter notebook) using crandas.script. The result is a JSON-formatted script file.

  4. The analyst submits their script for approval, e.g. using the web portal.

  5. The approvers examine the script and determine whether they will allow the analysis.

  6. Once the analysis is approved, the analyst downloads the approved script (the JSON-formatted script file that has been digitally signed), e.g. from the portal.

  7. The analyst loads the approved script, and re-runs their analysis notebook in the production environment.

Step 1: Uploading production data#

The easiest way to upload data is to use the Portal. Here, data providers can use a web interface to upload their data in CSV format. Each table gets a randomly generated unique handle associated with it. The analyst can see a list of uploaded tables and their associated handles; they will use these handles in their analysis script.

Instead of using the portal, crandas can also be used directly; see Uploading data from crandas within authorized mode.
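As a sketch of this direct route (the CSV filename and the .handle attribute are illustrative assumptions):

import pandas as pd
import crandas as cd

# Upload production data directly from crandas
df = pd.read_csv("production_data.csv")  # illustrative filename
table = cd.upload_pandas_dataframe(df, auto_bounds=True)

# The handle identifies the uploaded table; analysts reference it in their scripts
print(table.handle)  # assumes the table object exposes its handle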

Step 2: Exploring with dummy data#

Once the structure and handles of all production data are known, the analyst can use the design environment to design their analysis. For each of the tables in production, the analyst uploads a table of dummy data with identical column names and types.

To upload dummy data, crandas should be used. The dummy_for argument ensures that data is linked to the appropriate production table:

table = cd.upload_pandas_dataframe(dummy_table, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")

Alternatively, using a CSV file:

table = cd.read_csv('../../data/dummy.csv', auto_bounds=True, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")

Here, 180A66... is the handle of the corresponding table in the production environment. It can be copied from the web portal.

Dummy data can be generated using any available package, such as Faker. Alternatively, the analyst can create and fill a dummy dataframe within the script itself:

table = cd.DataFrame({"ints_column": [1, 2, 3],
                      "string_column": ["a", "b", "c"],
                      "dates_column": ["01/02/2000", "02/02/2000", "03/02/2000"]},
                     dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E",
                     auto_bounds=True)

When the dummy data has been initialized and identified as a dummy for the production dataset with the given handle, it can be used to design and record a script.

The analyst can interactively perform operations on dummy tables, to extract the desired information. Once the analyst is satisfied with their analysis, they can clean up their notebook so that the entire notebook is executable in a linear fashion from top to bottom.
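For example, while connected to the design environment, the production handle resolves to the dummy table uploaded for it, so intermediate results can be opened freely (a sketch; the filter condition is illustrative):

table = cd.get_table("180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")

# Experiment with operations and inspect intermediate results on the dummy data
selection = table[table["ints_column"] > 1]
print(selection.open())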

Step 3: Recording an analysis#

To record an analysis, the user calls crandas.script.record() before executing any crandas commands. Any commands executed afterwards are appended to the recorded script. Since recording happens when the user is connected to the design environment, the commands will return normal output (on the dummy data).

After the analysis is complete, the user calls crandas.script.save(), which saves the recorded analysis as a JSON-formatted script file. This file can then be submitted for approval.

An example:

import crandas as cd
import crandas.script

shapes = cd.DataFrame({
    "id": [1, 2, 3],
    "length": [32, 86, 21],
    "height": [41, 66, 11],
}, auto_bounds=True, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")

# The name may be omitted, but it helps the approvers
script = cd.script.record(name="Filter tall shapes")

shapes = cd.get_table("180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")

tall_shape_filter = shapes["height"] > shapes["length"]
tall_shapes = shapes[tall_shape_filter]

print(tall_shapes.open())

script.save("tall-shapes.json")

The recorded script is written to a JSON-formatted script file named tall-shapes.json. The user can now upload this file using the Portal to obtain approval.

Note

Scripts are always approved for execution by a designated analyst, who holds a secret analyst key. When using the Portal to approve scripts, the Portal will use the analyst key that is stored inside the Portal itself.

Using Jupyter#

If the analyst has interactively designed their analysis in a Jupyter notebook connected to the design environment, and they are satisfied with it, they can record the script by inserting cd.script.record() at the top of their analysis. At the bottom, they put the command cd.script.save(filename).
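For example, the notebook could be bracketed as follows (the script name and filename are illustrative):

# First cell of the notebook
import crandas as cd
import crandas.script

script = cd.script.record(name="My analysis")

# ... the analysis cells go here ...

# Last cell of the notebook: save the recorded script, as in the earlier example
script.save("my-analysis.json")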

They then run the entire notebook, using Cells -> Run all (or Kernel -> Restart & Run all). This will start recording a script, run the commands in order against the design environment, and save the script to the specified filename.

Note that after starting to record a script, it is important that all commands are executed in the same order as they will be executed later in the production environment. That is why we recommend using the “Run all” command in a notebook: it ensures all cells are executed from top to bottom.

Step 4: Submitting for approval#

Recording a script produces a JSON-formatted script file. Using the Portal, the analyst can upload this file and request approval.

Step 5: Approving a script#

The approvers use the Portal to examine the script. They will approve it if they determine that all information produced by the analysis meets the parameters of the collaboration agreement. To approve, they digitally sign the script using their private key (the approver key), and authorize the script for execution by a designated analyst. Usually, this is the analyst who created the script.

Step 6: Downloading the approved script#

The analyst downloads a .approved script file from the Portal. This is a digitally signed version of the script file that was submitted in step 4. The script is approved for a designated analyst, who is in possession of the correct analyst key.

Step 7: Executing the analysis#

Once the script has been approved, the analyst will obtain a digitally signed version of the script file. The analyst can then perform the analysis:

import crandas as cd
import crandas.script

# The following line was changed to load the script
script = cd.script.load("tall-shapes.approved")

shapes = cd.get_table("180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")

tall_shape_filter = shapes["height"] > shapes["length"]
tall_shapes = shapes[tall_shape_filter]

print(tall_shapes.open())

# This line was also changed
script.close()

Since the analysis is identical to the recorded script, except for the two script commands, it will match the authorization and execute on the production environment.

Using Jupyter#

The notebook should be modified to connect to the correct production instance. Instead of having cd.script.record() at the top, the analyst inserts cd.script.load(approved_script_filename). Then, at the bottom, they replace cd.script.save() with cd.script.close(). They also ensure they load their analyst key, by inserting cd.base.session.analyst_key = path_to_analyst_key at the top of the script.
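Putting these modifications together, the notebook brackets could look as follows (the key path is an illustrative assumption):

# First cell of the notebook
import crandas as cd
import crandas.script

# Load the analyst key (the path is illustrative)
cd.base.session.analyst_key = "keys/analyst.sk"

# Changed from cd.script.record(...)
script = cd.script.load("tall-shapes.approved")

# ... the analysis cells, in exactly the same order as when recording ...

# Last cell of the notebook, changed from script.save(...)
script.close()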

After the notebook is set up, they use Cells -> Run all to execute their notebook from top to bottom. As mentioned before, it is essential that the cells are executed in exactly the same order as they were when recording the script.