Approvals#

Most of the material in this guide applies when executing crandas in a design environment, which generally runs in unauthorized mode. In this mode, any query is allowed and is executed immediately by the VDL, which makes it very useful for interactive exploration. However, queries that reveal all source data can also be executed, so this mode is intended primarily for environments that contain only dummy data.

In production environments, we recommend running the VDL in authorized mode. In this mode, every query requires prior approval by a fixed set of approvers before it can be executed. Please check the help center for more information on design and production environments.

Working within authorized mode#

The workflow within authorized mode is roughly as follows:

  1. Data providers upload their data to the production environment.

  2. An analyst wants to use this data to perform some analysis. They use a design environment of the VDL to design their queries. For this, they upload dummy data in a format identical to the real data in the production environment.

    In the design environment, the analyst can freely explore and execute all queries.

  3. Once the analyst is satisfied with their analysis, they record their analysis (e.g. a Jupyter notebook) using crandas.script. The result is a JSON-formatted script file.

  4. The analyst submits their script for approval, e.g. using the web portal.

  5. The approvers examine the script and determine whether they will allow the analysis.

  6. Once the analysis is approved, the analyst downloads the approved script (the JSON-formatted script file that has been digitally signed), e.g. from the portal.

  7. The analyst loads the approved script, and re-runs their analysis notebook in the production environment.

Step 1: Uploading production data#

The easiest way to upload data is to use the Portal. Here, data providers can use a web interface to upload their data in CSV format. Each table gets a randomly generated unique handle associated with it. The analyst can see a list of uploaded tables and their associated handles; they will use these handles in their analysis script.

Instead of using the portal, crandas can also be used directly; see Uploading data from crandas within authorized mode.
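As a sketch of this direct route (the CSV filename and the .handle attribute are illustrative assumptions):

import pandas as pd
import crandas as cd

# Upload production data directly from crandas
df = pd.read_csv("production_data.csv")  # illustrative filename
table = cd.upload_pandas_dataframe(df, auto_bounds=True)

# The handle identifies the uploaded table; analysts reference it in their scripts
print(table.handle)  # assumes the table object exposes its handle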

Step 2: Exploring with dummy data#

Once the structure and handles of all production data are known, the analyst can use the design environment to design their analysis. For each of the tables in production, the analyst uploads a table of dummy data with identical column names and types.

To upload dummy data, crandas should be used. The dummy_for argument ensures that data is linked to the appropriate production table:

table = cd.upload_pandas_dataframe(dummy_table, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")

Alternatively, using a CSV file:

table = cd.read_csv('../../data/dummy.csv', auto_bounds=True, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")

Here, 180A66... is the handle of the corresponding table in the production environment. It can be copied from the web portal.

Dummy data can be generated using any available package, such as Faker. Alternatively, the analyst can create and fill a dummy dataframe within the script itself:

table = cd.DataFrame({"ints_column": [1, 2, 3],
                      "string_column": ["a", "b", "c"],
                      "dates_column": ["01/02/2000", "02/02/2000", "03/02/2000"]},
                     dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E",
                     auto_bounds=True)

When the dummy data has been initialized and identified as a dummy for the production dataset with the given handle, it can be used to design and record a script.

The analyst can interactively perform operations on dummy tables, to extract the desired information. Once the analyst is satisfied with their analysis, they can clean up their notebook so that the entire notebook is executable in a linear fashion from top to bottom.
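For example, while connected to the design environment, the production handle resolves to the dummy table uploaded for it, so intermediate results can be opened freely (a sketch; the filter condition is illustrative):

table = cd.get_table("180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")

# Experiment with operations and inspect intermediate results on the dummy data
selection = table[table["ints_column"] > 1]
print(selection.open())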

Step 3: Recording an analysis#

To record an analysis, the user calls crandas.script.record() before executing any crandas commands. Any commands executed afterwards are appended to the recorded script. Since recording happens when the user is connected to the design environment, the commands will return normal output (on the dummy data).

After the analysis is complete, the user calls crandas.script.save(), which saves the recorded analysis as a JSON-formatted script file. This file can then be submitted for approval.

An example:

import crandas as cd
import crandas.script

shapes = cd.DataFrame({
    "id": [1, 2, 3],
    "length": [32, 86, 21],
    "height": [41, 66, 11],
}, auto_bounds=True, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")

# The name may be omitted, but it helps the approvers
script = cd.script.record(name="Filter tall shapes")

shapes = cd.get_table("180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")

tall_shape_filter = shapes["height"] > shapes["length"]
tall_shapes = shapes[tall_shape_filter]

print(tall_shapes.open())

script.save("tall-shapes.json")

The recorded script is written to a JSON-formatted script file named tall-shapes.json. The user can now upload this file using the Portal to obtain approval.

Note

Scripts are always approved for execution by a designated analyst, who holds a secret analyst key. When using the Portal to approve scripts, the Portal will use the analyst key that is stored inside the Portal itself.

Using Jupyter#

If the analyst has interactively designed their analysis in a Jupyter notebook connected to the design environment, and they are satisfied with it, they can record the script by inserting cd.script.record() at the top of their analysis. At the bottom, they put the command cd.script.save(filename).
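For example, the notebook could be bracketed as follows (the script name and filename are illustrative):

# First cell of the notebook
import crandas as cd
import crandas.script

script = cd.script.record(name="My analysis")

# ... the analysis cells go here ...

# Last cell of the notebook: save the recorded script, as in the earlier example
script.save("my-analysis.json")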

They then run the entire notebook, using Cells -> Run all (or Kernel -> Restart & Run all). This will start recording a script, run the commands in order against the design environment, and save the script to the specified filename.

Note that after starting to record a script, it is important that all commands are executed in the same order as they will be executed later in the production environment. That is why we recommend using the “Run all” command in a notebook: it ensures all cells are executed from top to bottom.

Step 4: Submitting for approval#

Recording a script produces a JSON-formatted script file. Using the Portal, the analyst can upload this file and request approval.

Step 5: Approving a script#

The approvers use the Portal to examine the script. They will approve it if they determine that all information produced by the analysis meets the parameters of the collaboration agreement. To approve, they digitally sign the script using their private key (the approver key), and authorize the script for execution by a designated analyst. Usually, this is the analyst who created the script.

Step 6: Downloading the approved script#

The analyst downloads a .approved script file from the Portal. This is a digitally signed version of the script file that was submitted in step 4. The script is approved for a designated analyst, who is in possession of the correct analyst key.

Step 7: Executing the analysis#

Once the script has been approved, the analyst will obtain a digitally signed version of the script file. The analyst can then perform the analysis:

import crandas as cd
import crandas.script

# The following line was changed to load the script
script = cd.script.load("tall-shapes.approved")

shapes = cd.get_table("180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")

tall_shape_filter = shapes["height"] > shapes["length"]
tall_shapes = shapes[tall_shape_filter]

print(tall_shapes.open())

# This line was also changed
script.close()

Since the analysis is identical to the recorded script, except for the two script commands, it will match the authorization and execute on the production environment.

Using Jupyter#

The notebook should be modified to connect to the correct production instance. Instead of having cd.script.record() at the top, the analyst inserts cd.script.load(approved_script_filename). Then, at the bottom, they replace cd.script.save() with cd.script.close(). They also ensure they load their analyst key, by inserting cd.base.session.analyst_key = path_to_analyst_key at the top of the script.
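Putting these modifications together, the notebook brackets could look as follows (the key path is an illustrative assumption):

# First cell of the notebook
import crandas as cd
import crandas.script

# Load the analyst key (the path is illustrative)
cd.base.session.analyst_key = "keys/analyst.sk"

# Changed from cd.script.record(...)
script = cd.script.load("tall-shapes.approved")

# ... the analysis cells, in exactly the same order as when recording ...

# Last cell of the notebook, changed from script.save(...)
script.close()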

After the notebook is set up, they use Cells -> Run all to execute their notebook from top to bottom. As mentioned before, it is essential that the cells are executed in exactly the same order as they were when recording the script.