Approvals¶
Most of the material in this guide applies when executing crandas in a design environment, which generally runs in unauthorized mode. In this mode, any query is allowed and is immediately executed by the engine, which makes it very useful for interactive exploration. However, queries that reveal all source data can also be executed, so this mode is mostly intended for environments that contain only dummy data.
In authorized environments, all queries require prior approval by a fixed set of approvers before they can be executed. Please check the help center for more information on design and authorized environments.
Working in authorized environments¶
The workflow within authorized mode is roughly as follows:
1. Data providers upload their data to the authorized environment.
2. An analyst wants to use this data to perform some analysis. They use a design environment of the engine to design their queries. For this, they upload dummy data in a format identical to the real data in the authorized environment. In the design environment, the analyst can freely explore and execute all queries.
3. Once the analyst is satisfied with their analysis, they record it (e.g. a Jupyter notebook) using crandas.script. The result is a JSON-formatted script file.
4. The analyst submits their script for approval, e.g. using the platform.
5. The approvers examine the script and determine whether they will allow the analysis.
6. Once the analysis is approved, the analyst downloads the approved script (the JSON-formatted script file that has been digitally signed), e.g. from the platform.
7. The analyst loads the approved script and re-runs their analysis notebook in the authorized environment.
Step 1: Uploading production data¶
The easiest way to upload data is to use the platform. Here, data providers can use a web interface to upload their data in CSV format. Each table gets a randomly generated unique handle associated with it. The analyst can see a list of uploaded tables and their associated handles; they will use these handles in their analysis script.
Instead of the platform, crandas can also be used directly; see Uploading data from crandas within authorized mode.
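As an illustration, a direct upload might look like the following sketch. It assumes an established session with the authorized environment; the file path is a placeholder, and reading the handle back via the handle attribute is an assumption here.
import crandas as cd

# Upload a production table directly from a CSV file;
# auto_bounds derives value bounds from the data
table = cd.read_csv('production.csv', auto_bounds=True)

# The uploaded table is identified by a unique handle, which the analyst
# references in their script (assumed accessible as an attribute here)
print(table.handle)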
Step 2: Exploring with dummy data¶
Once the structure and handles of all production data are known, the analyst can use the design environment to design their analysis. For each of the tables in an authorized environment, the analyst uploads a table of dummy data with identical column names and types.
To upload dummy data, crandas should be used. The dummy_for argument ensures that the data is linked to the appropriate production data table:
table = cd.upload_pandas_dataframe(dummy_table, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")
Alternatively, using a CSV file:
table = cd.read_csv('../../data/dummy.csv', auto_bounds=True, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")
Here, 180A66... is the handle of the corresponding table in the authorized environment. It can be copied from the platform.
Dummy data can be generated using any available package, such as Faker; a sketch of this follows the example below. Alternatively, the analyst can create and fill a dummy dataframe within the script itself:
table = cd.DataFrame({
    "ints_column": [1, 2, 3],
    "string_column": ["a", "b", "c"],
    "dates_column": ["01/02/2000", "02/02/2000", "03/02/2000"],
}, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E", auto_bounds=True)
When the dummy data has been initialized and identified as a dummy for the production dataset with the given handle, it can be used to design and record a script.
The analyst can interactively perform operations on dummy tables, to extract the desired information. Once the analyst is satisfied with their analysis, they can clean up their notebook so that the entire notebook is executable in a linear fashion from top to bottom.
Step 3: Recording an analysis¶
To record an analysis, the user calls crandas.script.record() before executing any crandas commands. Any commands executed afterwards are appended to the recorded script. Since recording happens when the user is connected to the design environment, the commands will return normal output (on the dummy data).
Important
The structure of the tables used in the design script must be the same as in the authorized environment (except for the content and number of rows). This means that tables must have the same number of columns, with the same types (including nullability), names and order. Tables should also be referenced in the same order.
After the analysis is complete, the user calls crandas.script.save(), which saves the recorded analysis as a JSON-formatted script file. This file can then be submitted for approval.
An example:
import crandas as cd
import crandas.script

# Upload dummy data linked to the production table by its handle
shapes = cd.DataFrame({
    "id": [1, 2, 3],
    "length": [32, 86, 21],
    "height": [41, 66, 11],
}, auto_bounds=True, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")

# The name may be omitted, but it helps the approvers
script = cd.script.record(name="Filter tall shapes")

# Recorded analysis: retrieve the table and filter shapes taller than they are long
shapes = cd.get_table("180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")
tall_shape_filter = shapes["height"] > shapes["length"]
tall_shapes = shapes[tall_shape_filter]
print(tall_shapes.open())

script.save("tall-shapes.json")
The result is that the recorded script will be placed in a JSON-formatted script file named tall-shapes.json.
The user can now upload this file using the platform to obtain approval.
Note
Scripts are always approved for execution by a designated analyst, who holds a secret analyst key. When using the platform to approve scripts, the platform will use the analyst key that is stored inside the platform itself.
Using Jupyter¶
If the analyst has interactively designed their analysis in a Jupyter notebook connected to the design environment, and they are satisfied with it, they can record the script by inserting cd.script.record() at the top of their analysis. At the bottom, they put the command cd.script.save(filename).
They then run their analysis script. For example, if they are using a Jupyter notebook, they use Cells -> Run all (or Kernel -> Restart & Run all). This will start recording a script, run the commands in order against the design environment, and save the script to the specified filename.
Note that after starting to record a script, it is important that all commands are executed in the same order as they will be executed later in the authorized environment. That’s why we recommend using the “Run all” command in a notebook: this ensures all cells are executed top to bottom.
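Put together, a recording notebook might be laid out roughly as follows (a sketch; the script name and filename are placeholders):
# Cell 1: start recording before any other crandas commands
import crandas as cd
import crandas.script

script = cd.script.record(name="My analysis")

# Cells 2 to n: the analysis itself, executed top to bottom on dummy data
# ...

# Final cell: save the recorded script so it can be submitted for approval
script.save("my-analysis.json")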
Step 4: Submitting for approval¶
Recording a script produces a .json-formatted script file. Using the platform, the analyst can upload their JSON file and request approval.
Step 5: Approving a script¶
The approvers use the platform to examine the script. They will approve it if they determine that all information produced by the analysis meets the parameters of the collaboration agreement. To approve, they digitally sign the script using their private key (the approver key), authorizing the script for execution by a designated analyst. Usually, this is also the analyst that created the script.
Step 6: Downloading the approved script¶
The analyst downloads an .approved script file from the platform. This is a digitally signed version of the script file that was submitted in step 4. The script is approved for a designated analyst, who is in possession of the correct analyst key.
Step 7: Executing the analysis¶
Once the script has been approved, the analyst will obtain a digitally signed version of the script file. The analyst can then perform the analysis:
import crandas as cd
import crandas.script
# The analyst key must be loaded to be able to execute the query
cd.base.session.analyst_key = path_to_analyst_key
# The following line was changed to load the script
script = cd.script.load("tall-shapes.approved")
# Retrieve the production table by its handle, exactly as in the recorded script
shapes = cd.get_table("180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")
tall_shape_filter = shapes["height"] > shapes["length"]
tall_shapes = shapes[tall_shape_filter]
print(tall_shapes.open())
# This line was also changed
script.close()
Since the analysis is identical to the recorded script, except for the two script commands, it will match the authorization and execute on the authorized environment.
Using Jupyter¶
The notebook should be modified to connect to the correct authorized environment.
Instead of having cd.script.record() at the top, the analyst inserts cd.script.load(approved_script_filename).
Then, at the bottom, they replace cd.script.save() with cd.script.close().
They also ensure they load their analyst key, by inserting cd.base.session.analyst_key = path_to_analyst_key at the top of the script.
After the notebook is set up, they use Cells -> Run all to execute their notebook from top to bottom.
As mentioned before, it is essential that the cells are executed in exactly the same order as they were when recording the script.
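Putting these modifications together, an execution notebook might be laid out roughly as follows (a sketch; the key path and filenames are placeholders):
# Cell 1: load the analyst key and the approved script
import crandas as cd
import crandas.script

cd.base.session.analyst_key = path_to_analyst_key
script = cd.script.load("my-analysis.approved")

# Cells 2 to n: exactly the same analysis commands, in the same order as recorded
# ...

# Final cell: close the script after the last recorded command
script.close()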