Approvals¶
Most of the material in this guide applies to running crandas in design mode. In this mode, any query is allowed and is executed immediately by the engine, which makes it well suited to interactive exploration. However, this includes queries that reveal all source data, so design mode is mostly intended for dummy data.
In authorized mode, all queries require prior approval by a fixed set of approvers before they can be executed. When connecting to the engine in design mode, data from the authorized mode is not available, and the other way around. Please check the help center for more information on authorized and design modes.
Working in authorized mode¶
The workflow within authorized mode is roughly as follows:
Data providers upload their data to the engine in authorized mode.
An analyst wants to use this data to perform some analysis. They connect to the engine in design mode to design their queries. For this, they upload dummy data in a format identical to that of the real data used in authorized mode.
In the design mode, the analyst can freely explore and execute all queries.
Once the analyst is satisfied with their analysis, they record their analysis (e.g. a Jupyter notebook) using crandas.script. The result is a .recording script file.
The analyst submits their script for approval, e.g. using the platform.
The approvers examine the script and determine whether they will allow the analysis.
Once the analysis is approved, the analyst downloads the approved script (the .approved script file that has been digitally signed), e.g. from the platform.
The analyst loads the approved script, and re-runs their analysis notebook while connected to the engine in authorized mode.
Step 1: Uploading production data¶
The easiest way to upload data is to use the platform. Here, data providers can use a web interface to upload their data in CSV format. Each table will get a randomly generated unique handle associated to it. The analyst can see a list of uploaded tables and their associated handles; they will use these handles in their analysis script.
Instead of the platform, crandas can also be used directly, see Uploading data from crandas within authorized mode.
Step 2: Exploring with dummy data¶
Once the structure and handles of all production data are known, the analyst can use the engine in design mode to design their analysis. For each of the tables in authorized mode, the analyst uploads a table of dummy data with identical column names and types.
To upload dummy data, crandas should be used.
The dummy_for argument ensures that the data is linked to the appropriate production data table:
table = cd.upload_pandas_dataframe(dummy_table, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")
Alternatively, using a CSV file:
table = cd.read_csv('../../data/dummy.csv', auto_bounds=True, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")
Here, 180A66... is the handle of the corresponding table in authorized mode. It can be copied from the platform.
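Because the handle is copied by hand, a quick sanity check before using it can catch truncated or mis-pasted values. The sketch below assumes that handles always look like the example above (64 hexadecimal characters); that format is inferred from the example and is not a documented guarantee, and looks_like_handle is a hypothetical helper, not part of crandas.

```python
import re

# Assumption: handles are 64 hexadecimal characters, as in the example above.
_HANDLE_PATTERN = re.compile(r"[0-9A-Fa-f]{64}")


def looks_like_handle(value: str) -> bool:
    """Return True if `value` has the shape of the example table handle."""
    return _HANDLE_PATTERN.fullmatch(value.strip()) is not None


handle = "180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E"
print(looks_like_handle(handle))       # the full handle from the example
print(looks_like_handle(handle[:-1]))  # the same handle with one character missing
```

A check like this only validates the shape of the string; whether the handle refers to an existing table can only be confirmed by the engine.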
Dummy data can be generated by means of any packages that are available, such as Faker. Alternatively, the analyst can create and fill a dummy dataframe within the script itself:
table = cd.DataFrame({"ints_column": [1, 2, 3],
                      "string_column": ["a", "b", "c"],
                      "dates_column": ["01/02/2000", "02/02/2000", "03/02/2000"]},
                     dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E",
                     auto_bounds=True)
When the dummy data has been initialized and identified as a dummy for the production dataset having the given handle, it can be used to design and record a script.
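A hand-written dictionary works for a few rows; for larger dummy tables, typing values by hand becomes tedious. As a minimal sketch, the columns can be generated with the Python standard library instead. The column names mirror the hand-written example, while make_dummy_columns, the value ranges and the fixed seed are illustrative assumptions.

```python
import random


def make_dummy_columns(n_rows: int, seed: int = 0) -> dict:
    """Build random dummy columns shaped like the hand-written example table."""
    rng = random.Random(seed)  # fixed seed so the dummy data is reproducible
    return {
        "ints_column": [rng.randint(0, 100) for _ in range(n_rows)],
        "string_column": [rng.choice("abc") for _ in range(n_rows)],
        # The example's dates could be read as day/month or month/day;
        # keeping both fields at 12 or below makes the strings valid either way.
        "dates_column": [
            f"{rng.randint(1, 12):02d}/{rng.randint(1, 12):02d}/2000"
            for _ in range(n_rows)
        ],
    }


columns = make_dummy_columns(5)
print(columns)
# The resulting dict can be passed to cd.DataFrame together with the
# dummy_for and auto_bounds arguments, as in the hand-written example.
```

Only the structure of the dummy table has to match the production table, so the generated values themselves can be arbitrary.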
The analyst can interactively perform operations on dummy tables, to extract the desired information. Once the analyst is satisfied with their analysis, they can clean up their notebook so that the entire notebook is executable in a linear fashion from top to bottom.
Step 3: Recording an analysis¶
To record an analysis, the user calls crandas.script.record() before executing any crandas commands.
Any commands executed afterwards are appended to the recorded script.
Since recording happens when the user is connected in design mode, the commands will return normal output (on the dummy data).
Important
The structure of the tables used in the design script must be the same as in the authorized mode (except for the content and number of rows). This means that tables must have the same number of columns, with the same types (including nullability), names and order. Tables should also be referenced in the same order.
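A mismatch in table structure is easy to introduce and only surfaces later, so it can help to compare the two layouts explicitly before recording. The sketch below is a plain-Python illustration: a column is described by a hypothetical (name, type, nullable) tuple with made-up type names, and schema_mismatches is not a crandas function. How the column descriptions are obtained is left open.

```python
from typing import List, Tuple

# Hypothetical column description: (name, type name, nullable).
Column = Tuple[str, str, bool]


def schema_mismatches(dummy: List[Column], production: List[Column]) -> List[str]:
    """List the differences between two table layouts, respecting column order."""
    problems = []
    if len(dummy) != len(production):
        problems.append(f"column count differs: {len(dummy)} vs {len(production)}")
    # zip stops at the shorter list; the count check above covers the rest.
    for position, (d, p) in enumerate(zip(dummy, production)):
        if d != p:
            problems.append(f"column {position}: dummy {d} != production {p}")
    return problems


production = [("id", "int", False), ("length", "int", False), ("height", "int", True)]
dummy = [("id", "int", False), ("height", "int", True), ("length", "int", False)]
for problem in schema_mismatches(dummy, production):
    print(problem)  # one line for each of the two swapped columns
```

Comparing position by position, rather than as sets, reflects the requirement that column order must match as well as names and types.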
After the analysis is complete, the user calls crandas.script.save(), which saves the recorded analysis as a .recording script file.
This file can then be submitted for approval.
An example:
import crandas as cd
import crandas.script
shapes = cd.DataFrame({
    "id": [1, 2, 3],
    "length": [32, 86, 21],
    "height": [41, 66, 11],
}, auto_bounds=True, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")
# The name may be omitted, but it helps the approvers
script = cd.script.record(name="Filter tall shapes")
shapes = cd.get_table("180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")
tall_shape_filter = shapes["height"] > shapes["length"]
tall_shapes = shapes[tall_shape_filter]
print(tall_shapes.open())
script = script.save("tall-shapes.recording")
The recorded script is saved to a .recording script file named tall-shapes.recording.
The user can now upload this file using the platform to obtain approval.
Note
Scripts are always approved for execution by a designated analyst, who holds a secret analyst key. When the platform is used to approve scripts, it uses the analyst key that is stored inside the platform itself.
Using Jupyter¶
If the analyst has interactively designed their analysis in a Jupyter notebook in design mode and is satisfied with it, they can record the script by inserting cd.script.record() at the top of their analysis.
At the bottom, they put the command cd.script.save(filename).
They then run their analysis script.
For example, if they are using a Jupyter notebook, they use Cells -> Run all (or Kernel -> Restart & Run all).
This starts recording a script, runs the commands in order against the engine in design mode, and saves the script to the specified filename.
Note, after starting to record a script, it is important that all commands are executed in the same order as they will be executed later in the authorized mode. That’s why we recommend using the “Run all” command in a notebook: this ensures all cells are executed top to bottom.
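To see why order matters, consider a toy model in which a recording is simply the ordered list of commands that were issued, and a later run is accepted only if it issues the same commands in the same order. This illustrates the idea only; ToyRecording is hypothetical and does not reflect how crandas or the engine implement the check.

```python
class ToyRecording:
    """Toy model: a recording is an ordered list of command descriptions."""

    def __init__(self):
        self.commands = []

    def record(self, command: str) -> None:
        self.commands.append(command)

    def matches(self, replayed: list) -> bool:
        # Order-sensitive comparison: the same commands in a different
        # order do not match the recording.
        return self.commands == list(replayed)


recording = ToyRecording()
for command in ["get_table", "compare height > length", "filter", "open"]:
    recording.record(command)

in_order = ["get_table", "compare height > length", "filter", "open"]
out_of_order = ["get_table", "filter", "compare height > length", "open"]
print(recording.matches(in_order))      # same commands, same order
print(recording.matches(out_of_order))  # same commands, different order
```

Running notebook cells out of order corresponds to the second case: every command is present, but the sequence no longer lines up with what was recorded.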
Step 4: Submitting for approval¶
Recording a script produces a .recording script file.
Using the platform, the analyst can upload their recording and request approval.
Step 5: Approve a script¶
The approvers use the platform to examine the script. They approve it if they determine that all information produced by the analysis meets the parameters of the collaboration agreement. To approve, they digitally sign the script using their private key (approver key), and authorize the script for execution by a designated analyst. Usually, this is the same analyst who created the script.
Step 6: Downloading the approved script¶
The analyst downloads an .approved script file from the platform.
This is a digitally signed version of the script file that was submitted in step 4.
The script is approved for a designated analyst, who is in possession of the correct analyst key.
Step 7: Executing the analysis¶
Once the script has been approved, the analyst will obtain a digitally signed version of the script file. The analyst can then perform the analysis:
import crandas as cd
import crandas.script
# The analyst key must be loaded to be able to execute the query
# If `cd.connect()` was called, setting the key must be done after this call.
cd.base.session.analyst_key = path_to_analyst_key
# The following line was changed to load the script
script = cd.script.load("tall-shapes.approved")
shapes = cd.get_table("180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E")
tall_shape_filter = shapes["height"] > shapes["length"]
tall_shapes = shapes[tall_shape_filter]
print(tall_shapes.open())
# This line was also changed
script.reset()
Since the analysis is identical to the recorded script, except for the two script commands, it will match the authorization and execute in the authorized mode.
Using Jupyter¶
The notebook should be modified to connect to the engine in authorized mode by using the appropriate connection file.
Instead of having cd.script.record() at the top, the analyst inserts cd.script.load(approved_script_filename).
Then, at the bottom, they replace cd.script.save() by cd.script.reset().
They also ensure they load their analyst key by inserting cd.base.session.analyst_key = path_to_analyst_key at the top of the script (if crandas.base.connect() was called, this line must be added after it).
After the notebook is set up, they use Cells -> Run all to execute their notebook from top to bottom.
As mentioned before, it is essential that the cells are executed in exactly the same order as they were when recording the script.
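Since the recording and authorized notebooks differ only in the first and last script calls, one option is to switch between them with a single flag rather than editing cells by hand. The sketch below relies only on the four calls named in this guide (record, save, load, reset); begin_script, end_script and the AUTHORIZED flag are hypothetical names, and script_api stands for cd.script. A stand-in object is used so that the sketch runs without an engine.

```python
def begin_script(script_api, authorized: bool, approved_path: str):
    """Start recording (design mode) or load the approved script (authorized mode)."""
    if authorized:
        return script_api.load(approved_path)
    return script_api.record()


def end_script(script_api, authorized: bool, recording_path: str):
    """Save the recording (design mode) or reset the script (authorized mode)."""
    if authorized:
        return script_api.reset()
    return script_api.save(recording_path)


class StandInScriptApi:
    """Stand-in for cd.script that only logs which calls were made."""

    def __init__(self):
        self.calls = []

    def record(self):
        self.calls.append("record")

    def load(self, path):
        self.calls.append(f"load {path}")

    def save(self, path):
        self.calls.append(f"save {path}")

    def reset(self):
        self.calls.append("reset")


AUTHORIZED = False  # flip to True once the .approved file has been downloaded
api = StandInScriptApi()  # in a real notebook this would be cd.script
begin_script(api, AUTHORIZED, "tall-shapes.approved")
# ... analysis cells go here, unchanged between the two modes ...
end_script(api, AUTHORIZED, "tall-shapes.recording")
print(api.calls)
```

The connection file and the analyst key still have to be set for authorized mode as described above; the flag only covers the two script calls.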