.. _approvals: Approvals ######### Most of the material in this guide is applicable for when executing crandas in **design mode**. In this mode, any query is allowed and can be immediately executed by the engine, hence this mode is very useful for interactive exploration. However, queries that reveal all source data can also be executed, and therefore this mode is mostly intended for dummy data. In **authorized mode**, all queries require prior approval by a fixed set of *approvers* before they can be executed. When connecting to the engine in design mode, data from the authorized mode is not available, and the other way around. Please check the `help center `_ for more information on authorized and design modes. Working in authorized mode ========================== The workflow within authorized mode is roughly as follows: 1. Data providers upload their data the engine in authorized mode. 2. An analyst wants to use this data to perform some analysis. They connect to the enigne in design mode to design their queries. For this, they upload *dummy data* that is in an identical format to the *real data* that is used in the authorized mode. In the design mode, the analyst can freely explore and execute all queries. 3. Once the analyst is satisfied with their analysis, they *record* their analysis (e.g. a Jupyter notebook) using ``crandas.script``. The result is a ``.recording`` *script file*. 4. The analyst submits their script for approval, e.g. using the `platform `_ . 5. The approvers examine the script and determine whether they will allow the analysis. 6. Once the analysis is approved, the analyst downloads the *approved script* (the ``.approved`` script file that has been digitally signed), e.g. from the platform. 7. The analyst loads the *approved script*, and re-runs their analysis notebook while connected to the engine in authorized mode. Step 1: Uploading production data --------------------------------- The easiest way to upload data is to use the platform. Here, data providers can use a web interface to upload their data in CSV format. Each table will get a randomly generated unique handle associated to it. The analyst can see a list of uploaded tables and their associated handles; they will use these handles in their analysis script. Instead of the platform, crandas can also be used directly, see :ref:`authorized-uploading`. Step 2: Exploring with dummy data --------------------------------- Once the structure and handles of all production data are known, the analyst can use the engine in *design mode* to design their analysis. For each of the tables in authorized mode, the analyst uploads a table of dummy data with identical column names and types. To upload dummy data, crandas should be used. The ``dummy_for`` argument ensures that data is linked to the appropriate production data table: :: table = cd.upload_pandas_dataframe(dummy_table, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E") Or alternatively, using a csv file: :: table = cd.read_csv('../../data/dummy.csv', auto_bounds=True, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E) Here, ``180A66...`` is the handle of the corresponding table in the authorized mode. It can be copied from the platform. Dummy data can be generated by means of any packages that are available, such as `Faker `_. Alternatively, the analyst can create and fill a dummy dataframe within the script itself: :: table = cd.DataFrame({"ints_column":[1, 2, 3], "string_column":["a", "b", "c"], "dates_column":["01/02/2000", "02/02/2000", "03/02/2000"]}, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E", auto_bounds=True) When the dummy data has been initialized and identified as a dummy for the production dataset having the given handle, it can be used to design and record a script. The analyst can interactively perform operations on dummy tables, to extract the desired information. Once the analyst is satisfied with their analysis, they can clean up their notebook so that the entire notebook is executable in a linear fashion from top to bottom. Step 3: Recording an analysis ----------------------------- To record an analysis, the user calls :func:`.crandas.script.record` before executing any crandas commands. Any commands executed afterwards are appended to the recorded script. Since recording happens when the user is **connected in design mode**, the commands will return normal output (on the dummy data). .. important:: The structure of the tables used in the design script must be the same as in the authorized mode (except for the content and number of rows). This means that tables must have the same number of columns, with the same types (including nullability), names and order. Tables should also be referenced in the same order. After the analysis is complete, the user calls :func:`.crandas.script.save`, which saves the recorded analysis as a ``.recording`` script file. This file can then be submitted for approval. An example:: import crandas as cd import crandas.script shapes = cd.DataFrame({ "id": [1, 2, 3], "length": [32, 86, 21], "height": [41, 66, 11], }, auto_bounds=True, dummy_for="180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E") # The name may be omitted, but it helps the approvers script = cd.script.record(name="Filter tall shapes") shapes = cd.get_table("180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E") tall_shape_filter = shapes["height"] > shapes["length"] tall_shapes = shapes[tall_shape_filter] print(tall_shapes.open()) script = script.save("tall-shapes.recording") The result is that the recorded script will be placed in a ``.recording`` script file named ``tall-shapes.recording``. The user can now upload this file using the platform to obtain approval. .. note:: Scripts are always approved to be executed by a *designated analyst*, that holds a secret analyst key. When using the platform to approve scripts, the platform will use the analyst key that is stored inside the platform itself. Using Jupyter ^^^^^^^^^^^^^ If the analyst has interactively designed their analysis using a Jupyter notebook in the design mode, and they are satisfied with it, they can record the script by inserting ``cd.script.record()`` at the top of their analysis. At the bottom, the put the command ``cd.script.save(filename)``. They then run their analysis script. For example, if they are using a Jupyter notebook, they use ``Cells -> Run all`` (or ``Kernel -> Restart & Run all``). This will start recording a script, run the commands in order against in the design mode, and save the script to the specified filename. Note, after starting to record a script, it is important that all commands are executed in **the same order** as they will be executed later in the authorized mode. That's why we recommend using the "Run all" command in a notebook: this ensures all cells are executed top to bottom. Step 4: Submitting for approval ------------------------------- Recording a script produces a ``.recording`` script file. Using the platform, the analyst can upload their recording and request approval. Step 5: Approve a script ------------------------ The approvers will use to platform to examine the script. They will approve it if they determine all information produced by the analysis meets the parameters of the collaboration agreement. To approve, they digitally sign the script using their private key (*approver key*), and authorize the script for execution by a designated *analyst*. Usually, this is also the analyst that created the script. Step 6: Downloading the approved script --------------------------------------- The analyst downloads an ``.approved`` script file from the platform. This is a digitally signed version of the script file that was submitted in step 4. The script is approved for a designated analyst, that is in possession of the correct *analyst key*. Step 7: Executing the analysis ------------------------------ Once the script has been approved, the analyst will obtain a digitally signed version of the script file. The analyst can then perform the analysis: :: import crandas as cd import crandas.script # The analyst key must be loaded to be able to execute the query # If `cd.connect()` was called, setting the key must be done this call. cd.base.session.analyst_key = path_to_analyst_key # The following line was changed to load the script script = cd.script.load("tall-shapes.approved") shapes = cd.DataFrame({ "id": [1, 2, 3], "length": [32, 86, 21], "height": [41, 66, 11], }, auto_bounds=True) tall_shape_filter = shapes["height"] > shapes["length"] tall_shapes = shapes[tall_shape_filter] print(tall_shapes.open()) # This line was also changed script.close() Since the analysis is identical to the recorded script, except for the two ``script`` commands, it will match the authorization and execute in the authorized mode. Using Jupyter ^^^^^^^^^^^^^ The notebook should be modified to connect to the engine in authorized mode by using the appropriate connection file. Instead of having ``cd.script.record()`` on top, the analyst inserts ``cd.script.load(approved_script_filename)``. Then, at the bottom they replace ``cd.script.save()`` by ``cd.script.close()``. They also ensure they load their *analyst key*, by inserting ``cd.base.session.analyst_key = path_to_analyst_key`` at the top of the script (if :meth:`crandas.base.connect` was called, this line must be added after it). After the notebook is set up, they use ``Cells -> Run all`` to execute their notebook from top to bottom. As mentioned before, it is essential that the cells are executed in **exactly the same order** as they were when recording the script.