Authorization: Design
=====================

In this script, we merge and analyze the data from two pharmaceutical
companies. This tutorial shows the process of creating a script that must be
approved by designated approvers. At the end of this script, we will have a
file that must be signed by the participating parties before the *same*
analysis can be executed over "real" data.

.. important::

   This script must be run in the design environment.

While we perform data merging, compute conditional values and visualize the
results using bar plots, the main goal of this tutorial is script recording.
We will learn to use the ``record`` function to record the script for
approval. This function creates a JSON file that can be uploaded to the web
portal.

.. code:: python

   # Load libraries
   import crandas as cd
   from pathlib import Path
   import plotly.express as px

   # In a Jupyter environment provided by Roseman Labs, session variables are
   # set automatically in the background - unless you wish to test the
   # approval flow within Jupyter (shown below).
   # Set the session base_path and session.endpoint manually when executing
   # this notebook in any other environment:
   # session.endpoint = 'https://localhost:9820/api/v1'

In order to record a script, we need to create a table with the same
structure as the real data (i.e. the data that will be used in production).
This *dummy* data will be used in the script recording as a 'stand-in' for
the actual data. In order to do this for another party's data, the data owner
must have communicated the data structure. Alternatively, the analyst could
obtain this information from the web portal data source page.

The ``dummy_for`` argument is used to link the dummy data in the design
environment to the real data in the production environment.

*Note: both dummy datasets are uploaded to the web portal* **prior** *to the
recording.*

.. code:: python

   # Insert the handles of each table from the web portal
   prod_table1_handle = 'INSERT_TABLE_1_HANDLE_HERE'
   prod_table2_handle = 'INSERT_TABLE_2_HANDLE_HERE'

.. code:: python

   # Upload dummy data for party 1
   # Don't forget to change the handle above
   party1_data = cd.read_csv('../../data/auth_data/party1_dummy.csv',
                             auto_bounds=True, dummy_for=prod_table1_handle)

We also need to create dummy data for the table belonging to the other party.
In this case, we assume that the other party provided a dummy data file
``party2_dummy.csv``. When uploading this data, we again specify the handle
of the *real* table that this table represents using ``dummy_for``.

.. code:: python

   # Upload dummy data for party 2
   # Don't forget to change the handle above
   party2_data = cd.read_csv('../../data/auth_data/party2_dummy.csv',
                             auto_bounds=True, dummy_for=prod_table2_handle)

Start the script recording
--------------------------

.. code:: python

   script = cd.script.record()

Now that the recording has started, we need to access the production tables
using their handles (during the recording, these resolve to the dummy data we
uploaded above).

.. code:: python

   # Add the handle of Party 1's data
   party1_table = cd.get_table(prod_table1_handle)

Here we use the handle from the data source that Party 2 uploaded to the web
portal.

.. code:: python

   # Add the handle of Party 2's data
   party2_table = cd.get_table(prod_table2_handle)
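Before joining, it can help to confirm that both (dummy) stand-in tables
indeed contain the columns we intend to join on. A minimal sanity check is
sketched below; it simply inspects the tables with ``repr``, in the same way
the merged table is inspected later in this tutorial, so the column overview
of the dummy data can be compared against the expected production schema.

.. code:: python

   # Quick sanity check: inspect the dummy stand-ins for both tables so we can
   # verify that the join key columns (year, month, day, article_nr, batch_nr,
   # smokes) are present in each of them before recording the merge.
   repr(party1_table)
   repr(party2_table)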
Join tables
-----------

Here we do a simple join that we will query afterwards. Note that a join
requires the key columns it is joined on to be identical in both tables.

.. code:: python

   merged = cd.merge(party1_table, party2_table, how='inner',
                     on=['year', 'month', 'day', 'article_nr', 'batch_nr', 'smokes'])
   repr(merged)

Retrieve statistics
-------------------

.. code:: python

   merged['condition_y'].mean()

.. code:: python

   merged[(merged["smokes"]==1)]["condition_y"].mean()

.. code:: python

   def compute_sum(month):
       try:
           # Computes the sum only if there are at least 3 cases for the month
           result = merged[((merged["condition_y"]==1) & (merged['month']==month)).with_threshold(3)]["condition_y"].sum()
           return result
       except Exception:
           # If not, no output is given
           return None

   dic = {"Month": ['Jan', 'May', 'September'],
          "Sum of Asthma": [compute_sum(1), compute_sum(5), compute_sum(9)]}

   # Data visualization is done with data that is already in the clear, so it
   # does not need to be part of the recorded script - but it can be.
   plot = px.bar(dic, x="Month", y="Sum of Asthma")
   plot.show()

Finally, we save the script (everything from ``cd.script.record()`` up to
here) as a JSON file for you to download and upload to the web portal for
approval.

.. code:: python

   script = cd.script.save('analysis.json')

After the script has been approved, it can be executed with production data.
In order to do this, use the **Production** script in the production
environment.
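As a rough preview of that production step, the flow there typically mirrors
the recording: instead of starting a new recording, the approved script is
loaded and the exact same analysis steps are replayed against the real
tables. The sketch below is illustrative only; it assumes the approved script
can be loaded with ``cd.script.load`` and has been downloaded locally as
``analysis.approved`` - see the Production tutorial for the exact calls and
file names in your environment.

.. code:: python

   # Illustrative sketch of the production run (see the Production tutorial).
   # Assumptions: cd.script.load is the counterpart of cd.script.record, and
   # 'analysis.approved' is the signed script downloaded from the web portal.
   import crandas as cd

   script = cd.script.load('analysis.approved')   # instead of cd.script.record()

   party1_table = cd.get_table(prod_table1_handle)  # now resolves to real data
   party2_table = cd.get_table(prod_table2_handle)

   # ... followed by the same analysis steps that were recorded above ...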