Authorization: Design¶

In this script, we merge and analyze the data from 2 pharmaceutical companies. This tutorial shows the process of creating a script that has to be approved by the approvers. At the end of this script, we will have a file that must be signed by the participating parties before the same analysis can be executed over “real” data.

Important

This script must be run in the design environment

While we perform data merging, compute conditional values and visualize the results using bar plots, the main goal of this tutorial is script recording. We will learn to use the record function to record the script for approval. This function creates a JSON that can be uploaded to the portal.

#load libraries
import crandas as cd
from pathlib import Path
import plotly.express as px

In order to record a script, we need to create a table with the same structure as the real data (i.e. the data that will be used in production). This dummy data will be used in the script recording as a ‘stand-in’ for the actual data. In order to do this for another party’s data, the data owner must have communicated the data structure. Alternatively, the analyst could obtain this information from the web portal data source page.

The variable dummy_for is used to link the dummy data in the design environment with the real data in the production environment.

Note: Both dummy datasets are uploaded to the web portal prior to the recording

# Insert the handles of each table from the web portal
prod_table1_handle = 'INSERT_TABLE_1_HANDLE_HERE'
prod_table2_handle = 'INSERT_TABLE_2_HANDLE_HERE'

# Upload dummy data for party 1
party1_data = cd.read_csv('../../data/auth_data/party1_dummy.csv', auto_bounds=True, dummy_for=prod_table1_handle)
# Don't forget to change the handle above

We also need to create dummy data for the table belonging to the other party. In this case, we assume that the other party provided a dummy data file party2_dummy.csv. When uploading this data, we also have to specify the handle of the real table that this table represents using dummy_for.

# Upload dummy data for party 2
party2_data = cd.read_csv('../../data/auth_data/party2_dummy.csv', auto_bounds=True, dummy_for=prod_table2_handle)
# Don't forget to change the handle above

Start the script recording¶

script = cd.script.record()

Now that the recording has started we need to access the production tables using their handles (for the recording it will use the dummy dataframe/csv we uploaded above).

# Add the handle of Party 1's data
party1_table = cd.get_table(prod_table1_handle)

Here we use the handle from the data source that Party 2 uploaded to the web portal.

# Add the handle of Party 2's data
party2_table = cd.get_table(prod_table2_handle)

Join tables¶

Here we do a simple join that we will query afterwards. Note that a join requires the keys they are being joined on to be identical.

merged = cd.merge(party1_table, party2_table, how='inner', on=['year', 'month', 'day', 'article_nr', 'batch_nr', 'smokes'])
repr(merged)

Retrieve statistics¶

merged['condition_y'].mean()

merged[(merged["smokes"]==1)]["condition_y"].mean()

def compute_sum(month):
    try:
        result = merged[((merged["condition_y"]==1) & (merged['month']==month)).with_threshold(3)]["condition_y"].sum()
        return result             # computes means only if there are 3 cases for the month
    except:
        return None                # if not, no output is given

dic={"Month": ['Jan', 'May', 'September'], "Sum of Asthma": [
     compute_sum(1),compute_sum(5),compute_sum(9)]}

# Data visualization is done with data that is already in the clear, so it does not need to be in the script.
# But it can
plot=px.bar(dic, x="Month", y="Sum of Asthma")
plot.show()

Finally we save the script (everything from script.record() up to here) as a json file for you to download and upload to the web portal for approval.

script = cd.script.save('analysis.json')

After getting approval for the script, it can be executed with production data. In order to do this, use the Production script in the production environment.