Authorization: Design¶
In this script, we merge and analyze the data from 2 pharmaceutical companies. This tutorial shows the process of creating a script that has to be approved by the approvers. At the end of this script, we will have a file that must be signed by the participating parties before the same analysis can be executed over “real” data.
Important
This script must be run in the design environment
While we perform data merging, compute conditional values and visualize
the results using bar plots, the main goal of this tutorial is script
recording. We will learn to use the record
function to record the
script for approval. This function creates a JSON that can be uploaded
to the portal.
#load libraries
import crandas as cd
from pathlib import Path
import plotly.express as px
In order to record a script, we need to create a table with the same structure as the real data (i.e. the data that will be used in production). This dummy data will be used in the script recording as a ‘stand-in’ for the actual data. In order to do this for another party’s data, the data owner must have communicated the data structure. Alternatively, the analyst could obtain this information from the web portal data source page.
The variable dummy_for
is used to link the dummy data in the design
environment with the real data in the production environment.
Note: Both dummy datasets are uploaded to the web portal prior to the recording
# Insert the handles of each table from the web portal
prod_table1_handle = 'INSERT_TABLE_1_HANDLE_HERE'
prod_table2_handle = 'INSERT_TABLE_2_HANDLE_HERE'
# Upload dummy data for party 1
party1_data = cd.read_csv('../../data/auth_data/party1_dummy.csv', auto_bounds=True, dummy_for=prod_table1_handle)
# Don't forget to change the handle above
We also need to create dummy data for the table belonging to the other
party. In this case, we assume that the other party provided a dummy
data file party2_dummy.csv
. When uploading this data, we also have
to specify the handle of the real table that this table represents
using dummy_for
.
# Upload dummy data for party 2
party2_data = cd.read_csv('../../data/auth_data/party2_dummy.csv', auto_bounds=True, dummy_for=prod_table2_handle)
# Don't forget to change the handle above
Start the script recording¶
script = cd.script.record()
Now that the recording has started we need to access the production tables using their handles (for the recording it will use the dummy dataframe/csv we uploaded above).
# Add the handle of Party 1's data
party1_table = cd.get_table(prod_table1_handle)
Here we use the handle from the data source that Party 2 uploaded to the web portal.
# Add the handle of Party 2's data
party2_table = cd.get_table(prod_table2_handle)
Join tables¶
Here we do a simple join that we will query afterwards. Note that a join requires the keys they are being joined on to be identical.
merged = cd.merge(party1_table, party2_table, how='inner', on=['year', 'month', 'day', 'article_nr', 'batch_nr', 'smokes'])
repr(merged)
Retrieve statistics¶
merged['condition_y'].mean()
merged[(merged["smokes"]==1)]["condition_y"].mean()
def compute_sum(month):
try:
result = merged[((merged["condition_y"]==1) & (merged['month']==month)).with_threshold(3)]["condition_y"].sum()
return result # computes means only if there are 3 cases for the month
except:
return None # if not, no output is given
dic={"Month": ['Jan', 'May', 'September'], "Sum of Asthma": [
compute_sum(1),compute_sum(5),compute_sum(9)]}
# Data visualization is done with data that is already in the clear, so it does not need to be in the script.
# But it can
plot=px.bar(dic, x="Month", y="Sum of Asthma")
plot.show()
Finally we save the script (everything from script.record()
up to
here) as a json file for you to download and upload to the web portal
for approval.
script = cd.script.save('analysis.json')
After getting approval for the script, it can be executed with production data. In order to do this, use the Production script in the production environment.