.. _platform_data_uploads:

Platform data upload
===================================

Recording
---------

Here we show a simple example where we merge and analyze data from two pharmaceutical companies, uploaded via a `data upload `_.

We start by uploading the production datasets to the authorized platform (in practice, this would be done by the actual data providers).

**Dataset 1**

.. image:: images/authorization_recording_0.png

**Dataset 2**

.. image:: images/authorization_recording_1.png

In order to record a script, we need to create a table with the same structure as the real data (i.e. the data that will be used in production). This dummy data will be used in the script recording as a `stand-in` for the actual data that is uploaded to the platform. The structure of both production datasets can be accessed there.

We now log in to the design platform and upload our dummy data there (verify that the data layout is identical to the layout of the production data).

Now that we have uploaded the datasets, we can navigate to the data source view for one of our dummy datasets in the design platform and copy the following code snippet (the values for ``DESIGN_ENV`` and ``DUMMY_HANDLE`` should have been filled in):

.. code:: python

    # copy from design platform:
    DESIGN_ENV = "vdl-design-demo"
    DUMMY_HANDLE = "359BF2DD26A4EE2144B1C4D92223AA9D3A57280B4DE29149FAC2B5430EDFE5ED"

    # copy from authorized platform:
    AUTH_ENV = ""
    ANALYST_KEY = ""
    PROD_HANDLE = ""
    SCHEMA = {''}

    # fill this in yourself (no spaces):
    ANALYSIS_NAME = ""

    import crandas as cd

    cd.connect(DESIGN_ENV)
    cd.get_table(DUMMY_HANDLE, dummy_for=PROD_HANDLE)

    script = cd.script.record(name=ANALYSIS_NAME)

    table = cd.get_table(PROD_HANDLE, schema=SCHEMA)

    ## add any analysis on `table`

    script.save(ANALYSIS_NAME + '.recording')

Since we want to perform an analysis on two datasets, we also add the handle of the other dummy dataset here (which we can copy from the platform):

.. image:: images/authorization_recording_2.png

.. code:: python

    # copy from design platform:
    DESIGN_ENV = "vdl-design-demo"
    DUMMY_HANDLE = "359BF2DD26A4EE2144B1C4D92223AA9D3A57280B4DE29149FAC2B5430EDFE5ED"
    DUMMY_HANDLE_2 = "49B97E4FEFA4E700D781DAACC678719245D3C60FD355F67121B1E3C9A12185CD"

Now we navigate to the authorized platform and select the data source view of the dummy dataset corresponding to dummy table 1.

.. image:: images/authorization_recording_3.png

Here we can find the values for the capitalized environment variables of the authorized platform, which we can copy into the script:

.. code:: python

    # copy from design platform:
    DESIGN_ENV = "vdl-design-demo"
    DUMMY_HANDLE = "359BF2DD26A4EE2144B1C4D92223AA9D3A57280B4DE29149FAC2B5430EDFE5ED"
    DUMMY_HANDLE_2 = "49B97E4FEFA4E700D781DAACC678719245D3C60FD355F67121B1E3C9A12185CD"

    # copy from authorized platform:
    AUTH_ENV = "vdl-demo"
    ANALYST_KEY = "vdl-demo-private-key.sk"
    PROD_HANDLE = "C68A6E217650F072E8BA585C6E3446CFFE38CE9E44E6C9AEDBADBA47CAE1E218"
    SCHEMA = {''}

Now fill in the schema of the dataset, based on the data layout as displayed on the data source view:

.. note::

    Don't forget to add a ``?`` to the data types of nullable columns (see also :ref:`nullable`).

.. image:: images/authorization_recording_4.png

.. code:: python

    # copy from design platform:
    DESIGN_ENV = "vdl-design-demo"
    DUMMY_HANDLE = "359BF2DD26A4EE2144B1C4D92223AA9D3A57280B4DE29149FAC2B5430EDFE5ED"
    DUMMY_HANDLE_2 = "49B97E4FEFA4E700D781DAACC678719245D3C60FD355F67121B1E3C9A12185CD"

    # copy from authorized platform:
    AUTH_ENV = "vdl-demo"
    ANALYST_KEY = "vdl-demo-private-key.sk"
    PROD_HANDLE = "C68A6E217650F072E8BA585C6E3446CFFE38CE9E44E6C9AEDBADBA47CAE1E218"
    SCHEMA = {
        'index': 'int',
        'year': 'int',
        'month': 'int',
        'day': 'int',
        'article_nr': 'int',
        'condition_x': 'int',
        'category': 'varchar',
        'batch_nr': 'int',
        'smokes': 'bool',
        'prescription': 'int',
        'quantity': 'int',
        'packaging': 'int'
    }

Afterwards, add the handle and schema of the second production dataset.
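As a side note on the schema format: each entry maps a column name to a type string, and (per the note above) a nullable column is marked by appending ``?`` to its type. The short sketch below is plain Python with no platform connection; treating ``batch_nr`` as nullable here is our own assumption, purely for illustration:

```python
# Schema entries map column names to type strings; a trailing "?"
# marks the column as nullable (see the note above).
SCHEMA_EXAMPLE = {
    'index': 'int',
    'year': 'int',
    'category': 'varchar',
    'smokes': 'bool',
    'batch_nr': 'int?',  # hypothetical: assume batch_nr may contain nulls
}

# The nullable columns can be identified by their type suffix:
nullable = [col for col, typ in SCHEMA_EXAMPLE.items() if typ.endswith('?')]
print(nullable)  # -> ['batch_nr']
```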
Also, fill in the analysis name.

.. code:: python

    # copy from design platform:
    DESIGN_ENV = "vdl-design-demo"
    DUMMY_HANDLE = "359BF2DD26A4EE2144B1C4D92223AA9D3A57280B4DE29149FAC2B5430EDFE5ED"
    DUMMY_HANDLE_2 = "49B97E4FEFA4E700D781DAACC678719245D3C60FD355F67121B1E3C9A12185CD"

    # copy from authorized platform:
    AUTH_ENV = "vdl-demo"
    ANALYST_KEY = "vdl-demo-private-key.sk"
    PROD_HANDLE = "C68A6E217650F072E8BA585C6E3446CFFE38CE9E44E6C9AEDBADBA47CAE1E218"
    SCHEMA = {
        'index': 'int',
        'year': 'int',
        'month': 'int',
        'day': 'int',
        'article_nr': 'int',
        'condition_x': 'int',
        'category': 'varchar',
        'batch_nr': 'int',
        'smokes': 'bool',
        'prescription': 'int',
        'quantity': 'int',
        'packaging': 'int'
    }
    PROD_HANDLE_2 = "74FE5E767FB36E084BE5B7EFD692C880DF99953CD4BE834C4193713A48A8CD17"
    SCHEMA_2 = {
        'index': 'int',
        'year': 'int',
        'month': 'int',
        'day': 'int',
        'article_nr': 'int',
        'batch_nr': 'int',
        'smokes': 'bool',
        'condition_y': 'bool'
    }

    # fill this in yourself (no spaces):
    ANALYSIS_NAME = "analysis"

Now that we have defined all environment variables and referenced the dummy and production datasets, we can add our actual analysis code to the script. Here, we join the two tables and retrieve some simple statistics:

.. code:: python

    # copy from design platform:
    DESIGN_ENV = "vdl-design-demo"
    DUMMY_HANDLE = "359BF2DD26A4EE2144B1C4D92223AA9D3A57280B4DE29149FAC2B5430EDFE5ED"
    DUMMY_HANDLE_2 = "49B97E4FEFA4E700D781DAACC678719245D3C60FD355F67121B1E3C9A12185CD"

    # copy from authorized platform:
    AUTH_ENV = "vdl-demo"
    ANALYST_KEY = "vdl-demo-private-key.sk"
    PROD_HANDLE = "C68A6E217650F072E8BA585C6E3446CFFE38CE9E44E6C9AEDBADBA47CAE1E218"
    SCHEMA = {
        'index': 'int',
        'year': 'int',
        'month': 'int',
        'day': 'int',
        'article_nr': 'int',
        'condition_x': 'int',
        'category': 'varchar',
        'batch_nr': 'int',
        'smokes': 'bool',
        'prescription': 'int',
        'quantity': 'int',
        'packaging': 'int'
    }
    PROD_HANDLE_2 = "74FE5E767FB36E084BE5B7EFD692C880DF99953CD4BE834C4193713A48A8CD17"
    SCHEMA_2 = {
        'index': 'int',
        'year': 'int',
        'month': 'int',
        'day': 'int',
        'article_nr': 'int',
        'batch_nr': 'int',
        'smokes': 'bool',
        'condition_y': 'bool'
    }

    # fill this in yourself (no spaces):
    ANALYSIS_NAME = "analysis"

    import crandas as cd

    cd.connect(DESIGN_ENV)
    cd.get_table(DUMMY_HANDLE, dummy_for=PROD_HANDLE)      # link dummy table to production handle
    cd.get_table(DUMMY_HANDLE_2, dummy_for=PROD_HANDLE_2)  # link dummy table to production handle

    script = cd.script.record(name=ANALYSIS_NAME)

    ###### BEGIN: COPY TO AUTHORIZED SCRIPT ################
    table = cd.get_table(PROD_HANDLE, schema=SCHEMA)
    table2 = cd.get_table(PROD_HANDLE_2, schema=SCHEMA_2)

    # inner join table and table2 on identical keys
    merged = cd.merge(table, table2, how="inner",
                      on=['year', 'month', 'day', 'article_nr', 'batch_nr', 'smokes'],
                      validate="1:1")
    repr(merged)

    # retrieve statistics
    print('Average occurrence of condition_y: ' + str(merged['condition_y'].mean()))
    smokers = merged.filter(merged['smokes'] == True)
    print('Average occurrence of condition_y for smokers: ' + str(smokers['condition_y'].mean()))

    def compute_sum(month):
        try:
            # compute the sum only if there are at least 3 cases for the month
            result = merged[((merged["condition_y"] == 1) & (merged['month'] == month)).with_threshold(3)]["condition_y"].sum()
            return result
        except Exception:
            return None  # if not, no output is given

    dic = {"Month": ['Jan', 'May', 'September'],
           "Sum of Asthma": [compute_sum(1), compute_sum(5), compute_sum(9)]}

    # Data visualization is done with data that is already in the clear, so it does not need
    # to be within the script recording, but it can.
    import plotly.express as px
    plot = px.bar(dic, x="Month", y="Sum of Asthma")
    plot.show()
    ###### END: COPY TO AUTHORIZED SCRIPT ################

    script.save(ANALYSIS_NAME + '.recording')

This will produce a ``.recording`` file, which we can get approved and run on sensitive production data (see also step 8 in `this `_ article).

Authorized
----------

Once we have downloaded our ``.approved`` file and stored it in the same location as the script, we can use the code below to run the script on the (sensitive) production data. The authorized script differs from the design one in the following ways:

- We connect to the authorized environment instead of the design environment
- We refer to the analyst key in the script, to be able to authorize the analyst when running the approved script
- Instead of ``script.record()`` to record the analysis, we use ``script.load()`` to load the approval file and run it
- Instead of ``script.save()``, we use ``script.reset()`` to close the script

To this end, do the following:

- Replace the capitalized environment variables with those from the design script
- Copy the part between ``#BEGIN`` and ``#END`` into the script at ``COPY FROM DESIGN SCRIPT``

.. code:: python

    # copy from design platform:
    DESIGN_ENV = ""
    DUMMY_HANDLE = ""

    # copy from authorized platform:
    AUTH_ENV = ""
    ANALYST_KEY = ""
    PROD_HANDLE = ""
    SCHEMA = {''}

    # fill this in yourself:
    ANALYSIS_NAME = ""

    import crandas as cd

    cd.connect(AUTH_ENV)
    cd.base.session.analyst_key = ANALYST_KEY

    script = cd.script.load(ANALYSIS_NAME + '.approved')

    ###### COPY FROM DESIGN SCRIPT ##############

    script.reset()

In our example, this will result in the following script:

.. code:: python

    # copy from design platform:
    DESIGN_ENV = "vdl-design-demo"
    DUMMY_HANDLE = "359BF2DD26A4EE2144B1C4D92223AA9D3A57280B4DE29149FAC2B5430EDFE5ED"
    DUMMY_HANDLE_2 = "49B97E4FEFA4E700D781DAACC678719245D3C60FD355F67121B1E3C9A12185CD"

    # copy from authorized platform:
    AUTH_ENV = "vdl-demo"
    ANALYST_KEY = "vdl-demo-private-key.sk"
    PROD_HANDLE = "C68A6E217650F072E8BA585C6E3446CFFE38CE9E44E6C9AEDBADBA47CAE1E218"
    SCHEMA = {
        'index': 'int',
        'year': 'int',
        'month': 'int',
        'day': 'int',
        'article_nr': 'int',
        'condition_x': 'int',
        'category': 'varchar',
        'batch_nr': 'int',
        'smokes': 'bool',
        'prescription': 'int',
        'quantity': 'int',
        'packaging': 'int'
    }
    PROD_HANDLE_2 = "74FE5E767FB36E084BE5B7EFD692C880DF99953CD4BE834C4193713A48A8CD17"
    SCHEMA_2 = {
        'index': 'int',
        'year': 'int',
        'month': 'int',
        'day': 'int',
        'article_nr': 'int',
        'batch_nr': 'int',
        'smokes': 'bool',
        'condition_y': 'bool'
    }

    # fill this in yourself (no spaces):
    ANALYSIS_NAME = "analysis"

    import crandas as cd
    import crandas.script

    cd.base.session.connect(AUTH_ENV)
    cd.base.session.analyst_key = ANALYST_KEY

    script = cd.script.load(ANALYSIS_NAME + '.approved')

    table = cd.get_table(PROD_HANDLE, schema=SCHEMA)
    table2 = cd.get_table(PROD_HANDLE_2, schema=SCHEMA_2)

    # inner join table and table2 on identical keys
    merged = cd.merge(table, table2, how="inner",
                      on=['year', 'month', 'day', 'article_nr', 'batch_nr', 'smokes'],
                      validate="1:1")
    repr(merged)

    # retrieve statistics
    print('Average occurrence of condition_y: ' + str(merged['condition_y'].mean()))
    smokers = merged.filter(merged['smokes'] == True)
    print('Average occurrence of condition_y for smokers: ' + str(smokers['condition_y'].mean()))

    def compute_sum(month):
        try:
            # compute the sum only if there are at least 3 cases for the month
            result = merged[((merged["condition_y"] == 1) & (merged['month'] == month)).with_threshold(3)]["condition_y"].sum()
            return result
        except Exception:
            return None  # if not, no output is given

    dic = {"Month": ['Jan', 'May', 'September'],
           "Sum of Asthma": [compute_sum(1), compute_sum(5), compute_sum(9)]}

    # Data visualization is done with data that is already in the clear, so it does not need
    # to be within the script recording, but it can.
    import plotly.express as px
    plot = px.bar(dic, x="Month", y="Sum of Asthma")
    plot.show()

    script.reset()

We have now executed a script on private data only after receiving the appropriate permissions. Note that running any crandas command that was not part of the original approved script will result in a ``NoMatchingAuthorization`` error.
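The ``with_threshold(3)`` guard in ``compute_sum`` above ensures an aggregate is only revealed when enough matching records exist; the check itself is enforced by the engine, but the control flow can be sketched in plain Python. The records below are made-up illustrative data, and the threshold of 3 mirrors the example:

```python
# Plain-Python sketch of the threshold-guarded aggregation pattern:
# only reveal a sum when at least `threshold` rows match the filter.
rows = [  # hypothetical stand-in records: (month, condition_y)
    (1, 1), (1, 1), (1, 1), (1, 0),  # January: 3 positive cases
    (5, 1), (5, 1),                  # May: only 2 positive cases
]

def compute_sum(month, threshold=3):
    matches = [c for m, c in rows if m == month and c == 1]
    if len(matches) < threshold:
        return None          # too few cases: no output is given
    return sum(matches)      # enough cases: the sum may be revealed

print(compute_sum(1))  # -> 3
print(compute_sum(5))  # -> None
```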