.. _salesparty2:

Joining data - [Party 2]
===================================

In this tutorial, we will demonstrate how the second party in a collaboration would upload their data to the VDL analysis to be performed by :ref:`Party 1<salesparty1>`.

Step 1: Start the VDL
~~~~~~~~~~~~~~~~~~~~~ 

In this step, we import the necessary libraries, including crandas and pandas.

.. code:: python

    import crandas as cd
    import pandas as pd
    from pathlib import Path

If you aren't working in an RL-hosted jupyter environment, then set up the VDL session by specifying the location of the VDL certificates and the VDL endpoint URL. These certificates are essential to authenticate the connection between our code and the VDL.
This is **not needed** if you are using a jupyter environment hosted by us (see below).  

.. code:: python

    
    force_reupload = False
    
    # On a jupyter environment provided by RosemanLabs, session variables are set automatically in the background
    # Set the session base_path and session.endpoint manually when executing this notebook in any other environment
    # Uncomment the following 4 commented lines and change their value to set these session variables
    
    # from crandas.base import session
    # from pathlib import Path
    
    # session.base_path = Path('base/path/to/vdl/secrets') 
    # session.endpoint = 'https://localhost:9820/api/v1'
    

Step 2: Input data
~~~~~~~~~~~~~~~~~~

In this step, we define some parameters to limit the amount of data and set a table name as a reference for the other party (same as for Party 1). We also specify the relevant columns that we will use in our analysis.

.. code:: python

    #Set this to limit the amount of data
    rows_per_dataset = 1000
    # Use this name as a reference for the other party 
    nutrition_table_name = 'nutrition'

.. code:: python

    # Select relevant columns
    relevant_columns = ['ean_code',
                        'SPP_DESCRIPTION',
                        'ENERGY_VALUE_IN_KJ',
                        'ENERGY_VALUE_IN_KCAL',
                        'SODIUM_IN_MG',
                        'SATURATED_FATTY_ACIDS_IN_G',
                        'TOTAL_PROTEIN_IN_G',
                        'MONO_AND_DISACCHARIDES_IN_G',
                        'DIETARY_FIBER_IN_G',
                        'STANDARD_PORTION_SIZE',
                        'NUMBER_OF_PORTIONS_PER_PACKAGE']

Read the local file and upload it:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now, we read the local CSV file containing the nutrition data using pandas and limit the number of rows based on the rows_per_dataset we set earlier. We then convert the ``ean_code`` column to a string data type.

.. code:: python

    file_path = '../../data/sales_nutrition_data/nutrition_data.csv'
    
    # Read the local csv using pandas
    nutrition_table_data = pd.read_csv(file_path, nrows=rows_per_dataset)
    
    # Upload nutrition data to the VDL
    nutrition_table = cd.upload_pandas_dataframe(nutrition_table_data[relevant_columns]) 
        
    # show metadata for the table (column titles and field types, i=integer, s=string)
    print("Table meta-data:\n", repr(nutrition_table))

.. parsed-literal::

    Reading data...
    Uploading data...
    Table meta-data:
    Handle: 94EDD8E55C616AEB5C734AE212064DA5DFB4D59018BD8DCC8DC44C3F71BF610F
    Size: 1000 rows x 11 columns
    CIndex([Col("ean_code", "s", 14), Col("SPP_DESCRIPTION", "s", 14), Col("ENERGY_VALUE_IN_KJ", "i", 1), Col("ENERGY_VALUE_IN_KCAL", "i", 1), Col("SODIUM_IN_MG", "i", 1), Col("SATURATED_FATTY_ACIDS_IN_G", "i", 1), Col("TOTAL_PROTEIN_IN_G", "i", 1), Col("MONO_AND_DISACCHARIDES_IN_G", "i", 1), Col("DIETARY_FIBER_IN_G", "i", 1), Col("STANDARD_PORTION_SIZE", "i", 1), Col("NUMBER_OF_PORTIONS_PER_PACKAGE", "i", 1)])

.. important::
    
    When executing this script, the table handle will be different. Copy it and paste it in the right place in the :ref:`Party 1<salesparty1>` script.

Now our data is in the database, ready to be used by another party by simply retrieving it by name.