.. _salesparty1:

Joining data - [Party 1]
===================================

In this tutorial, we will walk through a step-by-step process of analyzing sales and nutrition data using the crandas library.

Step 1: Start the VDL
~~~~~~~~~~~~~~~~~~~~~~~

First, we import the necessary libraries, including crandas (``cd``), pandas (``pd``) and other necessary packages.

.. code:: python

    import crandas as cd
    import pandas as pd
    import plotly.express as px

Next, we update the base path to where the VDL (Virtual Data Lake) certificates are stored, and set the VDL endpoint.

.. code:: python
    
    force_reupload = False
    
    # On a jupyter environment provided by RosemanLabs, session variables are set automatically in the background
    # Set the session base_path and session.endpoint manually when executing this notebook in any other environment
    # Uncomment the following 4 commented lines and change their value to set these session variables
    
    # from crandas.base import session
    # from pathlib import Path
    
    # session.base_path = Path('base/path/to/vdl/secrets') 
    # session.endpoint = 'https://localhost:9820/api/v1'
    

Step 2: Input data
~~~~~~~~~~~~~~~~~~~~

We set the number of rows to read for each dataset and define the table name for the food sales data.

.. code:: python

    #Set this to limit the amount of data
    rows_per_dataset = 1000 
    food_sales_table_name = 'food_sales'

We then select the relevant columns that we need for our analysis.

.. code:: python

    # Select relevant columns
    relevant_columns = ['Year',
                        'ProductNumber',
                        'ProductName',
                        'EAN_Number',
                        'QuantitySold']

Read the local file and upload it
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We read the food sales data from a local CSV file and limit the number of rows to our specified value (1000). After that, we upload the data to the VDL using :func:`cd.upload_pandas_dataframe`

.. code:: python

    # Read in the csv to pandas
    food_sales_data = pd.read_csv('../../data/sales_nutrition_data/food_sales_data.csv', nrows=rows_per_dataset)

    # Upload the pandas dataframe to the VDL
    food_sales_table = cd.upload_pandas_dataframe(food_sales_data[relevant_columns], name=food_sales_table_name) 
    
    # Show metadata for the table (column titles and field types, i=integer, s=string)
>>> print("Table meta-data:", repr(food_sales_table))
    Table meta-data: Name: 4D172B6A3F7412FE967F17D1B5BB98755EAB558B81EB76137849AB20E0F2C6F4
    Size: 1000 rows x 5 columns
    CIndex([Col("Year", "i", 1), Col("ProductNumber", "i", 1), Col("ProductName", "s", 18), Col("EAN_Number", "s", 7), Col("QuantitySold", "i", 1)])

Step 3: Access the Other Party's Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. important::
    Before this step, it is important to execute the script :ref:`salesparty2` and get the table handle to paste in the following code.

Now we can access and connect to the second table containing nutrition data. 

.. code:: python

    # Show that we can access and connect to the second table
    nutrition_table_handle = 'NUTRITION_TABLE_HANDLE_GOES_HERE'
    
    # Access the table
    nutrition_table = cd.get_table(nutrition_table_handle)


Step 4: Join tables
~~~~~~~~~~~~~~~~~~~
We perform an inner join on the food sales table and the nutrition table using the EAN code as the key, and account for the fact that they have different column names in each table. 

.. code:: python

    #Join the two datasets using an inner join on EAN code 
    joined_table = cd.merge(food_sales_table, nutrition_table, how="inner", left_on='EAN_Number', right_on='ean_code')


Step 5: Run the analyses
~~~~~~~~~~~~~~~~~~~~~~~~

We compute the average number of products sold and the number of products with zero sales for some simple data quality checks. 

.. code:: python

    avg_sales = joined_table['QuantitySold'].mean()
    zero_sales = sum(joined_table['QuantitySold'] == 0)
    
>>> print('Average number of products sold:', "%.0f" % avg_sales)
    Average number of products sold: 23909

>>> print('Number of products with zero sales:', zero_sales)
    Number of products with zero sales: 2

In the next step, we define a list of labels that represent different nutritional elements in the datasets.

.. code:: python

    labels_list = ['ENERGY_VALUE_IN_KJ',
                    'ENERGY_VALUE_IN_KCAL',
                    'SODIUM_IN_MG',
                    'SATURATED_FATTY_ACIDS_IN_G',
                    'TOTAL_PROTEIN_IN_G',
                    'MONO_AND_DISACCHARIDES_IN_G',
                    'DIETARY_FIBER_IN_G']


Now, we compute the total nutritional values for each product by multiplying the nutritional element columns by the number of items sold (``AantalVerkocht``) and store these values in new columns with the suffix ``_tot``.

.. code:: python

    # For each label (k) listed above, create a "totals" column that multiplies column k times "AantalVerkocht"
    merged = joined_table.assign(**{k + "_tot":(joined_table[k] * joined_table['QuantitySold']).astype("int24") for k in labels_list})
    
After this, we set the year to 2019, define a new list of labels without ``ENERGY_VALUE_IN_KJ`` (calories in kilojoules), and create a dictionary ``coicop`` to map product numbers to their names. We also create an empty list named ``table`` to store the data.

.. code:: python

    year = 2019
    
    labels_list = ['ENERGY_VALUE_IN_KCAL',
                'SODIUM_IN_MG',
                'SATURATED_FATTY_ACIDS_IN_G',
                'TOTAL_PROTEIN_IN_G',
                'MONO_AND_DISACCHARIDES_IN_G',
                'DIETARY_FIBER_IN_G']
    
    coicop = {1113: 'Bread',
          1117: 'Breakfast cereals',
          1175: 'Chips',
          1183: 'Chocolate',
          1222: 'Soft drinks',
          1184: 'Candy'}
    
    table = []

In this loop, we iterate through the ``coicop`` dictionary and ``labels_list``, calculating the total nutritional values for each product and nutritional element in the specified year. We then append this data to the ``table`` list and create a ``DataFrame`` ``df`` with the appropriate column names.

.. code:: python  

    for key,val in coicop.items():
        for label in labels_list:
            sub = []
            sub.append(key)
            sub.append(val)
            sub.append(label)
            sub.append(sum(merged[(merged["ProductNumber"]==key) & (merged["Year"]==year)][label+"_tot"]))
            table.append(sub)
        
    df = pd.DataFrame(table, columns =['ProductNumber', 'ProductName', 'Element', 'Totaal'])
    df.head()

Finally, we create a bar chart using the data from the ``df`` ``DataFrame``. The x-axis represents the nutritional elements, the y-axis represents the total nutritional value, and the bars are colored by product names. The chart's title displays the year of analysis (2019).

.. code:: python

    fig = px.bar(df, x="Element", y="Totaal", color="ProductName", text_auto=False, title='Total intake for: '+str(year), height=800)
    
    fig.show()
    
.. image:: 1st-party_files/1st-party_20_0.png