Joining data - [Party 2]¶
In this tutorial, we will demonstrate how the second party in a collaboration would upload their data to the VDL analysis to be performed by Party 1.
Step 1: Start the VDL¶
In this step, we import the necessary libraries, including crandas and pandas.
import crandas as cd
import pandas as pd
from pathlib import Path
If you aren’t working in an RL-hosted jupyter environment, then set up the VDL session by specifying the location of the VDL certificates and the VDL endpoint URL. These certificates are essential to authenticate the connection between our code and the VDL. This is not needed if you are using a jupyter environment hosted by us (see below).
force_reupload = False
# On a jupyter environment provided by RosemanLabs, session variables are set automatically in the background
# Set the session base_path and session.endpoint manually when executing this notebook in any other environment
# Uncomment the following 4 commented lines and change their value to set these session variables
# from crandas.base import session
# from pathlib import Path
# session.base_path = Path('base/path/to/vdl/secrets')
# session.endpoint = 'https://localhost:9820/api/v1'
Step 2: Input data¶
In this step, we define some parameters to limit the amount of data and set a table name as a reference for the other party (same as for Party 1). We also specify the relevant columns that we will use in our analysis.
#Set this to limit the amount of data
rows_per_dataset = 1000
# Use this name as a reference for the other party
nutrition_table_name = 'nutrition'
# Select relevant columns
relevant_columns = ['ean_code',
'SPP_DESCRIPTION',
'ENERGY_VALUE_IN_KJ',
'ENERGY_VALUE_IN_KCAL',
'SODIUM_IN_MG',
'SATURATED_FATTY_ACIDS_IN_G',
'TOTAL_PROTEIN_IN_G',
'MONO_AND_DISACCHARIDES_IN_G',
'DIETARY_FIBER_IN_G',
'STANDARD_PORTION_SIZE',
'NUMBER_OF_PORTIONS_PER_PACKAGE']
Read the local file and upload it:¶
Now, we read the local CSV file containing the nutrition data using pandas and limit the number of rows based on the rows_per_dataset we set earlier. We then convert the ean_code
column to a string data type.
file_path = '../../data/sales_nutrition_data/nutrition_data.csv'
# Read the local csv using pandas
nutrition_table_data = pd.read_csv(file_path, nrows=rows_per_dataset)
# Upload nutrition data to the VDL
nutrition_table = cd.upload_pandas_dataframe(nutrition_table_data[relevant_columns])
# show metadata for the table (column titles and field types, i=integer, s=string)
print("Table meta-data:\n", repr(nutrition_table))
Reading data...
Uploading data...
Table meta-data:
Handle: 94EDD8E55C616AEB5C734AE212064DA5DFB4D59018BD8DCC8DC44C3F71BF610F
Size: 1000 rows x 11 columns
CIndex([Col("ean_code", "s", 14), Col("SPP_DESCRIPTION", "s", 14), Col("ENERGY_VALUE_IN_KJ", "i", 1), Col("ENERGY_VALUE_IN_KCAL", "i", 1), Col("SODIUM_IN_MG", "i", 1), Col("SATURATED_FATTY_ACIDS_IN_G", "i", 1), Col("TOTAL_PROTEIN_IN_G", "i", 1), Col("MONO_AND_DISACCHARIDES_IN_G", "i", 1), Col("DIETARY_FIBER_IN_G", "i", 1), Col("STANDARD_PORTION_SIZE", "i", 1), Col("NUMBER_OF_PORTIONS_PER_PACKAGE", "i", 1)])
Important
When executing this script, the table handle will be different. Copy it and paste it in the right place in the Party 1 script.
Now our data is in the database, ready to be used by another party by simply retrieving it by name.