Joining data - [Party 2]¶
In this tutorial, we will demonstrate how the second party in a collaboration would upload their data to the engine analysis to be performed by Party 1.
Step 1: Start the Roseman Labs Engine¶
In this step, we import the necessary libraries, including crandas and pandas.
import crandas as cd
import pandas as pd
from pathlib import Path
Step 2: Input data¶
In this step, we define some parameters to limit the amount of data and set a table name as a reference for the other party (same as for Party 1). We also specify the relevant columns that we will use in our analysis.
#Set this to limit the amount of data
rows_per_dataset = 1000
# Use this name as a reference for the other party
nutrition_table_name = 'nutrition'
# Select relevant columns
relevant_columns = ['ean_code',
'SPP_DESCRIPTION',
'ENERGY_VALUE_IN_KJ',
'ENERGY_VALUE_IN_KCAL',
'SODIUM_IN_MG',
'SATURATED_FATTY_ACIDS_IN_G',
'TOTAL_PROTEIN_IN_G',
'MONO_AND_DISACCHARIDES_IN_G',
'DIETARY_FIBER_IN_G',
'STANDARD_PORTION_SIZE',
'NUMBER_OF_PORTIONS_PER_PACKAGE']
Read the local file and upload it:¶
Now, we read the local CSV file containing the nutrition data using pandas and limit the number of rows based on the rows_per_dataset we set earlier. We then convert the ean_code
column to a string data type.
file_path = '../../data/sales_nutrition_data/nutrition_data.csv'
# Read the local csv using pandas
nutrition_table_data = pd.read_csv(file_path, nrows=rows_per_dataset)
# Upload nutrition data to the engine
nutrition_table = cd.upload_pandas_dataframe(nutrition_table_data[relevant_columns])
# show metadata for the table (column titles and field types, i=integer, s=string)
print("Table meta-data:\n", repr(nutrition_table))
Reading data...
Uploading data...
Table meta-data:
Handle: 94EDD8E55C616AEB5C734AE212064DA5DFB4D59018BD8DCC8DC44C3F71BF610F
Size: 1000 rows x 11 columns
CIndex([Col("ean_code", "s", 14), Col("SPP_DESCRIPTION", "s", 14), Col("ENERGY_VALUE_IN_KJ", "i", 1), Col("ENERGY_VALUE_IN_KCAL", "i", 1), Col("SODIUM_IN_MG", "i", 1), Col("SATURATED_FATTY_ACIDS_IN_G", "i", 1), Col("TOTAL_PROTEIN_IN_G", "i", 1), Col("MONO_AND_DISACCHARIDES_IN_G", "i", 1), Col("DIETARY_FIBER_IN_G", "i", 1), Col("STANDARD_PORTION_SIZE", "i", 1), Col("NUMBER_OF_PORTIONS_PER_PACKAGE", "i", 1)])
Important
When executing this script, the table handle will be different. Copy it and paste it in the right place in the Party 1 script.
Now our data is in the database, ready to be used by another party by simply retrieving it by name.