Joining data - [Party 2]#

In this tutorial, we demonstrate how the second party in a collaboration uploads their data to the VDL, so that Party 1 can perform the analysis.

Step 1: Start the VDL#

In this step, we import the necessary libraries, including crandas and pandas.

import crandas as cd
import pandas as pd
from pathlib import Path

If you are not working in a Jupyter environment hosted by Roseman Labs, set up the VDL session by specifying the location of the VDL certificates and the VDL endpoint URL; these certificates authenticate the connection between our code and the VDL. In a hosted Jupyter environment this is not needed, as the session variables are set automatically (see below).

force_reupload = False

# In a Jupyter environment provided by RosemanLabs, session variables are set automatically in the background
# Set session.base_path and session.endpoint manually when executing this notebook in any other environment
# Uncomment the following four commented lines and adjust their values to set these session variables

# from crandas.base import session
# from pathlib import Path

# session.base_path = Path('base/path/to/vdl/secrets')
# session.endpoint = 'https://localhost:9820/api/v1'

Step 2: Input data#

In this step, we define a parameter to limit the amount of data, set a table name that serves as a reference for the other party (the same as for Party 1), and select the columns that are relevant to our analysis.

# Set this to limit the amount of data
rows_per_dataset = 1000
# Use this name as a reference for the other party
nutrition_table_name = 'nutrition'
# Select relevant columns
relevant_columns = ['ean_code',
                    'SPP_DESCRIPTION',
                    'ENERGY_VALUE_IN_KJ',
                    'ENERGY_VALUE_IN_KCAL',
                    'SODIUM_IN_MG',
                    'SATURATED_FATTY_ACIDS_IN_G',
                    'TOTAL_PROTEIN_IN_G',
                    'MONO_AND_DISACCHARIDES_IN_G',
                    'DIETARY_FIBER_IN_G',
                    'STANDARD_PORTION_SIZE',
                    'NUMBER_OF_PORTIONS_PER_PACKAGE']

Read the local file and upload it:#

Now, we read the local CSV file containing the nutrition data using pandas, limiting the number of rows to the rows_per_dataset value we set earlier. We then convert the ean_code column to a string data type and upload the relevant columns to the VDL under the table name we chose.

file_path = '../../data/sales_nutrition_data/nutrition_data.csv'

# Read the local csv using pandas, limited to rows_per_dataset rows
nutrition_table_data = pd.read_csv(file_path, nrows=rows_per_dataset)

# Convert the ean_code column to a string data type
nutrition_table_data['ean_code'] = nutrition_table_data['ean_code'].astype(str)

# Upload nutrition data to the VDL under the agreed table name
nutrition_table = cd.upload_pandas_dataframe(nutrition_table_data[relevant_columns],
                                             name=nutrition_table_name)

# show metadata for the table (column titles and field types, i=integer, s=string)
print("Table meta-data:\n", repr(nutrition_table))
Reading data...
Uploading data...
Table meta-data:
Handle: 94EDD8E55C616AEB5C734AE212064DA5DFB4D59018BD8DCC8DC44C3F71BF610F
Size: 1000 rows x 11 columns
CIndex([Col("ean_code", "s", 14), Col("SPP_DESCRIPTION", "s", 14), Col("ENERGY_VALUE_IN_KJ", "i", 1), Col("ENERGY_VALUE_IN_KCAL", "i", 1), Col("SODIUM_IN_MG", "i", 1), Col("SATURATED_FATTY_ACIDS_IN_G", "i", 1), Col("TOTAL_PROTEIN_IN_G", "i", 1), Col("MONO_AND_DISACCHARIDES_IN_G", "i", 1), Col("DIETARY_FIBER_IN_G", "i", 1), Col("STANDARD_PORTION_SIZE", "i", 1), Col("NUMBER_OF_PORTIONS_PER_PACKAGE", "i", 1)])

Important

Each time this script is executed, a different table handle is generated. Copy it and paste it into the appropriate place in the Party 1 script.

Our data is now in the VDL, ready to be used by the other party, who can retrieve it by the agreed table name or by its handle.
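
As an illustration of the other side of this hand-off, below is a minimal sketch of how Party 1 could retrieve the uploaded table. It assumes a crandas session to the same VDL and uses cd.get_table; the handle value is only the placeholder from the example output above, and retrieval by the agreed name assumes the table was uploaded with the name parameter as shown earlier.

# --- Party 1 side (sketch) ---
import crandas as cd

# Retrieve the table using the handle printed by the Party 2 script
# (replace the placeholder below with the actual handle that was shared)
shared_handle = '94EDD8E55C616AEB5C734AE212064DA5DFB4D59018BD8DCC8DC44C3F71BF610F'
nutrition_table = cd.get_table(shared_handle)

# Alternatively (assuming the table was uploaded with a name), retrieve it by that name
# nutrition_table = cd.get_table('nutrition')

# Inspect the table metadata to confirm the columns arrived as expected
print(repr(nutrition_table))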