First steps#

Note

To make uploading production data, generating keys and approving an analysis easier, Roseman Labs provides a web portal. This web portal is occasionally also referred to in this documentation. More information about the web portal and how to use it can be found on our knowledge base.

Once you have installed the crandas environment as explained in Installation (local python) you can start using it. To start, we import the pandas and crandas modules:

import crandas as cd
import pandas as pd

If you’re experiencing issues with crandas or just want to confirm that you’re using the latest version, it’s useful to know how to check which version of crandas you’re running. This is done with the following:

from crandas.base import session

session.version

Uploading data#

Uploading data can be done either through the web portal or directly through crandas. The steps for uploading via the web portal are explained in the knowledge base.

In crandas, there are several ways to upload data. One easy way is to import a CSV file directly:

demo_table = cd.read_csv("/gs_data/demo_table.csv")

We use the read_csv() function in order to read data that is stored as a CSV file.

Alternatively, you can import a pandas DataFrame with the upload_pandas_dataframe() method.

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Eve"],
    "Trusted": ["Yes", "Yes", "No"],
    "Age": [52, 24, 39],
})

table = cd.upload_pandas_dataframe(df, auto_bounds=True)

Using this method, we can upload a wide variety of files, as pandas supports many different file formats (csv, excel, sql, json…) and each can be read in using read_* (where * is the file type). Therefore, as long as the data can be read into a pandas DataFrame, it can be uploaded to the Virtual Data Lake using upload_pandas_dataframe().

Every table uploaded to the Virtual Data Lake has a unique handle. For example, to see the handle of the table uploaded just now, we can do:

>>> table
Name: 9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E
Size: 3 rows x 3 columns
CIndex([Col("Name", "s", 6), Col("Trusted", "s", 4), Col("Age", "i", 1)])

The Name field is the handle of the table.

To access a table that was uploaded through the web portal, copy its handle from the portal, and then use the get_table() function:

>>> table = cd.get_table("9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E")
>>> table
Name: 9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E
Size: 3 rows x 3 columns
CIndex([Col("Name", "s", 6), Col("Trusted", "s", 4), Col("Age", "i", 1)])

When uploading data through crandas, we can attach a name instead of a handle as well:

named_table = cd.upload_pandas_dataframe(df, name="demo")

The name “demo” now refers to this new table, even when a table with the name “demo” previously existed.

To access a named table, we also use the get_table() function:

>>> cd.get_table("demo")
Name: 9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E
Size: 3 rows x 3 columns
CIndex([Col("Name", "s", 6), Col("Trusted", "s", 4), Col("Age", "i", 1)])

After uploading data, you can start your analysis.

Trying to access the data#

crandas is a system where data can be shared confidentially, so what happens when someone has access to a table that you uploaded? Can they simply print it to see all the values? The answer is no. Trying to simply print the table only outputs a prefix of the unique ID of the table (a hexadecimal number) and the dimensions (number of rows and columns) of the table:

>>> print(table)
4BD87DAB1330328783D4AB9C59D8A4F6...[3x3]

This is as much information as we are willing to leak.

Selecting and manipulating data#

Once you have your table table, you can select and manipulate its rows and columns. You can access a column by its name:

ages = table["Age"]

As mentioned, you cannot access the values in ages directly, but you can perform computation on them.

If you want to know the average age or the sum of the ages of the patients in the health datasets, you may calculate the sum and mean as follows:

>>> ages.sum()
115
>>> ages.mean()
38.333333333333336

These calculations were produced using the table we defined in the previous section. There are various statistics available that can be applied to any columns containing numerical data.

While you cannot simply add a new column to a CDataFrame, you may use the assign method to calculate new columns from existing table values. Say you wanted to have a column where you took all the names and made them lowercase:

table = table.assign(lower_name=table["Name"].lower())

In order to confirm that the function has worked as you expected, we can use the open() function belonging to CDataFrame. As the name suggests, this function allows the user to open the values of the table, so this function should be used with caution as it will reveal secret data. Here we use it simply as an example to illustrate what you have done.

Note

In production environments, this can only be done upon approval by all data owners.

>>> table.open()
    Name Trusted  Age lower_name
0  Alice     Yes   52      alice
1    Bob     Yes   24        bob
2    Eve      No   39        eve

You can see that your table now has a column with the names in lower case.

More interestingly, you can make computations with the values in your tables. To show this, we’ll create a new table with more numerical values.

shapes = cd.DataFrame({
    "id": [1, 2, 3],
    "length": [32, 86, 21],
    "height": [41, 66, 11],
}, auto_bounds=True)

Now, we can find the area of each shape by multiplying the length and height columns into a new column area.

shapes = shapes.assign(area=shapes["length"]*shapes["height"])
>>> shapes.open()
   id  length  height  area
0   1      32     41  1312
1   2      86     66  5676
2   3      21     11   231

We can also perform comparisons. Suppose that there is a minimum amount of space that we can assign a shape. Let’s say that is 1000. If a shape is smaller than that, we still need to assign 1000, if it is bigger we assign its area. We can do this by using the CSeries.if_else() function:

>>> shapes.assign(space=(shapes["area"] > 1000).if_else(shapes["area"], 1000)).open()
   id  length  height  area  space
0   1      32     41  1312   1312
1   2      86     66  5676   5676
2   3      21     11   231   1000

To select numbered rows of the table, we can use python’s slice notation. For example, to select the first two rows:

>>> shapes[:2].open()
   id  length  height  area
0   1      32     41  1312
1   2      86     66  5676

The syntax is [start:stop], where start and stop are indices of the rows. This selects the rows with indices from start to stop, excluding stop, so start, start + 1, start + 2, ..., stop - 1. If start is omitted, it is 0, and if stop is omitted, it is the length of the table.

We can also supply a step size, with the [start:stop:step] syntax. For example, to get every other element:

>>> shapes[::2].open()
   id  length  height  area
0   1      32     41  1312
1   3      21     11   231

Alternatively, you might want to select rows based on the value of certain columns. You can do this by creating a filter. Let’s say we want to select only “tall” shapes (shapes that are taller than they are long). We can do the following:

tall_shape_filter = shapes["height"] > shapes["length"]
tall_shapes = shapes[tall_shape_filter]
>>> tall_shapes.open()
   id  length  height  area
0   1      32     41  1312

You can now take a table, perform computations over it and then select certain records based on the data on it. This is already enough to do many computations on the data. However, crandas is a tool made for data sharing. In the next section you will learn how to combine different data sources to do exactly that.