First steps¶
Once you have installed the crandas environment as explained in Installation (on-premise) you can start using it. To start, we import the pandas
and crandas
modules:
import crandas as cd
import pandas as pd
If you’re experiencing issues with crandas or just want to confirm that you’re using the latest version, it’s useful to know how to check which version of crandas you’re running. This is done with the following:
from crandas.base import session
session.version
Uploading data¶
Uploading data can be done either through the web portal or directly through crandas.
With crandas, one easy way is to import a CSV file directly:
demo_table = cd.read_csv("/gs_data/demo_table.csv")
We use the read_csv()
function in order to read data that is stored as a CSV file.
Alternatively, you can import a pandas DataFrame
with the upload_pandas_dataframe()
method.
df = pd.DataFrame({
"Name": ["Alice", "Bob", "Eve"],
"Trusted": ["Yes", "Yes", "No"],
"Age": [52, 24, 39],
})
table = cd.upload_pandas_dataframe(df, auto_bounds=True)
Using this method, we can upload a wide variety of files, as pandas supports many different file formats (csv, excel, sql, json…) and each can be read in using read_* (where * is the file type). Therefore, as long as the data can be read into a pandas DataFrame
, it can be uploaded to the Virtual Data Lake using upload_pandas_dataframe()
.
Every table uploaded to the Virtual Data Lake has a unique handle. For example, to see the handle of the table uploaded just now, we can do:
>>> table
Name: 9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E
Size: 3 rows x 3 columns
CIndex([Col("Name", "s", 6), Col("Trusted", "s", 4), Col("Age", "i", 1)])
The Name
field is the handle of the table.
To access a table that was uploaded through the web portal, copy its handle from the portal, and then use the get_table()
function:
>>> table = cd.get_table("9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E")
>>> table
Name: 9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E
Size: 3 rows x 3 columns
CIndex([Col("Name", "s", 6), Col("Trusted", "s", 4), Col("Age", "i", 1)])
When uploading data through crandas, we can attach a name instead of a handle as well:
named_table = cd.upload_pandas_dataframe(df, name="demo")
The name “demo” now refers to this new table, even when a table with the name “demo” previously existed.
To access a named table, we also use the get_table()
function:
>>> cd.get_table("demo")
Name: 9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E
Size: 3 rows x 3 columns
CIndex([Col("Name", "s", 6), Col("Trusted", "s", 4), Col("Age", "i", 1)])
After uploading data, you can start your analysis.
Trying to access the data¶
crandas is a system where data can be shared confidentially, so what happens when someone has access to a table that you uploaded? Can they simply print it to see all the values? The answer is no. Trying to simply print the table only outputs a prefix of the unique ID of the table (a hexadecimal number) and the dimensions (number of rows and columns) of the table:
>>> print(table)
4BD87DAB1330328783D4AB9C59D8A4F6...[3x3]
This is as much information as we are willing to leak.
Selecting and manipulating data¶
Once you have your table table
, you can select and manipulate its rows and columns. You can access a column by its name:
ages = table["Age"]
As mentioned, you cannot access the values in ages
directly, but you can perform computation on them.
If you want to know the average age or the sum of the ages of the patients in the health datasets, you may calculate the sum and mean as follows:
>>> ages.sum()
115
>>> ages.mean()
38.333333333333336
These calculations were produced using the table we defined in the previous section. There are various statistics available that can be applied to any columns containing numerical data.
While you cannot simply add a new column to a CDataFrame
, you may use the assign
method to calculate new columns from existing table values.
Say you wanted to have a column where you took all the names and made them lowercase:
table = table.assign(lower_name=table["Name"].lower())
In order to confirm that the function has worked as you expected, we can use the open()
function belonging to CDataFrame
.
As the name suggests, this function allows the user to open the values of the table, so this function should be used with caution as it will reveal secret data.
Here we use it simply as an example to illustrate what you have done.
In production environments, this can only be done upon approval by all data owners.
>>> table.open()
Name Trusted Age lower_name
0 Alice Yes 52 alice
1 Bob Yes 24 bob
2 Eve No 39 eve
You can see that your table now has a column with the names in lower case.
More interestingly, you can make computations with the values in your tables. To show this, we’ll create a new table with more numerical values.
shapes = cd.DataFrame({
"id": [1, 2, 3],
"length": [32, 86, 21],
"height": [41, 66, 11],
}, auto_bounds=True)
Now, we can find the area of each shape by multiplying the length
and height
columns into a new column area
.
shapes = shapes.assign(area=shapes["length"]*shapes["height"])
>>> shapes.open()
id length height area
0 1 32 41 1312
1 2 86 66 5676
2 3 21 11 231
We can also perform comparisons.
Suppose that there is a minimum amount of space that we can assign a shape. Let’s say that is 1000.
If a shape is smaller than that, we still need to assign 1000, if it is bigger we assign its area.
We can do this by using the CSeries.if_else()
function:
>>> shapes.assign(space=(shapes["area"] > 1000).if_else(shapes["area"], 1000)).open()
id length height area space
0 1 32 41 1312 1312
1 2 86 66 5676 5676
2 3 21 11 231 1000
To select numbered rows of the table, we can use python’s slice notation. For example, to select the first two rows:
>>> shapes[:2].open()
id length height area
0 1 32 41 1312
1 2 86 66 5676
The syntax is [start:stop]
, where start
and stop
are indices of the rows. This selects the rows with indices from start
to stop
, excluding stop
, so start, start + 1, start + 2, ..., stop - 1
. If start
is omitted, it is 0, and if stop
is omitted, it is the length of the table.
We can also supply a step size, with the [start:stop:step]
syntax. For example, to get every other element:
>>> shapes[::2].open()
id length height area
0 1 32 41 1312
1 3 21 11 231
Alternatively, you might want to select rows based on the value of certain columns. You can do this by creating a filter. Let’s say we want to select only “tall” shapes (shapes that are taller than they are long). We can do the following:
tall_shape_filter = shapes["height"] > shapes["length"]
tall_shapes = shapes[tall_shape_filter]
>>> tall_shapes.open()
id length height area
0 1 32 41 1312
You can now take a table, perform computations over it and then select certain records based on the data on it. This is already enough to do many computations on the data. However, crandas is a tool made for data sharing. In the next section you will learn how to combine different data sources to do exactly that.
Working with thresholds¶
The tools that you have learned to use provide us with many ways to interact with the data.
As we have seen, operations like sum
and mean
reveal data about a column.
To avoid revealing data on individual records, many functions like CDataFrame.filter()
have a threshold
parameter that prevents certain information being output if it does not have enough rows of data.
Organizations may choose what threshold is acceptable, and queries of their data will have to respect it, as seen here:
>>> tall_shapes = shapes[tall_shape_filter.with_threshold(2)]
ServerError: Table cannot be filtered: remaining number of items would be below specified threshold (error code: ServerErrorCode.CLIENT_ERROR_THRESHOLD_NOT_MET)