Data structures in crandas

In crandas, there are a number of data structures that you need to know in order to get started. The two fundamental ones are the CSeries and the CDataFrame. These are very similar to the Series and DataFrame objects in pandas, so they may already be familiar to you.

A CDataFrame represents a table and is made out of CSeries that represent the columns. They work similarly to the pandas DataFrame and Series respectively, although they are structurally very different. These classes allow you to interact with the tables that are stored in the Virtual Data Lake, but do not actually contain any data (in fact, they’re not even actual tables!). Their values are remotely stored in secret-shared form across a number of servers.
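To make "secret-shared form" a little more concrete, here is a minimal sketch of additive secret sharing in plain Python. It is only an illustration of the general idea: the text above does not say which sharing scheme crandas uses, and the modulus, the number of servers, and the helper names `share` and `reconstruct` are all assumptions made for this example.

```python
import secrets

# Assumptions for illustration only: a 64-bit modulus and three servers.
MODULUS = 2**64
NUM_SERVERS = 3


def share(value, n=NUM_SERVERS, modulus=MODULUS):
    """Split an integer into n random-looking shares that sum to value (mod modulus)."""
    shares = [secrets.randbelow(modulus) for _ in range(n - 1)]
    # The last share is chosen so that all shares add up to the original value.
    shares.append((value - sum(shares)) % modulus)
    return shares


def reconstruct(shares, modulus=MODULUS):
    """Recombine all shares of one value into the original integer."""
    return sum(shares) % modulus


column = [1, 2]

# One list of shares per value in the column.
per_value_shares = [share(v) for v in column]

# Transpose: server i holds the i-th share of every value, and nothing else.
server_views = list(zip(*per_value_shares))

# Only by combining the shares from all servers can the column be recovered.
recovered = [reconstruct(s) for s in per_value_shares]
```

In this sketch no single server view reveals the column. The client-side object only needs to remember which remote table it refers to, not the values themselves.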

DataFrames and Series

CDataFrame

A CDataFrame represents a table and allows you to interact with the table stored in the VDL as if it were a Python object. In most cases, you can treat it as if it were simply a table. For example, if we have the name of a column, we can use standard bracket notation to access it:

import crandas as cd

df = cd.DataFrame({'A': [1, 2], 'B': [3, 4]})
colA = df['A']

There are some important differences, though. As expected, printing a CDataFrame does not print the table itself, only some of the metadata that crandas needs to keep track of the table:

>>> print(df)
7C7621642C15C81A52D6DA9562E3C5F8...[2x2]

Note

Due to its structure, a CDataFrame does not have an index.

CSeries

A CSeries is an object representing either a column or a function applied to one or more table columns. Manipulating a CSeries directly is less straightforward than manipulating a CDataFrame. crandas processes operations lazily, only communicating with the server when an output is needed. The object that a CSeries represents can be accessed through CSeries.as_table().

df = cd.DataFrame({'A': [1, 2], 'B': [3, 4]})

series1 = df["A"]
series2 = df["A"] + 1
>>> print(series1)
    <crandas.crandas.CSeriesColRef object at 0x7fa2cc1cec80>
>>> print(series2)
    <crandas.crandas.CSeriesFun object at 0x7fa2aa4505e0>
>>> print(series1.as_table().open())
        0  1
        1  2
>>> print(series2.as_table(column_name="A+1").open())
        A+1
    0    2
    1    3

The names CSeriesFun and CSeriesColRef refer to types of CSeries, the former referring to a function and the latter to a column reference.
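The distinction between a column reference and a function, together with the lazy behaviour described above, can be sketched with a small toy model in plain Python. This is not how crandas is implemented: the classes `LazyColRef` and `LazyFun`, the `evaluate` method, and the local dictionary standing in for the remote table are all hypothetical names introduced for illustration.

```python
class LazyColRef:
    """Toy stand-in for a column reference: it stores a column name, not data."""

    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        # Adding a constant does not compute anything; it builds a new lazy object.
        return LazyFun(lambda value: value + other, self)

    def evaluate(self, table):
        # Data is only touched when a result is explicitly requested.
        return list(table[self.name])


class LazyFun:
    """Toy stand-in for a function applied to a column (or to another function)."""

    def __init__(self, func, source):
        self.func = func
        self.source = source

    def __add__(self, other):
        return LazyFun(lambda value: value + other, self)

    def evaluate(self, table):
        return [self.func(value) for value in self.source.evaluate(table)]


# A local dictionary plays the role of the remotely stored table.
table = {"A": [1, 2], "B": [3, 4]}

series1 = LazyColRef("A")   # column reference
series2 = series1 + 1       # function of a column; nothing is computed yet

result1 = series1.evaluate(table)
result2 = series2.evaluate(table)
```

In this toy model, `evaluate` plays a role loosely comparable to requesting an output from a CSeries: until it is called, `series2` is only a description of the computation.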

Columns

When accessing a column of a CDataFrame, we receive a CSeriesColRef. This crandas object contains additional information, such as the index of the column within its table and the type of the column. These are accessed through CSeriesColRef.ix() and CSeriesColRef.type() respectively.

>>> df["B"].ix() # Show the index of column "B".
1
>>> df["B"].type()
Col("B", "i", 1)

Here, the i in Col("B", "i", 1) indicates that the column contains integer values. For more information on this, see Working with numeric data.

In the next section, we will see how to do simple tabular manipulations of our CDataFrames.