Data structures in crandas
In crandas, there are a number of data structures that you need to know in order to get started. The two fundamental ones are CSeries and CDataFrames. These are very similar to the Series and DataFrame objects in pandas, so they may already be familiar to you.
A CDataFrame represents a table and is made out of CSeries that represent the columns. They work similarly to the pandas DataFrame and Series respectively, although they are structurally very different.
These classes allow you to interact with the tables that are stored in the engine, but do not actually contain any data (in fact, they’re not even actual tables!).
Their values are remotely stored in secret-shared form across a number of servers.
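To give a feel for what "secret-shared" means, here is a minimal sketch of additive secret sharing in plain Python. It is only an illustration of the general idea and makes no claim about the scheme the engine actually uses; the modulus, the number of servers, and the helper names share, reconstruct and add_shared are all assumptions made for this example.

```python
import secrets

MODULUS = 2**64  # assumed ring size, chosen for this illustration only


def share(value, n_servers=3):
    """Split an integer into n additive shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_servers - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Recombine all shares; an incomplete set of shares looks uniformly random."""
    return sum(shares) % MODULUS


def add_shared(a_shares, b_shares):
    """Each server adds its own two shares locally, without seeing the inputs."""
    return [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]


# Two small columns, shared value by value across three hypothetical servers.
column_a = [share(v) for v in (1, 2)]
column_b = [share(v) for v in (3, 4)]

# The servers compute on shares; values only appear when shares are recombined.
summed = [add_shared(a, b) for a, b in zip(column_a, column_b)]
opened = [reconstruct(s) for s in summed]
```

In this sketch no single server ever holds a value in the clear, which is why the client-side objects described here act as handles rather than containers of data.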
DataFrames and Series
CDataFrame
A CDataFrame represents a table and allows you to interact with the table stored in the engine as if it were a Python object. In most cases, you can treat it as if it were simply a table. For example, if we have the name of a column, we can use standard bracket notation to access it:
import crandas as cd
df = cd.DataFrame({'A': [1, 2], 'B': [3, 4]})
colA = df['A']
There are some important differences. As expected, attempting to print a CDataFrame will not print the table, but just some of the metadata that crandas needs to keep track of the table:
>>> print(df)
7C7621642C15C81A52D6DA9562E3C5F8...[2x2]
Note
Due to its structure, a CDataFrame does not have an index.
CSeries
A CSeries is an object representing a column or a function applied to one or more table columns. Unlike a CDataFrame, a CSeries is not as straightforward to manipulate directly. crandas processes operations lazily, only communicating with the server when an output is needed. The object that a CSeries represents can be accessed through CSeries.as_table().
df = cd.DataFrame({'A': [1, 2], 'B': [3, 4]})
series1 = df["A"]
series2 = df["A"] + 1
>>> print(series1)
<crandas.crandas.CSeriesColRef object at 0x7fa2cc1cec80>
>>> print(series2)
<crandas.crandas.CSeriesFun object at 0x7fa2aa4505e0>
>>> print(series1.as_table().open())
0 1
1 2
>>> print(series2.as_table(column_name="A+1").open())
A+1
0 2
1 3
The names CSeriesFun and CSeriesColRef refer to types of CSeries, the former referring to a function and the latter to a column reference.
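The distinction between a column reference and a function, and the lazy evaluation behind it, can be sketched with two toy classes. This is not how crandas is implemented; ColRef, Fun and evaluate are hypothetical names used only to show how an expression can be recorded first and computed later.

```python
class ColRef:
    """Toy stand-in for a column reference: stores a name, not data."""

    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        return Fun(self, other)

    def evaluate(self, table):
        return list(table[self.name])


class Fun:
    """Toy stand-in for a function (here: addition) applied to columns."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def __add__(self, other):
        return Fun(self, other)

    def evaluate(self, table):
        left = self.left.evaluate(table)
        if isinstance(self.right, (ColRef, Fun)):
            right = self.right.evaluate(table)
        else:
            # A plain number is applied to every row of the column.
            right = [self.right] * len(left)
        return [a + b for a, b in zip(left, right)]


server_table = {"A": [1, 2], "B": [3, 4]}  # stands in for the remote table

series1 = ColRef("A")       # a reference only; no data is touched
series2 = ColRef("A") + 1   # builds a Fun object; nothing is computed yet

result = series2.evaluate(server_table)  # the work happens only here
```

In this sketch, building series2 costs nothing; only the final evaluate call touches the table, which mirrors the role that as_table() plays in the example above.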
Columns
When accessing a column of a CDataFrame, we receive a CSeriesColRef. This crandas object contains additional information, such as the index of the column in its table and the type of the column. These are accessed through CSeriesColRef.ix() and CSeriesColRef.type() respectively.
>>> df["B"].ix() # Show the index of column "B".
1
>>> df["B"].type()
Col("B", "i", 1)
Here, the i in Col("B", "i", 1) indicates that the column contains integer values. For more information, see Working with numeric data.
In the next section, we will see how to do simple tabular manipulations of our CDataFrames.