
crandas.crandas

Main crandas functionality: dataframes (DataFrame), series (CSeries), and analysis operations (e.g., merge())

CDataFrame = DataFrame module-attribute

Deprecated

Use of this alias is deprecated

Alias for DataFrame

CIndex(cols, **kwargs)

Index (set of columns) of a crandas DataFrame

For a regular DataFrame, this represents the columns (name and type) of the DataFrame.

For a deferred DataFrame (in a transaction, or resulting from a dry run), this represents the columns (name and type) that the result of an operation is expected to have based on its inputs. For such an expected column, the name is set, but the type and size ("elements per value") may be undefined.

Constructor

PARAMETER DESCRIPTION
cols

list of columns (Col)

TYPE: list

ctype property

Ctype corresponding to the CIndex; see Ctype Schemas

schema property

Schema corresponding to the CIndex; see Ctype Schemas

__eq__(other)

Checks equality with input

__getitem__(ix)

Returns name of column ix

__len__()

Returns number of columns

get_loc(name)

Get integer location for requested label

PARAMETER DESCRIPTION
name

column name label

TYPE: str

RETURNS DESCRIPTION
int

index of column with name name

RAISES DESCRIPTION
KeyError

value not found

matches_template(expected)

Checks whether the number and names of columns fit a template

to_dict()

Returns column names in dictionary form

CSeries(**kwargs)

Bases: Summable

One dimensional array which represents either the column of a DataFrame or the result of applying a rowwise function to one or more columns of a cd.DataFrame

notnull = notna class-attribute instance-attribute

Alias for notna

DT(outer_instance)

Used to retrieve date units in the pandas way

all(*, mode='open', **query_args)

Computes whether all values of the boolean series are true

See here for a description of the arguments.

any(*, mode='open', **query_args)

Computes whether any value of the boolean series is true

See here for a description of the arguments.

as_series(**query_args)

Return crandas.crandas.ReturnValue representing the series

as_table(*, column_name='', **query_args)

Outputs a DataFrame that has the CSeries as its only column

PARAMETER DESCRIPTION
column_name

name for the column in the resulting DataFrame

TYPE: str DEFAULT: ''

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

DataFrame having the expected CSeries as its only column

as_value(**query_args)

Interpret single-row CSeries as value

Interpret single-row CSeries as a constant. For example, without as_value, a["col2"]+b["col2"] performs row-wise addition. With as_value, a single-row CSeries is interpreted as a single value instead of as a column, e.g., a["col2"]+b["col2"].as_value() adds the single value in column col2 of table b to each row of a["col2"].

This function can be used to work with values that remain secret to the servers that perform the computation, e.g.:

data = cd.DataFrame({"a": [1, 2, 3]})

# The value "1" is a part of the function definition and so becomes known
# to the servers
data[data["a"] == 1]

# The value "1" is derived from a single-row column of a private table and
# so remains hidden to the servers
filtervalue = cd.DataFrame({"filtervalue": [1]})["filtervalue"].as_value()
data[data["a"] == filtervalue]
PARAMETER DESCRIPTION
query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
ReturnValue

ReturnValue representing the value of the (only row of the) CSeries

astype(ctype, validate=False)

Converts output to a specific type

PARAMETER DESCRIPTION
ctype

Type to convert to. See data-types

TYPE: Ctypes type specification

validate

If set, validate that the resulting column is of the correct type, e.g., is an 8-bit integer when ctype is uint8.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
CSeries

CSeries converted to given type

RAISES DESCRIPTION
EngineError

Conversion failed or not supported

b64decode()

Decodes a Base64-encoded string as a bytes column

b64encode()

Encodes a bytes column as a Base64-encoded string

bytes_to_hex()

Converts a bytes column to a lowercase hex string (like "3ec9")

capitalize()

Returns string with the first character converted to uppercase and all other ones to lowercase

contains(other)

Substring search

Searches for substring in the column

PARAMETER DESCRIPTION
other

Substring to search for

RETURNS DESCRIPTION
CSeriesFun

Result of search: 1 if substring is found, 0 otherwise

corr(other, method='pearson', min_periods=None)

Compute correlation with other CSeries. This process computes the Pearson correlation coefficient, which is a measure of linear correlation. For more information, see https://en.wikipedia.org/wiki/Pearson_correlation_coefficient.

PARAMETER DESCRIPTION
other

CSeries with which to compute the correlation.

TYPE: CSeries

method

Method used to compute the correlation. Currently only "pearson" is supported.

TYPE: str DEFAULT: "pearson"

min_periods

Minimum number of observations needed to have a valid result.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
CSeries

Correlation with other.
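
As a plain-Python illustration of the quantity being computed (not the crandas implementation, which works on data held in the engine), the Pearson coefficient of two equally long sequences can be sketched as follows. The helper name pearson is hypothetical and does not exist in crandas.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, all without the common 1/N factor (it cancels).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r_pos = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear, increasing
r_neg = pearson([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly linear, decreasing
```

Values close to 1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship.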

count(*, mode='open', **query_args)

Computes the number of not-NULL elements of the series

See here for a description of the arguments.

day()

Returns the day of the month

RETURNS DESCRIPTION
int

day of the month

day_of_year()

Returns the day of the year

RETURNS DESCRIPTION
int

Number representing the day of the year

dayofyear()

Returns the day of the year

RETURNS DESCRIPTION
int

Number representing the day of the year

drop_duplicates(*, keep='first', inplace=False, ignore_index=True)

Remove duplicate rows from CSeries.

PARAMETER DESCRIPTION
keep

Determines which duplicates to keep.

* 'first': only keep the first occurrence.
* 'last': only keep the last occurrence.
* 'any': keep a single random occurrence; this option is more efficient.
* False: remove all occurrences.

DEFAULT: 'first'

RETURNS DESCRIPTION
CSeries

CSeries with the duplicates removed

RAISES DESCRIPTION
ValueError

keep must be either "first", "last", "any" or False
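
The deterministic keep options can be illustrated with a plain-Python model on a list. This is only a sketch of the selection rule, not of how crandas computes it. The row order of the output follows the pandas convention, which is an assumption here since this reference does not state it. The 'any' option is left out because its result is random.

```python
def drop_duplicates(values, keep="first"):
    """Plain-Python model of keep='first', keep='last' and keep=False."""
    if keep not in ("first", "last", False):
        raise ValueError('this sketch models keep="first", "last" or False only')
    if keep is False:
        # Remove every value that occurs more than once.
        counts = {}
        for v in values:
            counts[v] = counts.get(v, 0) + 1
        return [v for v in values if counts[v] == 1]
    # For "last", scan from the end so the final occurrence is seen first.
    ordered = values if keep == "first" else list(reversed(values))
    seen = set()
    out = []
    for v in ordered:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out if keep == "first" else out[::-1]


data = [3, 1, 3, 2, 1]
first = drop_duplicates(data, keep="first")    # keeps the 3 and 1 seen first
last = drop_duplicates(data, keep="last")      # keeps the 3 and 1 seen last
none_kept = drop_duplicates(data, keep=False)  # only the unique value remains
```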

encode(encoding='utf-8', *, output_size=None)

Encodes an ASCII varchar column as a bytes column

PARAMETER DESCRIPTION
encoding

The encoding in which to encode the string column. Currently, only "utf-8" is supported.

TYPE: string DEFAULT: "utf-8"

output_size

Maximum bytes length of the output column. By default equal to twice the maximum string length of the input column. Needs to be set when this default is not large enough, which can happen when the input contains many complex Unicode characters. Can also be set to a lower value to improve the performance of later operations.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
CSeries

CSeries with the encoded bytes column
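
To see why the default output_size can be too small, the sketch below counts UTF-8 bytes per character in plain Python. It assumes that string length is counted in Unicode code points, which this reference does not confirm. The column contents are made up for illustration.

```python
# One code point can take 1 to 4 bytes in UTF-8.
samples = ["abc", "\u00e9", "\u20ac", "\U0001F600"]  # abc, e-acute, euro sign, emoji
byte_lengths = {s: len(s.encode("utf-8")) for s in samples}

# Hypothetical column whose longest string has 2 characters: the default
# output size described above would be 2 * 2 = 4 bytes.
column = ["hi", "\u20ac\u20ac"]
max_chars = max(len(s) for s in column)
default_output_size = 2 * max_chars
needed = max(len(s.encode("utf-8")) for s in column)
```

Here needed exceeds default_output_size, which is the situation in which output_size has to be set explicitly.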

fillna(nullval)

Replaces NULL values in the column by nullval

PARAMETER DESCRIPTION
nullval

Value to replace NULLs by

TYPE: rowwise function

filter_chars(chars)

Filters strings according to a set of permitted characters

Example
>>> import crandas as cd
>>> df = cd.DataFrame({"a": ["apple"]}, auto_bounds=True)
>>> df["a"].filter_chars(["a", "p", "e"]).open()
0    appe
PARAMETER DESCRIPTION
chars

Permitted characters

TYPE: str iterable

RETURNS DESCRIPTION
CSeries

CSeries with filtered strings

fullmatch(pattern, *args)

Regular expression matching

Matches column to a regular expression.

PARAMETER DESCRIPTION
pattern

Regular expression to match

TYPE: Re

args

Additional columns for the match (can be referred to by (?1) to (?9) in the regular expression, e.g., r".*(?1).*")

TYPE: list of crandas.crandas.CSeries DEFAULT: ()

RETURNS DESCRIPTION
CSeriesFun

Column containing the result of the matching: 1 if there is a match and 0 otherwise.

get(*, name='', **query_args)

Deprecated. Use .as_table() instead.

hex_to_bytes()

Converts a lowercase hex string (like "3ec9") to a bytes column

if_else(ifval, elseval, **query_args)

Select values from ifval or elseval depending on self

Acts as an if-else statement by selecting the value from ifval for rows where self is equal to one, and selecting the value from elseval for rows where self is equal to zero.

For example, (cdf["a"]>0).if_else(cdf["a"], 0) returns a CSeries that has the value cdf["a"] if the condition (cdf["a"]>0) is satisfied, and the value 0 otherwise. Use for example as cdf.assign(a=(cdf["a"]>0).if_else(cdf["a"], 0)).

This function can also be applied to two instances of DataFrame. This is useful for assigning values to multiple columns at the same time based on the same condition. In this case, a DataFrame is returned that, for each row, contains the values from ifval for rows where self is equal to one, and the values from elseval for rows where self is equal to zero. The arguments need to have the same set of columns.

In this latter variant, if the order of the columns of ifval and elseval is different, the columns are returned in the order of ifval. For example, (cdf["a"]>0).if_else(cdf, cdf2) selects values from the dataframe cdf whenever (cdf["a"]>0), and from cdf2 otherwise. Since if_else is applied rowwise, it is important to ensure that the order of the rows in self, ifval, and elseval is the same, e.g., elseval should not be obtained from ifval by a filtering, by an inner join, or similar.

Tip

cond.if_else(ifval, elseval) is equivalent to ifval.where(cond, elseval).

PARAMETER DESCRIPTION
ifval

Value(s) if true

TYPE: DataFrame; or CSeries or compatible

elseval

Value(s) otherwise

TYPE: DataFrame; or CSeries or compatible

RETURNS DESCRIPTION
DataFrame (if both arguments are DataFrame) or CSeries (otherwise)

Values from ifval or elseval, depending on self
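
The row-wise selection rule, and its relation to where and mask, can be modelled on plain Python lists. This is a sketch of the semantics only; the three helper functions are stand-ins written for this example, not crandas calls.

```python
def if_else(cond, ifval, elseval):
    """Row-wise selection: ifval where cond is 1, elseval where cond is 0."""
    return [i if c else e for c, i, e in zip(cond, ifval, elseval)]


def where(self_vals, cond, other):
    """Keep self_vals where cond is true, take other elsewhere."""
    return [s if c else o for s, c, o in zip(self_vals, cond, other)]


def mask(self_vals, cond, other):
    """Take other where cond is true, keep self_vals elsewhere."""
    return [o if c else s for s, c, o in zip(self_vals, cond, other)]


a = [-2, 5, 0, 7]
cond = [int(x > 0) for x in a]  # [0, 1, 0, 1]
zeros = [0] * len(a)

clipped = if_else(cond, a, zeros)  # non-positive entries replaced by 0
```

All three forms give the same result: if_else(cond, a, zeros), where(a, cond, zeros) and mask(zeros, cond, a). All of them rely on the rows being aligned.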

inner(other)

Inner product of two vectors

isin(values)

Check per row whether value is contained in the given list of values.

For example cdf["a"].isin([1,2,3]) checks, for each row, whether the value of the column a is equal to 1, 2, or 3. The provided values can also be functions, that are evaluated row-wise. For example, cdf["a"].isin([cdf["b"], cdf["c"]]) checks for each row whether the value of column a in that row is equal to the value of column b in that row or the value of column c in that row.

It is possible to use placeholders for individual values of the list (e.g., cdf["a"].isin([1, Any(2)]) is allowed) but not for the list as a whole (e.g., cdf["a"].isin(Any([1,2])) is not allowed).

PARAMETER DESCRIPTION
values

List of values to check against. Each value needs to be a valid rowwise function (e.g., a constant, a column cdf["col"] or a computation cdf["col"]+1)

TYPE: list

RETURNS DESCRIPTION
CSeries

Rowwise function (for use in filter, assign, ...) representing whether the given value is in the given list
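
The difference between constant values and row-wise values in the list can be modelled on plain Python lists as follows. This is a sketch of the semantics; the helper isin below is written for this example and is not the crandas method.

```python
def isin(column, values):
    """Row-wise membership test; each entry of values is a constant or a column."""
    n = len(column)
    # Broadcast constants to full-length columns so every candidate is row-aligned.
    candidates = [v if isinstance(v, list) else [v] * n for v in values]
    return [int(any(c[i] == column[i] for c in candidates)) for i in range(n)]


a = [1, 5, 3]
b = [9, 5, 0]
c = [1, 0, 0]

in_constants = isin(a, [1, 2, 3])  # compare each row of a against fixed values
in_columns = isin(a, [b, c])       # compare a against b and c in the same row
```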

isna()

Returns whether respective values are NULL, boolean inverse of notna

len()

Returns the character length of each element of the CSeries (only works for CSeries of type string or bytes)

RETURNS DESCRIPTION
CSeriesFun

CSeries of character lengths (for string) or number of bytes (for bytes)

lower(indices=None)

Returns string values in lowercase

PARAMETER DESCRIPTION
indices

Represents the letters to be modified, if None then the whole string is modified

TYPE: int list DEFAULT: None

RETURNS DESCRIPTION
CSeries

CSeries with lowercase strings

RAISES DESCRIPTION
ValueError

Invalid index for string length

mask(cond, other=pd.NA, *, inplace=False, axis=None, level=None)

Replace value where the condition is True.

Replacement is performed row-wise. Returns a CSeries that can be used with .assign(), etc.

Tip

elseval.mask(cond, ifval) is equivalent to cond.if_else(ifval, elseval).

PARAMETER DESCRIPTION
cond

Where cond is False, keep the original value. Where True, replace with the corresponding value from other.

TYPE: CSeries or equivalent

other

Entries where cond is True are replaced with the corresponding value from other.

TYPE: CSeries or equivalent DEFAULT: NA

RETURNS DESCRIPTION
CSeries

Values from self, replaced by values from other conditional on cond

max(*, mode='open', **query_args)

Computes the maximum of the series

See here for a description of the arguments.

mean(*, mode='open', **query_args)

Computes the mean of the elements of the series.

See here for a description of the arguments.

min(*, mode='open', **query_args)

Computes the minimum of the series

See here for a description of the arguments.

month()

Returns the month in number format

RETURNS DESCRIPTION
int

month

notna()

Returns whether respective values are not NULL, boolean inverse of isna

open(*, limit=None, offset=None)

Returns column in opened form

PARAMETER DESCRIPTION
limit

Limit to the number of opened rows. The number of returned rows will be the minimum of the number of rows remaining and the provided limit.

TYPE: int DEFAULT: None

offset

Offset of the opened rows. Start returning rows from the provided 0-based index.

TYPE: int DEFAULT: None
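
The interaction of limit and offset behaves like list slicing. The following plain-Python model illustrates this under that reading of the parameter descriptions; open_rows is a hypothetical helper, not a crandas function.

```python
def open_rows(rows, limit=None, offset=None):
    """Model of limit/offset: skip offset rows, then return at most limit rows."""
    start = 0 if offset is None else offset
    stop = None if limit is None else start + limit
    return rows[start:stop]


rows = [10, 20, 30, 40, 50]
page = open_rows(rows, limit=2, offset=1)
tail = open_rows(rows, limit=10, offset=3)  # fewer rows remain than the limit
```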

sqrt()

Returns the element-wise square root of this column, which should be numeric.

Note: in case there are negative numbers in the column, this function will throw an error.

RETURNS DESCRIPTION
CSeries

CSeries containing the square roots of the inputs

std(*, mode='open', ddof=1, **query_args)

Computes the standard deviation of the series.

See here for a description of the arguments.

PARAMETER DESCRIPTION
ddof

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, with N the number of rows. A ddof of 1 corresponds to the sample std and a ddof of 0 to the population std.

TYPE: int DEFAULT: 1
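
The effect of ddof on the divisor can be checked with a small plain-Python computation. This is an illustration of the formula only, not of the crandas implementation.

```python
import math


def std(values, ddof=1):
    """Standard deviation with divisor N - ddof."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - ddof))


values = [2, 4, 4, 4, 5, 5, 7, 9]     # mean 5, sum of squared deviations 32
sample_std = std(values, ddof=1)      # divisor 7
population_std = std(values, ddof=0)  # divisor 8, giving exactly 2.0
```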

strip()

Returns stripped string values

substitute(sub_dict, output_size=None)

Performs string substitution in a string column. Inputs are provided in a dictionary of the form "a": ["á", "à", "ä"] where the characters in the list ["á", "à", "ä"] will be substituted by the key "a". Substitution of substrings of more than one character is not currently supported.

PARAMETER DESCRIPTION
sub_dict

Dictionary where each key is the string (may be more than one character) to be added and each value is a list of characters to be substituted by the key

TYPE: str Dictionary

output_size

Maximum string length of the output column. By default equal to the maximum string length of the input column. Needs to be set when this default is not large enough, which can happen when substituting a character for multiple ones.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
CSeries

CSeries with modified strings

RAISES DESCRIPTION
TypeError

Values of sub_dict in substitute need to be a list of characters

TypeError

Every character should match at most one substitution.
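
The shape of sub_dict and the two error conditions can be modelled per string with Python's str.translate. This is a sketch of the documented behaviour on a single plain string; the helper below is written for this example and is not the crandas method.

```python
def substitute(text, sub_dict):
    """Replace each listed character by the key it is listed under."""
    mapping = {}
    for replacement, chars in sub_dict.items():
        if not isinstance(chars, list):
            raise TypeError("values of sub_dict need to be lists of characters")
        for ch in chars:
            if ord(ch) in mapping:
                raise TypeError("every character should match at most one substitution")
            mapping[ord(ch)] = replacement
    return text.translate(mapping)


# a-acute, a-grave and a-umlaut are all replaced by a plain "a".
accents = {"a": ["\u00e1", "\u00e0", "\u00e4"]}
normalized = substitute("\u00e1b\u00e0c\u00e4", accents)

# A multi-character key makes the output longer than the input, which is
# the case where output_size has to be raised.
longer = substitute("stra\u00dfe", {"ss": ["\u00df"]})
```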

sum(*, mode='open', **query_args)

Computes the sum of the elements of the series.

PARAMETER DESCRIPTION
mode

mode in which to perform queries that return objects ("open" / "defer" / "regular"), by default "open"

TYPE: str DEFAULT: 'open'

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
int / float / Deferred / DataFrame / DataFrame

Result of applicable type, depending on as_table and mode. Note that the return value is a regular Python int/float, rather than a numpy int/float.

sum_squares(*, mode='open', **query_args)

Computes the sum of squares of the elements of the series.

See here for a description of the arguments.

upper(indices=None)

Returns string values in uppercase

PARAMETER DESCRIPTION
indices

Represents the letters to be modified, if None then the whole string is modified

TYPE: int list DEFAULT: None

RETURNS DESCRIPTION
CSeries

CSeries with uppercase strings

RAISES DESCRIPTION
ValueError

Invalid index for string length

var(*, mode='open', ddof=1, **query_args)

Computes the variance of the series.

See here for a description of the arguments.

PARAMETER DESCRIPTION
ddof

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, with N the number of rows. A ddof of 1 corresponds to the sample variance and a ddof of 0 to the population variance.

TYPE: int DEFAULT: 1

vsum()

Sum the elements of a vector

weekday()

Returns the day of the week, where Monday is 0

RETURNS DESCRIPTION
int

Number representing the day of the week
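
Python's standard library uses the same numbering (Monday is 0), so a plain datetime.date can be used to check what to expect for a given date. This is only a cleartext cross-check, not a crandas call.

```python
import datetime

monday = datetime.date(2024, 1, 1)  # 1 January 2024 was a Monday
sunday = datetime.date(2024, 1, 7)

monday_number = monday.weekday()  # Monday is 0
sunday_number = sunday.weekday()  # Sunday is 6
```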

where(cond, other=pd.NA, *, inplace=False, axis=None, level=None)

Replace value where the condition is False.

Replacement is performed row-wise. Returns a CSeries that can be used with .assign(), etc.

Tip

ifval.where(cond, elseval) is equivalent to cond.if_else(ifval, elseval).

PARAMETER DESCRIPTION
cond

Where cond is True, keep the original value. Where False, replace with corresponding value from other.

TYPE: CSeries or equivalent

other

Entries where cond is False are replaced with the corresponding value from other.

TYPE: CSeries or equivalent DEFAULT: NA

RETURNS DESCRIPTION
CSeries

Values from self, replaced by values from other conditional on cond

with_threshold(threshold)

Adds a threshold to the CSeries. When the column is used as a filtering column or in an aggregation operation, this threshold indicates the minimum number of items that need to be in the filtering result or have the aggregation taken over.

PARAMETER DESCRIPTION
threshold

minimum number of elements for operation to be allowed

TYPE: (int, non-negative)

withnull(nulls)

Set values indicated by nulls to NULL

PARAMETER DESCRIPTION
nulls

Indicates with Trues which rows to replace by NULL

TYPE: rowwise function

year()

Returns the year

RETURNS DESCRIPTION
int

year in 4 digits

CSeriesColRef(table, name, **kwargs)

Bases: CSeries

Column of DataFrame

Subclass of CSeries. Represents a column of a DataFrame df as accessed via df["colname"] or lambda x: x.colname.

ctype property

Ctype for the column; see Ctype Schemas

schema property

Schema for the column; see Ctype Schemas

all(*, as_table=False, threshold=None, **query_args)

Computes whether all values of the boolean series are true

See here for a description of the arguments.

any(*, as_table=False, threshold=None, **query_args)

Computes whether any value of the boolean series is true

See here for a description of the arguments.

count(*, as_table=False, threshold=None, **query_args)

Computes the number of not-NULL elements of the series

See here for a description of the arguments.

get(*, name='', **query_args)

Deprecated. Use .as_table() instead. See documentation of CSeries.get for details about deprecation.

in_range(minval, maxval)

Validation that column values lie in specified range

Applies to numerical/numerical vector columns only

PARAMETER DESCRIPTION
minval

minimum (inclusive)

maxval

maximum (inclusive)

RETURNS DESCRIPTION
Validation

Validator for use in crandas.crandas.DataFrame.validate

ix()

Returns the index of the column

max(*, as_table=False, threshold=None, **query_args)

Computes the maximum of the series

See here for a description of the arguments.

mean(*, as_table=False, threshold=None, **query_args)

Computes the mean of the elements of the series.

See here for a description of the arguments.

min(*, as_table=False, threshold=None, **query_args)

Computes the minimum of the series

See here for a description of the arguments.

std(*, ddof=1, as_table=False, threshold=None, **query_args)

Computes the standard deviation of the series.

See here for a description of the arguments.

PARAMETER DESCRIPTION
ddof

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, with N the number of rows. A ddof of 1 corresponds to the sample std and a ddof of 0 to the population std.

TYPE: int DEFAULT: 1

sum(*, as_table=False, threshold=None, **query_args)

Computes the sum of the elements of the series

PARAMETER DESCRIPTION
as_table

if True, result is returned as pd.DataFrame instead of value

TYPE: boolean DEFAULT: False

threshold

if given, only return value as long as the number of not-NULL elements is above the minimum threshold of elements for the operation

TYPE: Optional[MaybePlaceholder[NonNegativeInt]] DEFAULT: None

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
int / float / Deferred / DataFrame / DataFrame

Result of applicable type, depending on as_table and mode. Note that the return value is a regular Python int/float, rather than a numpy int/float.

sum_in_range(minval, maxval)

Validation that sum of column values lies in specified range

Applies to integer/integer vector columns only

PARAMETER DESCRIPTION
minval

minimum (inclusive)

maxval

maximum (inclusive)

RETURNS DESCRIPTION
Validation

Validator for use in crandas.crandas.DataFrame.validate

sum_squares(*, as_table=False, threshold=None, **query_args)

Computes the sum of squares of the elements of the series.

See here for a description of the arguments.

type()

Returns the type of a column

var(*, ddof=1, as_table=False, threshold=None, **query_args)

Computes the variance of the series.

See here for a description of the arguments.

PARAMETER DESCRIPTION
ddof

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, with N the number of rows. A ddof of 1 corresponds to the sample variance and a ddof of 0 to the population variance.

TYPE: int DEFAULT: 1

CSeriesFun(op, vals, args={}, **kwargs)

Bases: CSeries

Subclass of CSeries that represents the result of applying a function to one or more values

Col(name, type='?', elperv=-1, nullable=False, constraints=None, modulus=None, _ctype=None, _schema=None, **kwargs)

Represents the type of a column.

The type and elperv fields can be equal to "?" and -1, respectively, to indicate that these are not known (e.g., for columns in an expected specification to vdl_query).

PARAMETER DESCRIPTION
name

column name

TYPE: str

type

column type, this can be "?" if the column type is not known (for example, a function return value of a Transaction).

TYPE: str DEFAULT: '?'

elperv

number of secret-shared elements that represent a single column value. This can be -1 if not known.

TYPE: int DEFAULT: -1

ctype property

Ctype for the column; see Ctype Schemas

schema property

Schema for the column; see Ctype Schemas

__eq__(other)

Checks structural equality between columns

__repr__()

Returns printable representation

renamed(name)

Return copy of the Col with a different name

CtypeSpuriousColumnsWarning(*, columns, **kwargs)

Bases: UserWarning

Warning that ctype contains spurious columns

For example, below, the ctype has a spurious column b:

cd.DataFrame({"a": [1]}, ctype={"a": "int", "b": "varchar"})

DataFrame(data=None, *args, ctype=None, schema=None, auto_bounds=None, **kwargs)

Bases: StateObject

Dataframe stored in the engine

The DataFrame class provides access to tables stored in the engine using an API modeled upon pandas DataFrame.

The constructor creates and uploads a DataFrame similarly to Pandas. A DataFrame may alternatively be obtained:

To see detailed information about the columns of a DataFrame, use the .columns attribute.

The constructor calls the pandas DataFrame constructor pd.DataFrame(data, *args, ...), and uploads the resulting table using upload_pandas_dataframe.

Further arguments apart from data are passed to pd.DataFrame or upload_pandas_dataframe as appropriate. In particular, name can be used to specify a name for the table, and auto_bounds can be used to disable warnings about automatically derived column bounds; see Query Arguments. Further, the ctype and schema arguments can be used to define columns and their types; see upload_pandas_dataframe.

When specifying data as a dict of Series arguments, it is possible to use crandas.crandas.Series instead of pd.Series. This makes it possible to specify the crandas ctype directly with the data.

PARAMETER DESCRIPTION
data

Contents of the dataframe, passed on to pd.DataFrame.

If data is None and schema is given, an empty dataframe according to the given schema is created. Otherwise, an empty dataframe with no columns is created.

TYPE: any DEFAULT: None

ctype

ctypes for columns; see Ctype Schemas

TYPE: dict or str DEFAULT: {}

schema

schema for uploaded data; see Ctype Schemas

DEFAULT: None

auto_bounds

TYPE: bool DEFAULT: None

columns = ObjectProperty('_columns') class-attribute instance-attribute

Columns of the DataFrame, call .columns.cols to iterate; see crandas.crandas.CIndex

ctype property

Ctype for the table; see Ctype Schemas

nrows = ObjectProperty('_nrows') class-attribute instance-attribute

Number of rows of the DataFrame

schema property

Schema for the table; see Ctype Schemas

shape property

Returns a pair with the number of rows and number of columns

__getitem__(key)

__getitem__(key: CSeries) -> DataFrame
__getitem__(key: list) -> DataFrame
__getitem__(key: str) -> CSeriesColRef
__getitem__(key: slice) -> DataFrame

Implements df[key]

RAISES DESCRIPTION
TypeError

the key must be one of the accepted types

__setitem__(key, value)

Implements self[key] = value. Updates the dataframe inplace with the new column.

PARAMETER DESCRIPTION
key

Column name to be assigned.

TYPE: str

value

Value to assign to the specified column. See .assign() for details.

TYPE: CSeries / callable function / DataFrame / constant value

add_prefix(prefix)

Implements pandas.DataFrame.add_prefix

PARAMETER DESCRIPTION
prefix

prefix to be added

TYPE: str

RETURNS DESCRIPTION
DataFrame

Copy of dataframe where prefix is added to all column names

add_suffix(suffix)

Implements pandas.DataFrame.add_suffix

PARAMETER DESCRIPTION
suffix

suffix to be added

TYPE: str

RETURNS DESCRIPTION
DataFrame

Copy of dataframe where suffix is added to all column names

append(other, ignore_index=True)

Implements pandas.DataFrame.append by calling crandas.concat accordingly

PARAMETER DESCRIPTION
other

The data to append.

TYPE: DataFrame

ignore_index

If True, the resulting axis will be labeled 0, 1, …, n - 1. Currently only True is allowed.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
DataFrame

The concatenated table

assign(query_args=None, **assignments)

Implements pandas.DataFrame.assign. Assigns new columns to a DataFrame, and outputs a new DataFrame with the new columns.

Assigned values need to be CSeries (or a callable providing a CSeries); assignment of a clear Series/array is not supported.

To pass query arguments such as the target table name, specify them as values of the query_args dict, e.g.:

cdf.assign(newcol=1, query_args={"name": "tablename"})

To create a column whose name could be an engine query argument, add a query_args argument to disambiguate, e.g.:

cdf.assign(name="Column value", query_args={})

Avoid passing a generic dictionary like cdf.assign(**assignments) as this is ambiguous when assignments contains a column that is also a query_arg. Use crandas.crandas.DataFrame.assign_dict for this instead.

assign_dict(assignments, *, query_args=None, inplace=False)

assign_dict(
    assignments,
    *,
    query_args=...,
    inplace: Literal[False] = ...
) -> DataFrame
assign_dict(
    assignments, *, query_args=..., inplace: Literal[True]
) -> None

Assigns new columns to a DataFrame, and outputs a new DataFrame with the new columns or updates the current DataFrame inplace. Similar to .assign(), but takes the assignments as a dict instead of as keywords and allows for inplace operation.

PARAMETER DESCRIPTION
assignments

Dictionary containing the assignments.

TYPE: dict

query_args

See Query Arguments. Note that this is taken as a dictionary.

TYPE: (optional, dict) DEFAULT: None

inplace

Boolean indicating whether the operation should be performed inplace. When False, a new DataFrame is created and returned. When True, the current DataFrame is updated and also returned.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
DataFrame

The updated dataframe.

astype(ctype=None, *, schema=None, validate=False, **query_args)

Converts DataFrame to specified ctype or schema

If a ctype is given, convert the respective columns to the respective types, keeping other columns intact. If a schema is given, ensure that the resulting table conforms to the schema in terms of columns, order, and types (dropping columns from the table if they do not occur in the schema).

PARAMETER DESCRIPTION
ctype

ctypes for columns; see Ctype Schemas, by default None

TYPE: dict or str DEFAULT: None

schema

schema for uploaded data; see Ctype Schemas, by default None

TYPE: dict DEFAULT: None

validate

whether to validate the conversion, by default False

TYPE: bool DEFAULT: False

query_args

DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

Dataframe with the set ctypes and columns

RAISES DESCRIPTION
ValueError

Provide ctype or schema, but not both

ValueError

Need to specify ctype or schema

corr(method='pearson', min_periods=None, numeric_only=False, dropna=False)

Compute pairwise correlation of columns. This process computes the Pearson correlation coefficient, which is a measure of linear correlation. For more information, see https://en.wikipedia.org/wiki/Pearson_correlation_coefficient.

PARAMETER DESCRIPTION
method

Method used to compute the correlation. Currently only "pearson" is supported.

TYPE: str DEFAULT: "pearson"

min_periods

Minimum number of observations needed to have a valid result.

TYPE: int DEFAULT: None

numeric_only

if True all non-numeric columns are ignored, when False will return an error if a non-numeric column is found, by default False.

TYPE: bool DEFAULT: False

dropna

if True all rows that contain null values are dropped, by default False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
DataFrame

Dataframe of the correlation matrix of numeric columns.

RAISES DESCRIPTION
ValueError

At least two numeric columns are needed to compute correlation.

describe()

Generate descriptive statistics

drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=True)

Remove duplicate rows from DataFrame.

PARAMETER DESCRIPTION
subset

If a string iterable is given, use these as the name of the columns for identifying duplicates. If a string is given, use this as the name of the single column to identify duplicates. If no subset is given, all columns are used to identify duplicates.

DEFAULT: None

keep

Determines which duplicates to keep.

* 'first': only keep the first occurrence.
* 'last': only keep the last occurrence.
* 'any': keep a single random occurrence; this option is more efficient.
* False: remove all occurrences.

DEFAULT: 'first'

RETURNS DESCRIPTION
DataFrame

Dataframe with the duplicates removed

RAISES DESCRIPTION
ValueError

keep must be either "first", "last", "any" or False

ValueError

Subset should be either None, or consist of column names of the table.

dropna(axis=0, subset=None, **query_args)

Remove null values from a DataFrame. The axis determines whether rows (0) or columns (1) are removed. Only rows is implemented. Returns a DataFrame with no nullable columns or values.

PARAMETER DESCRIPTION
axis

determines whether rows (0) or columns (1) are deleted, by default 0. Only row removal (0) is supported.

TYPE: int / string DEFAULT: 0

subset

When present, remove the null values only from the provided columns.

TYPE: list of column names DEFAULT: None

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

The original dataframe with all rows or columns (depending on axis) with a null value removed

RAISES DESCRIPTION
NotImplementedError

Dropping columns is not implemented for structural reasons

ValueError

Error whenever a wrong axis is entered

fillna(value, **query_args)

Fill NULL values

Replace NULL values of nullable columns as specified by the value argument. The resulting columns are not nullable anymore.

If value is a valid argument to crandas.crandas.CSeries.fillna (e.g., an integer, a column, or a function), then this argument is applied to fill in NULLs in all nullable columns of the table (and so value needs to be of the correct type for all nullable columns).

If value is a dict, then the respective dictionary values are provided (as above) to fill in NULLs for the respective column.

If value is a DataFrame, then the respective columns of value are used to fill in NULLs for the corresponding columns of self. This is done only for columns that occur in both dataframes.

PARAMETER DESCRIPTION
value

Values to fill in for NULLs (see above)

TYPE: value / dict / DataFrame

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

Copy of table with NULLs replaced as indicated
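The difference between a single fill value and a per-column dict can be sketched in plain Python. This is a model of the description above, not crandas code: fillna_model is a hypothetical name, None stands in for NULL, and leaving a column untouched when it is missing from the dict is an assumption.

```python
def fillna_model(rows, value):
    """Plain-Python model of filling NULLs with a scalar or a per-column dict."""
    filled = []
    for row in rows:
        new_row = {}
        for name, cell in row.items():
            if cell is not None:
                new_row[name] = cell
            elif isinstance(value, dict):
                # Assumption: columns that are not listed in the dict stay NULL.
                new_row[name] = value.get(name)
            else:
                # A single value is used for every nullable column.
                new_row[name] = value
        filled.append(new_row)
    return filled


rows = [{"a": 1, "b": None}, {"a": None, "b": 2}]
print(fillna_model(rows, 0))
print(fillna_model(rows, {"a": -1, "b": -2}))
```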

filter(key, threshold=None, **query_args)

Filter table

Returns table with all rows of the original table satisfying the criterion represented by key.

key can be a CSeries representing a table column or a computation on table row(s). In this case, the CSeries values need to be 1 (indicating that the corresponding row will be selected) or 0 (indicating that the row will not be selected).

If the CSeries used for indexing has a threshold (see CSeries.with_threshold), the filtered result is only returned if it has the minimum number of rows as indicated by the threshold.

Alternatively, key can be a function to be applied to the table columns. The function is called with one argument representing the table, of which the fields correspond to the columns. E.g., the key lambda x: x.col1==1 represents the function that checks whether the value of the column with name col1 equals one.

See function_to_json for more information.

PARAMETER DESCRIPTION
key

Filter criterion

TYPE: CSeries or callable

threshold

If given, sets a minimum number of rows that the resulting table needs to have; otherwise, the server returns an error. Equivalent to calling filter with key.with_threshold(threshold)

TYPE: (int, non-negative) DEFAULT: None

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

The filtered table
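The interplay between a 0/1 key, a callable key, and the threshold can be modelled in plain Python. This sketch is not crandas code: filter_model is a hypothetical name, the callable is applied row by row for simplicity (crandas applies it to the table columns), and RuntimeError is a stand-in for whatever error the server returns.

```python
from types import SimpleNamespace


def filter_model(rows, key, threshold=None):
    """Plain-Python model of filtering with an optional minimum row count."""
    if callable(key):
        # Expose the columns of each row as fields, as in `lambda x: x.col1 == 1`.
        mask = [1 if key(SimpleNamespace(**row)) else 0 for row in rows]
    else:
        mask = list(key)
    if any(bit not in (0, 1) for bit in mask):
        raise ValueError("key values need to be 0 or 1")
    selected = [row for row, bit in zip(rows, mask) if bit == 1]
    if threshold is not None and len(selected) < threshold:
        # Stand-in for the server-side error mentioned above.
        raise RuntimeError("result has fewer rows than the threshold")
    return selected


rows = [{"col1": 1, "col2": 10}, {"col1": 2, "col2": 20}, {"col1": 1, "col2": 30}]
print(filter_model(rows, lambda x: x.col1 == 1, threshold=2))
```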

groupby(cols, *, dropna=True, bypass_same_value_check=False, threshold=None, **query_args)

Computes a grouping of the table by the values of (a) given column(s). Returns a grouping object that can be used in aggregation (see CSeriesGroupBy) or as an argument to crandas.merge().

PARAMETER DESCRIPTION
cols

If a string iterable is given, use these as the name of the columns to group by. If a string is given, use this as the name of the single column to group by. If a DataFrameGroupBy object (i.e., a result of an earlier groupby operation) is given, re-use that grouping as a grouping of the current table. (The current table needs to have a column with the given name and the same values as the table for which the grouping was originally made.)

TYPE: str iterable or str or DataFrameGroupBy

dropna

If True, any rows with null values will be dropped. If False, null values will be treated as a separate key in groups. Currently, only False is supported when using nullable columns.

DEFAULT: True

bypass_same_value_check

Unless set, if the given cols is a DataFrameGroupBy instance made from another table, it is checked whether the grouping columns have the same values in the current table and the other table. Bypassing this check may avoid problems where the check cannot be performed for large values.

DEFAULT: False

threshold

If given, only succeed as long as all groupings have at least this many elements.

TYPE: Optional[MaybePlaceholder[NonNegativeInt]] DEFAULT: None

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrameGroupBy

Grouping object
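The threshold rule (every group must have at least that many elements) can be sketched as follows. This is a plain-Python model, not crandas code: group_sizes_with_threshold is a hypothetical helper, and RuntimeError is a stand-in for the actual failure mode, which the description above does not specify.

```python
from collections import Counter


def group_sizes_with_threshold(keys, threshold=None):
    """Return the size of each group, failing if any group is below the threshold."""
    sizes = Counter(keys)
    if threshold is not None:
        too_small = [key for key, size in sizes.items() if size < threshold]
        if too_small:
            # Stand-in error type; the real error is raised by the engine.
            raise RuntimeError("some groups have fewer elements than the threshold")
    return dict(sizes)


print(group_sizes_with_threshold(["a", "a", "b", "b"], threshold=2))
```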

max(axis=0)

Computes the maximum of each (numeric) column

PARAMETER DESCRIPTION
axis

Which axis of the dataframe, only 0 is implemented, by default 0

TYPE: int DEFAULT: 0

min(axis=0)

Computes the minimum of each (numeric) column

PARAMETER DESCRIPTION
axis

Which axis of the dataframe, only 0 is implemented, by default 0

TYPE: int DEFAULT: 0

open(*, limit=None, offset=None, **query_args)

Returns table in opened form

PARAMETER DESCRIPTION
limit

Limit to the number of opened rows. The number of returned rows will be the minimum of the number of rows remaining and the provided limit.

TYPE: int DEFAULT: None

offset

Offset of the opened rows. Start returning rows from the provided 0-based index.

TYPE: int DEFAULT: None
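The way limit and offset combine can be shown with ordinary list slicing. This is a minimal sketch of the rule described above, not crandas code; open_window is a hypothetical name.

```python
def open_window(rows, limit=None, offset=None):
    """Return at most `limit` rows, starting at the 0-based index `offset`."""
    start = 0 if offset is None else offset
    remaining = rows[start:]
    # The result has min(len(remaining), limit) rows.
    return remaining if limit is None else remaining[:limit]


rows = list(range(10))
print(open_window(rows, limit=3, offset=8))  # only two rows remain after offset 8
```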

project(cols, **query_args)

Project table

Returns table with same rows but a selection of columns

PARAMETER DESCRIPTION
cols

Columns to select. Can be empty. Columns can occur multiple times.

TYPE: list of str | str

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

The projected table

rename(columns, **query_args)

Implements pandas.DataFrame.rename. Only renaming of columns via the columns argument is supported.

PARAMETER DESCRIPTION
columns

dictionary of columns to be renamed of the form {"oldname": "newname"}

TYPE: dict

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

Dataframe with updated column names

sample(*, n=None, frac=None, random_state=None, **query_args)

Samples rows from the dataframe.

The number of rows can be specified either as an integer n or a fraction frac. The case frac==1 corresponds to returning a shuffling of the table and is equivalent to crandas.crandas.DataFrame.shuffle.

If a random_state is given, the sampling is performed in a deterministic way and according to a public selection (i.e., known to the servers and predictable to the client); otherwise, the sampling is non-deterministic and private (not known to the client and servers). See also crandas.crandas.DataFrame.shuffle.

PARAMETER DESCRIPTION
n

Number of rows to sample

DEFAULT: None

frac

Proportion of rows (between 0.0 and 1.0, inclusive) to sample

DEFAULT: None

random_state

Seed for deterministic sampling (otherwise is non-deterministic)

TYPE: long integer DEFAULT: None

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

Copy of the table with rows sampled

set_axis(labels, *, axis=0, copy=True, **query_args)

Implements pandas.DataFrame.set_axis

Set column names to specified list.

For consistency with pandas, axis=1 or axis='columns' needs to be explicitly specified.

The 'copy' argument is ignored.

PARAMETER DESCRIPTION
labels

List of column names

TYPE: list of str

axis

Axis to update; needs to be set to 1 or 'columns' to update columns

TYPE: 1 or 'columns' DEFAULT: 0

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

Dataframe with updated column names

shuffle(*, random_state=None, **query_args)

Return table with rows shuffled. If a random_state is given, the shuffle is deterministic and performed according to a public permutation (i.e., known to the servers and predictable to the client); otherwise, the shuffle is non-deterministic and private (not known to the client and servers).

PARAMETER DESCRIPTION
random_state

Seed for deterministic shuffle (otherwise is non-deterministic)

TYPE: long integer DEFAULT: None

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

Copy of the table with rows shuffled
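The effect of random_state on reproducibility can be illustrated with Python's standard library. This sketch only models the deterministic versus non-deterministic distinction; it says nothing about how the engine keeps an unseeded permutation private, and shuffle_model is a hypothetical name.

```python
import random


def shuffle_model(rows, random_state=None):
    """Shuffle a copy of `rows`; a seed makes the permutation reproducible."""
    if random_state is None:
        rng = random.SystemRandom()  # not reproducible
    else:
        rng = random.Random(random_state)  # same seed, same permutation
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


rows = list(range(10))
print(shuffle_model(rows, random_state=42) == shuffle_model(rows, random_state=42))
```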

slice(key, allow_fractions=False, **query_args)

Slice table

Returns table with same columns but a selection of rows

PARAMETER DESCRIPTION
key

Python slice object representing rows to select

TYPE: slice

allow_fractions

If set to True, the start and stop values in the slice can be specified as fractions of the total dataset, instead of as fixed offsets. Fractions can be negative as well.

TYPE: bool DEFAULT: False

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

The sliced table

sort_values(by, *, ascending=True, **query_args)

Sorts the dataframe according to the values in the column by. Currently, sorting on strings is not supported.

PARAMETER DESCRIPTION
by

The column to sort on

TYPE: str

ascending

Sort in ascending order if True, in descending order if False.

TYPE: bool DEFAULT: True

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

Copy of the sorted table

validate(*validations, **query_args)

Applies input validation to the table.

Input validation leads to a table that has the validations as constraints on the respective columns (e.g., checking that a column contains values in [0,2] leads to a column with values constrained to that domain). These constraints can be inspected by accessing tab.columns.cols[i].constraints.

Validations are instances of the crandas.crandas.Validation class and can be set by calling validation functions such as in_range() and sum_in_range().

PARAMETER DESCRIPTION
*validations

Validations to apply to the table

TYPE: list of crandas.crandas.Validation objects DEFAULT: ()

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

If all validations have succeeded: copy of the table having the validations as constraints

vdl_query_inplace(cmd, expected, *, inplace=False, session=None, **query_args)

Wrapper around vdl_query allowing for inplace operations

PARAMETER DESCRIPTION
cmd

See vdl_query.

TYPE: (JSON-serializable) dict

expected

Expected columns of the result.

TYPE: CIndex

inplace

Boolean indicating whether an inplace update should be performed. When True, self will contain the result. When False, a new DataFrame is created with the result.

TYPE: bool DEFAULT: False

session

Session object to use. If None, use crandas.base.session.

TYPE: Session / None DEFAULT: None

**query_args

TYPE: extra arguments passed to vdl_query DEFAULT: {}

DataSpuriousColumnsWarning(*, columns, **kwargs)

Bases: UserWarning

Warning that data contains spurious columns

For example, below, the data has a spurious column b:

cd.DataFrame({"a": [1], "b": [2]}, schema={"a": "int"})

ExpectDropResult(*, expected_len, **kwargs)

Bases: ResponseHandler

Response handler for drop response with given expected length

ReturnValue()

Bases: StateObject, CSeries

Represent a value or series of values computed by the engine

Various engine commands, e.g., .sum(), return values or series of values, as opposed to returning a DataFrame. This class is the analogue of DataFrame that is used to represent such remote values.

A ReturnValue can be used as a CSeries, making it possible e.g. to filter on a value computed by the engine without having to open it. For example, the following filters all maximum elements without revealing the maximum: tab[tab["col"]==tab["col"].max(mode="regular")].

To obtain the value/series in the clear, call: crandas.crandas.ReturnValue.open. This returns a single value, unless .is_series is set, in which case it returns a Pandas series, which needs to have .num_rows rows if set.

Validation(table, col, json_desc)

Represents a validation that can be applied to a column.

Returned by functions like in_range(), etc. Used as an argument to .validate.

json_desc can contain a combination of the following keys and values:

  • bounds ([int string, int string]) the lower and upper bounds of the data, represented as strings for arbitrary precision
  • sum_bounds ([int string, int string]) the lower and upper bounds of the sum of an array entry, represented as strings for arbitrary precision
  • precision (int) fixed-point precision
  • is_array (bool) boolean representing whether the column is an array

If the column is of type int, it can have the following keys: bounds, sum_bounds, is_array

If the column is of type fixed point, it can have the following keys: bounds, is_array, precision

Series(*args, ctype=None, **kwargs)

Similar to pandas.Series, but additionally allows the user to specify a ctype

For example:

cd.DataFrame({"a": cd.Series([1,2,3], ctype="int16?")})

choose_threshold(obj_threshold, arg_threshold)

Choose threshold where obj_threshold is given using the with_threshold() function (and hence, set in the DataFrame or CSeries itself), and arg_threshold is specified as an argument to the aggregation function or filter() function.

Returns the threshold or None.

col(name)

Expression representing a column of a DataFrame, similar to the polars function

concat(tables_, *, ignore_index=True, axis=0, join='outer', **query_args)

Table concatenation.

Performs horizontal/vertical concatenation of tables, modelled on pandas pd.concat. Currently, only inner joins are supported for vertical concatenation. The first table defines the set of columns that the resulting table has. If join="inner", only columns common to all tables are included. Otherwise, the remaining tables need to have the same set of columns as the first table (up to ordering), or an error is returned.

PARAMETER DESCRIPTION
tables_

One or more DataFrames to be concatenated

TYPE: list[DataFrame]

ignore_index

does nothing, but is used in crandas.append, by default True

TYPE: bool DEFAULT: True

axis

Concatenation axis, 0=vertical, 1=horizontal, by default 0

TYPE: int DEFAULT: 0

join

type of join (currently only inner join is supported for vertical join), by default "outer"

TYPE: str DEFAULT: 'outer'

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

mode-dependent return table representing vertical/horizontal join

RAISES DESCRIPTION
RuntimeError

Received wrong inputs

NotImplementedError

Limited vertical concatenation is allowed, there must be a matching column on both tables to be concatenated

ValueError

Limited vertical concatenation is allowed, number of columns should be the same in all tables

RuntimeError

Horizontal join would create table with duplicate column names
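The column rule for vertical concatenation described above can be sketched on plain lists of column names. This is a model of the rule only, not crandas code: vertical_concat_columns is a hypothetical name, and keeping the column order of the first table is an assumption.

```python
def vertical_concat_columns(tables_columns, join="outer"):
    """Return the column names a vertical concatenation would produce."""
    first = list(tables_columns[0])
    if join == "inner":
        # Only columns common to all tables are kept.
        common = set(first)
        for columns in tables_columns[1:]:
            common &= set(columns)
        return [name for name in first if name in common]
    # Otherwise every table needs the same column set as the first, up to ordering.
    for columns in tables_columns[1:]:
        if set(columns) != set(first):
            raise ValueError("all tables need to have the same set of columns")
    return first


print(vertical_concat_columns([["a", "b", "c"], ["c", "a"]], join="inner"))
```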

cut(series, bins, *, labels, right=True, add_inf=False)

Bin values into discrete intervals (aka quantization)

Bins values into discrete intervals, a la pandas.cut. Quantizes Series into bins [bins[0],bins[1]), [bins[1],bins[2]), etc., and returns the corresponding bin labels (so labels[0] for bin [bins[0],bins[1]), labels[1] for bin [bins[1],bins[2]), etc.). The bins include the left edge and exclude the right edge.

The first bin should have -np.inf as left edge and the last bin should have np.inf as its right edge. If the argument add_inf is set to true, these edges are automatically added and do not need to be given as arguments.

The bins and labels can be given in the plain (e.g., cd.cut(cdf["col"], [-np.inf, 0, 10, np.inf], labels=[1, 2, 3])), or as columns providing respective bins and labels for the respective input rows (e.g., cd.cut(cdf["col"], cdf["bins"], labels=cdf["labels"])). In the latter case, the argument add_inf=True should be given.

PARAMETER DESCRIPTION
series

series to apply quantization to

TYPE: CSeries

bins

list of integers or int_vec column defining the bin edges

TYPE: int list, CSeries

labels

list of integers or int_vec column defining the bin labels (one more element than bins if add_inf is True or one less otherwise)

TYPE: int list

right

specifies whether bins include their right edges

TYPE: bool DEFAULT: True

add_inf

when set to False, bins should include -np.inf and np.inf; when set to True they are automatically added. Can only be set to True when bins is a CSeries

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
CSeriesFun

representing the result of the quantization
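The half-open bins [bins[0],bins[1]), [bins[1],bins[2]), ... described above can be modelled with the standard bisect module. This is a plain-Python sketch for finite values with bins and labels given in the plain, not crandas code; cut_model is a hypothetical name and the right argument is not modelled.

```python
import math
from bisect import bisect_right


def cut_model(values, bins, labels, add_inf=False):
    """Map each value to the label of the left-inclusive bin that contains it."""
    edges = list(bins)
    if add_inf:
        edges = [-math.inf] + edges + [math.inf]
    if len(labels) != len(edges) - 1:
        raise ValueError("need exactly one label per bin")
    # bisect_right counts the edges <= value, so the bin index is that count minus 1.
    return [labels[bisect_right(edges, value) - 1] for value in values]


print(cut_model([-5, 0, 9, 10], [-math.inf, 0, 10, math.inf], [1, 2, 3]))
```

With add_inf=True the same result is obtained from the inner edges only, e.g. cut_model([-5, 0, 9, 10], [0, 10], [1, 2, 3], add_inf=True).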

demo_table(number_of_rows=1, number_of_columns=1, **query_args)

Create demo table.

Creates a demo table with the given number of rows and columns. The columns are respectively named "col1", "col2", ... and have sequential integer values 1, 2, ...

A nonce is included in the command so that every time this command is called, it receives a fresh table handle.

PARAMETER DESCRIPTION
number_of_rows

Number of rows of resulting table, by default 1

TYPE: int DEFAULT: 1

number_of_columns

Number of columns of resulting table, by default 1

TYPE: int DEFAULT: 1

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

A demo table with a fresh name

get_table(handle_or_name=None, /, *, dummy_handle=None, prod_handle=None, schema=None, check=True, map_dummy_handles=None, session=None, **query_args)

Access a previously uploaded table by its handle or name.

The previously uploaded table is specified using the handle_or_name argument. Alternatively, both dummy_handle and prod_handle can be specified. In that case, no changes have to be made between recording and executing a script.

When get_table is called with a schema argument and/or from a recorded script, the retrieved table is checked against the schema that was specified or that was used when recording. If this schema check fails, a ValueError is raised. This check can be disabled by passing check=False.

Note that a name argument, if given, is interpreted as being part of the standard query arguments, and is thus interpreted as a target name for the result of the get query. Accordingly, get_table("a", name="b") can be used to assign the (additional) symbolic name "b" to the table with name or handle "a".

PARAMETER DESCRIPTION
handle_or_name

Handle (hex-encoded string) or name. Interpreted as a handle if it is a string of 64 uppercase hexadecimal characters, otherwise as a name.

TYPE: (optional, str) DEFAULT: None

dummy_handle

Handle (hex-encoded string) to be used in design. Used when either recording a script or when no script is active.

TYPE: (optional, str) DEFAULT: None

prod_handle

Handle (hex-encoded string) to be used in production. Used when executing an approved script.

TYPE: (optional, str) DEFAULT: None

schema

Represents the structure of the table to be added. Needed if get_table is called from a Transaction, or if it is desired to check that the table corresponds to the given schema, by default None. The schema can be specified as:

  • schema dict (e.g., cdf.schema; see Ctype Schemas)
  • list of column names (e.g., ["col1", "col2"])
  • pandas DataFrame
  • any valid argument to pandas.read_csv
  • CIndex (e.g., cdf.columns)

TYPE: optional DEFAULT: None

check

Enables server-side validation of the object type and schema (if given).

In scripts, if check==True, then the schema of the table is checked against the table that the script was recorded with. To disable this check, use check=False during the script recording.

TYPE: bool DEFAULT: True

map_dummy_handles

Whenever a script is being recorded (see crandas.script), the default behavior is to interpret all calls to get_table(handle) as dummy_for:<handle> table names. This allows the user to use the same handle in both script recording and execution, even though the script recording takes place in a different environment where the real table handle does not exist.

This behavior can be overridden in two levels: for the entire script or for a single call to get_table. For the entire script, mapping dummy handles can be disabled by supplying map_dummy_handles as False in the call to crandas.script.record. For the call to get_table, by specifying this argument as either True or False, the mapping behavior is forced to be either enabled or disabled, regardless of the current script mode.

TYPE: bool DEFAULT: None

query_args

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

The table with handle handle_or_name

RAISES DESCRIPTION
ValueError
  • Schema validation failed
  • Schema not specified (when performed in a transaction)
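The rule for telling handles from names (64 uppercase hexadecimal characters means handle) can be approximated with a regular expression. This is a sketch of the documented rule only; looks_like_handle is a hypothetical helper and the actual check inside crandas may differ in detail.

```python
import re

# 64 characters, each an uppercase hexadecimal digit.
_HANDLE_PATTERN = re.compile(r"[0-9A-F]{64}")


def looks_like_handle(handle_or_name):
    """Return True if the string has the shape of a table handle."""
    return _HANDLE_PATTERN.fullmatch(handle_or_name) is not None


print(looks_like_handle("AB" * 32))   # handle-shaped
print(looks_like_handle("my_table"))  # treated as a name
```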

merge(left, right, how='inner', on=None, left_on=None, right_on=None, validate=None, suffixes=('_x', '_y'), session=None, **query_args)

Merge tables using a database-style join. Implements pandas.merge.

The following types of merge are supported:

  • inner join: returns only the rows where the join columns match
  • outer join: returns rows from both tables, matched where possible, starting with rows of left table in their original order
  • left join: return rows of left table in original order, matched with a row of the right table where possible
  • right join: return rows of right table, matched with a row of the left table where possible

Columns to join on are given either by a common on argument, or separate left_on and right_on arguments for the left and right tables. Depending on the arguments passed, the join columns can contain duplicates:

  • if on/left_on/right_on are column name(s), a one-to-one merge is performed, and an error is given if the merge columns contain duplicates;
  • if a .groupby() result is given as left_on argument, a many-to-one merge is performed where the left table can have duplicates. In this case, the servers learn the number of unique matching keys.
  • if a .groupby() result is given as right_on argument, a one-to-many merge is performed where the right table can have duplicates. In this case, the servers learn the number of unique matching keys.
  • many-to-many merges (where both tables contain duplicate values for the merge columns) are only supported with some leakage of information about the underlying data; see crandas.caution.merge_m2m().

In contrast to pandas, keys with null values do not match with each other and are not included in the result.

PARAMETER DESCRIPTION
left

Left table to be joined

TYPE: DataFrame

right

Right table to be joined

TYPE: DataFrame

how

Type of join

TYPE: "inner" (default), "outer", "left", or "right" DEFAULT: 'inner'

on

Column(s) to join on; must be common to both tables

TYPE: str or list of str DEFAULT: None

left_on

Column(s) of the left table to join on

TYPE: str, list of str, or DataFrameGroupBy DEFAULT: None

right_on

Column(s) of the right table to join on, by default None

TYPE: str, list of str, or DataFrameGroupBy DEFAULT: None

validate

Can be "one_to_one", "1:1", "one_to_many", "1:m", "many_to_one" or "m:1". If given, it is compared against the type of merge derived from the left_on and right_on arguments as discussed above, and an exception is raised if it is incorrect

TYPE: str DEFAULT: None

query_args

engine query arguments

TYPE: (optional, dict) DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

Result of the merging operation

RAISES DESCRIPTION
MergeError

Values of the join columns are not unique

ValueError

Incorrect combination of arguments
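How the merge type follows from the left_on and right_on arguments, and how validate is compared against it, can be sketched in plain Python. This is a model of the rules listed above, not crandas code: the function names and the boolean flags standing in for ".groupby() result given" are hypothetical, and ValueError is a stand-in for the unspecified exception.

```python
VALIDATE_ALIASES = {"1:1": "one_to_one", "1:m": "one_to_many", "m:1": "many_to_one"}


def derive_merge_type(left_is_grouping, right_is_grouping):
    """Derive the merge type from whether each side is given as a grouping."""
    if left_is_grouping and right_is_grouping:
        raise ValueError("many-to-many merges are not covered by this sketch")
    if left_is_grouping:
        return "many_to_one"  # left table may contain duplicate keys
    if right_is_grouping:
        return "one_to_many"  # right table may contain duplicate keys
    return "one_to_one"


def check_validate(validate, left_is_grouping=False, right_is_grouping=False):
    """Compare the requested `validate` value with the derived merge type."""
    actual = derive_merge_type(left_is_grouping, right_is_grouping)
    if validate is not None:
        requested = VALIDATE_ALIASES.get(validate, validate)
        if requested not in VALIDATE_ALIASES.values():
            raise ValueError("unknown validate value")
        if requested != actual:
            raise ValueError("validate does not match the derived merge type")
    return actual


print(check_validate("m:1", left_is_grouping=True))
```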

pandas_dataframe_schema(df, ctype=None, auto_bounds=None, schema=None)

Determine schema for pandas DataFrame

Tries to encode the given data, and returns schema of crandas DataFrame that would result from calling upload_pandas_dataframe() with the passed ctype and schema. See Ctype Schemas.

PARAMETER DESCRIPTION
df

DataFrame from which to generate the schema

TYPE: DataFrame

ctype

DEFAULT: None

auto_bounds

DEFAULT: None

schema

DEFAULT: None

RETURNS DESCRIPTION
dict

schema; see Ctype Schemas

read_csv(file_name, **kwargs)

Upload the given CSV file to the engine

Internally calls pd.read_csv(file_name, ...) to read the CSV file and calls upload_pandas_dataframe() on the resulting DataFrame.

PARAMETER DESCRIPTION
file_name

name of the file

TYPE: str | PathLike

**kwargs

DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

uploaded table

read_csv_schema(file_name, ctype=None, auto_bounds=None, schema=None, **kwargs)

Determine schema for CSV file

Tries to load the CSV and encode it for use by crandas. Returns schema of crandas DataFrame that would result from calling crandas.crandas.read_csv. See Ctype Schemas.

PARAMETER DESCRIPTION
file_name

argument to pd.read_csv

TYPE: str

ctype

TYPE: dict or str DEFAULT: None

auto_bounds

TYPE: bool DEFAULT: None

schema

TYPE: dict DEFAULT: None

RETURNS DESCRIPTION
dict

schema; see Ctype Schemas

read_parquet(file_name, **kwargs)

Upload the given Apache Parquet file to the engine

If file_name is a pyarrow.Table, calls file_name.to_pandas(...) and then applies upload_pandas_dataframe() to the result. Otherwise, calls ParquetFile(file_name, ...) and then applies crandas.crandas.upload_streaming_file to the result. Keyword arguments **kwargs can be provided for any of these called functions.

PARAMETER DESCRIPTION
file_name

name of the file or path to the file to be read

TYPE: str | Path-like object | pyarrow.Table

**kwargs

DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

uploaded table

remove_objects(objects, **query_args)

Remove objects from server

If the list of objects contains a non-existent object, an error is raised. Even so, all existing objects in the list are removed.

PARAMETER DESCRIPTION
objects

Objects to be removed

query_args

TYPE: (optional, dict) DEFAULT: {}

RAISES DESCRIPTION
EngineError

Some of the given objects did not exist

series_max(col1, col2)

Compute the maximum of two CSeries

PARAMETER DESCRIPTION
col1

integer series;

TYPE: CSeries

col2

integer series;

TYPE: CSeries

RETURNS DESCRIPTION
integer CSeries

the maximum of the values of col1 and col2

series_min(col1, col2)

Compute the minimum of two CSeries

PARAMETER DESCRIPTION
col1

integer series;

TYPE: CSeries

col2

integer series;

TYPE: CSeries

RETURNS DESCRIPTION
integer CSeries

the minimum of the values of col1 and col2

upload_pandas_dataframe(df, ctype=None, *, auto_bounds=None, schema=None, **query_args)

Uploads an existing pandas DataFrame into the engine

PARAMETER DESCRIPTION
df

DataFrame to upload

TYPE: DataFrame

ctype

ctypes for columns; see Ctype Schemas

TYPE: dict DEFAULT: None

auto_bounds

TYPE: bool DEFAULT: None

schema

schema for uploaded data; see Ctype Schemas

DEFAULT: None

**query_args

TYPE: query arguments DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

the uploaded DataFrame

upload_streaming_file(file_reader, file_type, ctype=None, session=None, _keep=True, *, auto_bounds=None, schema=None, **query_args)

Uploads a file into the engine. Instead of loading the entire file, a file reader is used to load the table column by column when uploading. Each column is deleted after it has been uploaded, so considerably less memory is used. Currently only works for parquet files.

PARAMETER DESCRIPTION
file_reader

file to upload

TYPE: ParquetFile

file_type

a string describing the type of file of file_reader

TYPE: str

ctype

explicitly given types for columns

DEFAULT: None

auto_bounds

TYPE: bool DEFAULT: None

schema

schema for uploaded data; see Ctype Schemas

DEFAULT: None

RETURNS DESCRIPTION
DataFrame

the uploaded DataFrame

vec_from_columns(series)

Combine columns into a single vector column

Combines multiple numeric columns into a single vector column containing the input columns. For example:

    tab = cd.DataFrame({"a": [1, 2], "b": [3, 4]})
    tab["vec"] = cd.vec_from_columns([tab["a"], tab["b"]])
    tab["vec"].open() == [[1, 3], [2, 4]]

    tab = cd.DataFrame({"a": [1.1, 2.1], "b": [3.2, 4.2]})
    tab["vec"] = cd.vec_from_columns([tab["a"], tab["b"]])
    tab["vec"].open() == [[1.1, 3.2], [2.1, 4.2]]

    tab = cd.DataFrame({"a": [1.1, 2.1], "b": [3, 4]})
    tab["vec"] = cd.vec_from_columns([tab["a"], tab["b"]])
    tab["vec"].open() == [[1.1, 3.0], [2.1, 4.0]]

PARAMETER DESCRIPTION
series

List of non-vector numeric series.

TYPE: list of crandas.crandas.CSeries

RETURNS DESCRIPTION
CSeriesFun

Series containing the vector column.

when(*predicates, **constraints)

Polars-style when-then-otherwise expression

The expression cd.when(<condition>).then(<value-if-condition>).otherwise(<value-otherwise>) returns a CSeries for assigning, filtering, etc, allowing to select values in a convenient way.

For example: cdf = cdf.assign(val = cd.when(cdf["x"]>0).then(cdf["x"]).otherwise(-cdf["x"])) selects cdf["x"] if cdf["x"]>0 and -cdf["x"] otherwise, i.e., computes the absolute value.

If multiple predicates are given, they all need to be satisfied. Constraints represent equality checks, e.g., cd.when(x=0) is equivalent to cd.when(x==0). Multiple .when().then() statements can be chained, optionally followed by an .otherwise(<value if all conditions are false>) statement. For example, consider cd.when(<condition1>).then(<value1>).when(<condition2>).then(<value2>).otherwise(<value3>). In this case, if <condition1> holds, then <value1> is selected. If <condition1> does not hold but <condition2> does, then <value2> is selected. If neither condition holds, <value3> is selected. If no .otherwise(...) is given, None is assumed.
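The order in which chained .when().then() branches are tried can be modelled in plain Python. This sketch evaluates one row at a time with ordinary callables and is not crandas code; when_chain is a hypothetical helper that mirrors the first-match-wins rule and the None default described above.

```python
def when_chain(branches, otherwise=None):
    """Build a row function from (condition, value) pairs; the first match wins."""

    def evaluate(row):
        for condition, value in branches:
            if condition(row):
                return value(row)
        # Without an otherwise branch, None is assumed.
        return None if otherwise is None else otherwise(row)

    return evaluate


# Absolute value, as in the example above.
abs_value = when_chain(
    [(lambda row: row["x"] > 0, lambda row: row["x"])],
    otherwise=lambda row: -row["x"],
)
print([abs_value({"x": value}) for value in (3, -4, 0)])
```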