crandas.crandas

Main crandas functionality: dataframes (CDataFrame), series (CSeries), and analysis operations (e.g., merge())

class crandas.crandas.CDataFrame(columns=None, *, schema_target=None, **kwargs)

Bases: StateObject

Dataframe stored in the engine

CDataFrame provides access to tables stored in the engine using an API modeled on the pandas DataFrame. A CDataFrame may be obtained in one of the following ways:

  • By uploading data into the engine using read_csv()/read_parquet() or upload_pandas_dataframe()

  • By accessing an earlier uploaded table using get_table()

To see detailed information about the columns of a CDataFrame, use the .columns attribute.

__getitem__(key)

Implements df[key]

  • If key is a CSeries or a function, call CDataFrame.filter.

  • If key is a list, call CDataFrame.project

  • If key is a str, return a CSeries representing the column with the given name

  • If key is a slice, call CDataFrame.slice

Raises:

TypeError – the key must be one of the accepted types
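The dispatch rules above can be sketched in plain Python. This is only a model of the rules as listed, not crandas code: ToyFrame and ToySeries are hypothetical stand-ins, and each branch returns the name of the operation that df[key] would call.

```python
class ToySeries:
    """Hypothetical stand-in for a CSeries used as a filter criterion."""


class ToyFrame:
    """Hypothetical stand-in that reports which operation df[key] would call."""

    def __getitem__(self, key):
        if isinstance(key, ToySeries) or callable(key):
            return "filter"      # CSeries or function -> filter
        if isinstance(key, list):
            return "project"     # list of column names -> project
        if isinstance(key, str):
            return "column"      # single name -> column as a series
        if isinstance(key, slice):
            return "slice"       # slice -> selection of rows
        raise TypeError(f"unsupported key type: {type(key).__name__}")


df = ToyFrame()
calls = [
    df[ToySeries()],
    df[lambda x: x.col1 == 1],
    df[["a", "b"]],
    df["a"],
    df[0:10],
]
```

Any other key type, such as a float, falls through to the TypeError branch.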

add_prefix(prefix)

Implements pandas.DataFrame.add_prefix

Parameters:

prefix (str) – prefix to be added

Returns:

Copy of CDataFrame where prefix is added to all column names

Return type:

CDataFrame

add_suffix(suffix)

Implements pandas.DataFrame.add_suffix

Parameters:

suffix (str) – suffix to be added

Returns:

Copy of CDataFrame where suffix is added to all column names

Return type:

CDataFrame

append(other, ignore_index=True)

Implements pandas.DataFrame.append by calling crandas.concat. TODO: To be deprecated.

Parameters:
  • other (DataFrame or CDataFrame) – The data to append.

  • ignore_index (bool, optional) – If True, the resulting axis will be labeled 0, 1, …, n - 1. Currently only True is allowed. By default True

Returns:

The concatenated table

Return type:

CDataFrame

assign(query_args=None, **assignments)

Implements pandas.DataFrame.assign. Assigns new columns to a CDataFrame, and outputs a new CDataFrame with the new columns.

Assigned values need to be CSeries (or a callable providing a CSeries); assignment of a clear Series/scalar/array is not supported.

To pass query arguments such as the target table name, specify them as values of the query_args dict, e.g.:

cdf.assign(newcol=1, query_args={"name": "tablename"})

To create a column whose name could be an engine query argument, add a query_args argument to disambiguate, e.g.:

cdf.assign(name="Column value", query_args={})

astype(ctype=None, *, schema=None, validate=False, **query_args)

Converts CDataFrame to specified ctype or schema

If a ctype is given, convert the respective columns to the respective types, keeping other columns intact. If a schema is given, ensure that the resulting table conforms to the schema in terms of columns, order, and types (dropping columns from the table if they do not occur in the schema).

Parameters:
  • ctype (dict or str, optional) – ctypes for columns; see ctypes_schemas, by default None

  • schema (dict, optional) – schema for uploaded data; see ctypes_schemas, by default None

  • validate (bool, optional) – whether to validate the conversion, by default False

  • query_args (See queryargs)

Returns:

CDataFrame with the set ctypes and columns

Return type:

CDataFrame

Raises:
  • ValueError – Provide ctype or schema, but not both

  • ValueError – Need to specify ctype or schema
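The argument check and the schema behaviour described above can be modelled in plain Python. This is a sketch under stated assumptions, not crandas code: tables are dicts mapping column names to lists, type conversion is left out, and the helper names check_astype_args and conform_columns are hypothetical.

```python
def check_astype_args(ctype=None, schema=None):
    """Reject calls that give both or neither of ctype and schema."""
    if ctype is not None and schema is not None:
        raise ValueError("Provide ctype or schema, but not both")
    if ctype is None and schema is None:
        raise ValueError("Need to specify ctype or schema")


def conform_columns(table, schema):
    """Keep exactly the schema's columns, in the schema's order.

    Assumes every schema column exists in the table; columns that are not
    in the schema are dropped.
    """
    return {name: table[name] for name in schema}


table = {"id": [1, 2], "note": ["x", "y"], "score": [10, 20]}
conformed = conform_columns(table, {"score": "int", "id": "int"})
```

Here the note column is dropped and the remaining columns follow the order given by the schema.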

clone()

Creates a clone of the StateObject. Only needed for objects that can be opened

The clone is used to open the object: when the object is opened, essentially, self.clone().json_to_opened(…) is called.

columns

Columns of the CDataFrame; see crandas.crandas.CIndex

property ctype

Ctype for the table; see ctypes_schemas

describe()

Generate descriptive statistics

dropna(axis=0, **query_args)

Remove null values from a CDataFrame. The axis determines whether rows (0) or columns (1) are removed; only row removal is implemented. Returns a CDataFrame with no nullable columns or values.

Parameters:
  • axis (int/string, optional) – determines whether rows (0) or columns (1) are deleted, by default 0. Only row method is allowed

  • query_args – See queryargs

Returns:

The original CDataFrame with all rows or columns (depending on axis) with a null value removed

Return type:

CDataFrame

Raises:
  • NotImplementedError – Dropping columns is not implemented for structural reasons

  • ValueError – Error whenever a wrong axis is entered

fillna(value, **query_args)

Fill NULL values

Replace NULL values of nullable columns as specified by the value argument. The resulting columns are not nullable anymore.

If value is a valid argument to CSeries.fillna (e.g., an integer, a column, or a function), then this argument is applied to fill in NULLs in all nullable columns of the table (and so value needs to be of the correct type for all nullable columns).

If value is a dict, then the respective dictionary values are provided (as above) to fill in NULLs for the respective column.

If value is a CDataFrame, then the respective columns of value are used to fill in NULLs for the corresponding columns of self. This is done only for columns that occur in both CDataFrames.

Parameters:
  • value (value/dict/CDataFrame) – Values to fill in for NULLs (see above)

  • query_args – See queryargs

Returns:

Copy of table with NULLs replaced as indicated

Return type:

CDataFrame
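The three forms of value can be modelled in plain Python. This is a sketch, not crandas code: a table is a dict mapping column names to lists, None stands in for NULL, and the helper names are hypothetical. In this model a dict of fills and another table are handled by the same branch, because both map column names to fills. Filling by a function is not modelled.

```python
def fill_column(values, fill):
    """Replace None by a scalar, or by the matching entry of a fill column."""
    if isinstance(fill, list):
        return [f if v is None else v for v, f in zip(values, fill)]
    return [fill if v is None else v for v in values]


def fillna(table, value):
    """Fill NULLs in every column, or only in the columns named by `value`."""
    result = {}
    for name, values in table.items():
        if isinstance(value, dict):
            # Per-column fills (dict) or another table: only shared columns are filled
            result[name] = fill_column(values, value[name]) if name in value else list(values)
        else:
            # A single fill value is applied to all columns
            result[name] = fill_column(values, value)
    return result


table = {"a": [1, None, 3], "b": [None, 5, 6]}
filled_all = fillna(table, 0)
filled_a = fillna(table, {"a": 9})
filled_from_table = fillna(table, {"b": [7, 8, 9]})
```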

filter(key, threshold: MaybePlaceholder[Annotated[int, Ge(ge=0)]] | None = None, **query_args)

Filter table

Returns table with all rows of the original table satisfying the criterion represented by key.

key can be a CSeries representing a table column or a computation on table row(s). In this case, the CSeries values need to be 1 (indicating that the corresponding row will be selected) or 0 (indicating that the row will not be selected).

If the CSeries used for indexing has a threshold (see CSeries.with_threshold), the filtered result is only returned if it has the minimum number of rows as indicated by the threshold.

Alternatively, key can be a function to be applied to the table columns. The function is called with one argument representing the table, of which the fields correspond to the columns. E.g., the key lambda x: x.col1==1 represents the function that checks whether the value of the column named col1 equals one.

See function_to_json for more information.

Parameters:
  • key (CSeries or callable) – Filter criterion

  • threshold (int, non-negative, optional) – If given, sets a minimum number of rows that the resulting table needs to have; otherwise, the server returns an error. Equivalent to calling filter with key.with_threshold(threshold)

  • query_args – See queryargs

Returns:

The filtered table

Return type:

CDataFrame
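The combination of a 0/1 criterion and a threshold can be modelled in plain Python. This is a sketch, not crandas code: filter_rows is a hypothetical helper, rows are plain tuples, and RuntimeError stands in for the error returned by the server. It assumes that a result with exactly threshold rows is still allowed.

```python
def filter_rows(rows, mask, threshold=None):
    """Keep rows whose mask bit is 1; fail if fewer than `threshold` remain."""
    if any(bit not in (0, 1) for bit in mask):
        raise ValueError("filter criterion must consist of 0s and 1s")
    selected = [row for row, bit in zip(rows, mask) if bit == 1]
    if threshold is not None and len(selected) < threshold:
        # Stand-in for the server error; the filtered rows are not returned
        raise RuntimeError("filtered result has fewer rows than the threshold")
    return selected


rows = [("a", 1), ("b", 2), ("c", 3)]
kept = filter_rows(rows, [1, 0, 1], threshold=2)
```

With threshold=3 the same call fails instead of returning the two selected rows.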

groupby(cols, *, dropna=True, bypass_same_value_check=False, **query_args)

Computes a grouping of the table by the values of (a) given column(s). Returns a grouping object that can be used in aggregation (see CSeriesGroupBy) or as an argument to crandas.merge().

Parameters:
  • cols (str iterable or str or groupby.CDataFrameGroupBy) – If a string iterable is given, use these as the name of the columns to group by. If a string is given, use this as the name of the single column to group by. If a groupby.CDataFrameGroupBy object (i.e., a result of an earlier groupby operation) is given, re-use that grouping as a grouping of the current table. (The current table needs to have a column with the given name and the same values as the table for which the grouping was originally made.)

  • dropna (bool, default: True) – If True, any rows with null values will be dropped. If False, null values will be treated as a separate key in groups. Currently, only False is supported when using nullable columns.

  • bypass_same_value_check (bool, default: False) – Unless set, if the given cols is a groupby.CDataFrameGroupBy instance made from another table, it is checked whether the grouping columns are the same for the current and the other table. Bypassing this check may avoid problems where the check cannot be performed for large values.

  • query_args – See queryargs

Returns:

Grouping object

Return type:

CDataFrameGroupBy

max(axis=0)

Computes the maximum of each (numeric) column

Parameters:

axis (int, optional) – Which axis of the dataframe, only 0 is implemented, by default 0

min(axis=0)

Computes the minimum of each (numeric) column

Parameters:

axis (int, optional) – Which axis of the dataframe, only 0 is implemented, by default 0

nrows

Number of rows of the CDataFrame

open_dry_run_result()

Return opened version of the object when run in dry-run mode

Should return an object of the same type as json_to_opened.

project(cols, **query_args)

Project table

Returns table with same rows but a selection of columns

Parameters:
  • cols (list of str | str) – Columns to select. Can be empty. Columns can occur multiple times.

  • query_args – See queryargs

Returns:

The projected table

Return type:

CDataFrame

rename(columns, **query_args)

Implements pandas.DataFrame.rename. Only renaming of columns via the columns argument is supported.

Parameters:
  • columns (dict) – dictionary of columns to be renamed of the form {"oldname": "newname"}

  • query_args – See queryargs

Returns:

CDataFrame with updated column names

Return type:

CDataFrame

sample(*, n=None, frac=None, random_state=None, **query_args)

Samples rows from the dataframe.

The number of rows can be specified either as an integer n or a fraction frac. The case frac==1 corresponds to returning a shuffling of the table and is equivalent to CDataFrame.shuffle.

If a random_state is given, the sampling is performed in a deterministic way and according to a public selection (i.e., known to the servers and predictable to the client); otherwise, the sampling is non-deterministic and private (not known to the client and servers). See also CDataFrame.shuffle.

Parameters:
  • n (integer, default: None) – Number of rows to sample

  • frac (floating-point, default: None) – Proportion of rows (between 0.0 and 1.0, inclusive) to sample

  • random_state (long integer, default: None) – Seed for deterministic sampling (otherwise is non-deterministic)

  • query_args – See queryargs

Returns:

ret – Copy of the table with rows sampled

Return type:

CDataFrame
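The interplay of n, frac and random_state can be modelled with the standard library. This is a sketch, not crandas code: sample_rows is a hypothetical helper working on a plain list. Two points are assumptions: that exactly one of n and frac must be given, and that frac is turned into a row count by truncation. The privacy properties of the engine's sampling are not modelled.

```python
import random


def sample_rows(rows, *, n=None, frac=None, random_state=None):
    """Sample rows without replacement; a seed makes the choice reproducible."""
    if (n is None) == (frac is None):
        raise ValueError("specify exactly one of n and frac")  # assumption
    if frac is not None:
        if not 0.0 <= frac <= 1.0:
            raise ValueError("frac must be between 0.0 and 1.0")
        n = int(frac * len(rows))  # assumption: truncate to a whole row count
    rng = random.Random(random_state)  # None -> non-deterministic seed
    return rng.sample(rows, n)


rows = list(range(10))
first = sample_rows(rows, n=3, random_state=42)
second = sample_rows(rows, n=3, random_state=42)
shuffled = sample_rows(rows, frac=1.0, random_state=1)
```

With the same seed, first and second are identical, and frac=1.0 yields a permutation of all rows, which matches the description of a shuffle.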

property schema

Schema for the table; see ctypes_schemas

shuffle(*, random_state=None, **query_args)

Return table with rows shuffled. If a random_state is given, the shuffle is deterministic and performed according to a public permutation (i.e., known to the servers and predictable to the client); otherwise, the shuffle is non-deterministic and private (not known to the client and servers).

Parameters:
  • random_state (long integer, default: None) – Seed for deterministic shuffle (otherwise is non-deterministic)

  • query_args – See queryargs

Returns:

ret – Copy of the table with rows shuffled

Return type:

CDataFrame

slice(key, **query_args)

Slice table

Returns table with same columns but a selection of rows

Parameters:
  • key (slice) – Python slice object representing rows to select

  • query_args – See queryargs

Returns:

The sliced table

Return type:

CDataFrame

sort_values(by, **query_args)

Sorts the dataframe according to the values in the column by. Currently, sorting on strings is not supported.

Parameters:
  • by (str) – The column to sort on

  • query_args – See queryargs

Returns:

ret – Copy of the sorted table

Return type:

CDataFrame

validate(*validations, **query_args)

Applies input validation to the table.

Input validation leads to a table that has the validations as constraints on the respective columns (e.g., checking that a column contains values in [0,2] leads to a column with values constrained to that domain). These constraints can be inspected by accessing tab.columns.cols[i].constraints.

Validations are instances of the crandas.Validation class and can be set by calling validation functions such as crandas.CSeriesColRef.in_range() and crandas.CSeriesColRef.sum_in_range().

Parameters:
  • *validations (list of crandas.Validation objects) – Validations to apply to the table

  • query_args – See queryargs

Returns:

If all validations have succeeded: copy of the table having the validations as constraints

Return type:

CDataFrame

class crandas.crandas.CIndex(cols, **kwargs)

Bases: object

Index (set of columns) of a CDataFrame

For a regular CDataFrame, this represents the columns (name and type) of the CDataFrame.

For a deferred CDataFrame (in a transaction, or resulting from a dry run), this represents the columns (name and type) that the result of an operation is expected to have based on its inputs. For such an expected column, the name is set, but the type and size (“elements per value”) may be undefined.

__eq__(other)

Checks equality with input

__getitem__(ix)

Returns name of column ix

__len__()

Returns number of columns

property ctype

Ctype corresponding to the CIndex; see ctypes_schemas

get_loc(name)

Get integer location for requested label

Parameters:

name (str) – column name label

Returns:

index of column with name name

Return type:

int

Raises:

KeyError – value not found

matches_template(expected)

Checks whether the number and names of columns fit a template

property schema

Schema corresponding to the CIndex; see ctypes_schemas

to_dict()

Returns column names in dictionary form

class crandas.crandas.CSeries(**kwargs)

Bases: Summable

One dimensional array which represents either the column of a CDataFrame or the result of applying a rowwise function to one or more columns of a CDataFrame

class DT(outer_instance)

Bases: object

Used to retrieve date units in the pandas way

all(*, mode='open', **query_args)

Computes whether all values of the boolean series are true

See sum() for a description of the arguments.

any(*, mode='open', **query_args)

Computes whether any value of the boolean series is true

See sum() for a description of the arguments.

as_series(**query_args)

Return crandas.crandas.ReturnValue representing the series

as_table(*, column_name='', **query_args)

Outputs a CDataFrame that has the CSeries as its only column

Parameters:
  • column_name (str, optional) – name for the column in the resulting CDataFrame

  • query_args – See queryargs

Returns:

CDataFrame having the expected CSeries as its only column

Return type:

CDataFrame

as_value(**query_args)

Interpret single-row CSeries as value

Interpret single-row CSeries as a constant. For example, without using as_value, a["col2"]+b["col2"] performs row-wise addition. With as_value, a single-row CSeries is interpreted as a single value instead of as a column, e.g., a["col2"]+b["col2"].as_value() adds the value in the single row of column col2 of table b to each row of a["col2"].

This function can be used to work with values that remain secret to the servers that perform the computation, e.g.:

data = cd.DataFrame({"a": [1, 2, 3]})

# The value "1" is a part of the function definition and so becomes known
# to the servers
data[data["a"] == 1]

# The value "1" is derived from a single-row column of a private table and
# so remains hidden to the servers
filtervalue = cd.DataFrame({"filtervalue": [1]})["filtervalue"].as_value()
data[data["a"] == filtervalue]

Parameters:

query_args – See queryargs

Returns:

ReturnValue representing the value of the (only row of the) CSeries

Return type:

ReturnValue
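The difference between row-wise addition and adding a single value can be modelled on plain lists. This is a sketch, not crandas code: the helper names are hypothetical, and the secrecy of the value towards the servers is not modelled.

```python
def add_rowwise(left, right):
    """Add two columns of the same length element by element."""
    if len(left) != len(right):
        raise ValueError("columns must have the same number of rows")
    return [l + r for l, r in zip(left, right)]


def as_value(column):
    """Interpret a single-row column as one value."""
    if len(column) != 1:
        raise ValueError("as_value needs a column with exactly one row")
    return column[0]


def add_value(column, value):
    """Add one value to every row of a column."""
    return [x + value for x in column]


a_col2 = [10, 20, 30]
rowwise = add_rowwise(a_col2, [1, 2, 3])        # one addend per row
broadcast = add_value(a_col2, as_value([5]))    # same addend for every row
```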

astype(ctype, validate=False)

Converts output to a specific type

Parameters:
  • ctype (Ctypes type specification) – Type to convert to. See Data types

  • validate (bool, default False) – If set, validate that the resulting column is of the correct type, e.g., is an 8-bit integer when ctype is uint8.

Returns:

CSeries converted to given type

Return type:

CSeries

Raises:

EngineError – Conversion failed or not supported

b64decode()

Decodes a Base64-encoded string as a bytes column

b64encode()

Encodes a bytes column as a Base64-encoded string

bytes_to_hex()

Converts a bytes column to a lowercase hex string (like “3ec9”)

capitalize()

Returns string with the first character converted to uppercase and all other ones to lowercase

contains(other)

Substring search

Searches for substring in the column

Parameters:

other (crandas.CSeries) – Substring to search for

Returns:

Result of search: 1 if substring is found, 0 otherwise

Return type:

CSeriesFun

count(*, mode='open', **query_args)

Computes the number of not-NULL elements of the series

See sum() for a description of the arguments.

day()

Returns the day of the month

Returns:

day of the month

Return type:

int

day_of_year()

Returns the day of the year

Returns:

Number representing the day of the year

Return type:

int

dayofyear()

Returns the day of the year

Returns:

Number representing the day of the year

Return type:

int

encode()

Encodes an ASCII varchar column as a bytes column

fillna(nullval)

Replaces NULL values in the column by nullval

Parameters:

nullval (rowwise function) – Value to replace NULLs by

fullmatch(pattern, *args)

Regular expression matching

Matches column to a regular expression.

Parameters:
  • pattern (re.Re) – Regular expression to match

  • args (list of crandas.CSeries) – Additional columns for the match (can be referred to by (?1) to (?9) in the regular expression, e.g., r".*(?1).*")

Returns:

Column containing the result of the matching: 1 if there is a match and 0 otherwise.

Return type:

CSeriesFun

get(*, name='', **query_args)

Deprecated. Use CSeries.as_table() instead.

hex_to_bytes()

Converts a lowercase hex string (like “3ec9”) to a bytes column
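For a single value, the hex and Base64 conversions correspond to the following standard-library operations. This only illustrates the encodings on one plain Python value; the crandas methods apply them to every element of a column in the engine.

```python
import base64

raw = bytes([0x3E, 0xC9])

hex_text = raw.hex()                    # lowercase hex string, as in bytes_to_hex
from_hex = bytes.fromhex(hex_text)      # back to bytes, as in hex_to_bytes

b64_text = base64.b64encode(raw).decode("ascii")   # as in b64encode
from_b64 = base64.b64decode(b64_text)              # as in b64decode

ascii_bytes = "abc".encode("ascii")     # ASCII text to bytes, as in encode
```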

if_else(ifval, elseval)

Selects between two values per row, like an if-else statement. Here self is the guard and has to be a column of bits: for rows where self has the value one, the value from ifval is selected; for rows where self has the value zero, the value from elseval is selected.

Parameters:
  • ifval (int) – Value if true

  • elseval (int) – Value otherwise
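The selection rule can be modelled on a plain list of bits. This is a sketch, not crandas code: if_else here is a hypothetical free function, with the guard passed as its first argument in place of self.

```python
def if_else(guard, ifval, elseval):
    """Pick ifval where the guard bit is 1 and elseval where it is 0."""
    if any(bit not in (0, 1) for bit in guard):
        raise ValueError("guard must be a column of bits")
    return [ifval if bit == 1 else elseval for bit in guard]


selected = if_else([1, 0, 0, 1], 100, -1)
```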

inner(other)

Inner product of two vectors

isna()

Returns whether respective values are NULL, boolean inverse of notna

isnull()

Returns whether respective values are NULL, boolean inverse of notna

len()

Returns the character length of each element of the CSeries (only works for CSeries of type string or bytes)

Returns:

CSeries of character lengths (for string) or number of bytes (for bytes)

Return type:

CSeriesFun

lower(indices=None)

Returns string values in lowercase

Parameters:

indices (int list, optional) – Represents the letters to be modified, if None then the whole string is modified

Returns:

CSeries with lowercase strings

Return type:

CSeries

Raises:

ValueError – Invalid index for string length

max(*, mode='open', **query_args)

Computes the maximum of the series

See sum() for a description of the arguments.

mean(*, mode='open', **query_args)

Computes the mean of the elements of the series.

See sum() for a description of the arguments.

min(*, mode='open', **query_args)

Computes the minimum of the series

See sum() for a description of the arguments.

month()

Returns the month in number format

Returns:

month

Return type:

int

notna()

Returns whether respective values are not NULL, boolean inverse of isna

notnull()

Alias for notna

open()

Returns column in opened form

sqrt()

Returns the element-wise square root of this column, which should be numeric.

Note: in case there are negative numbers in the column, this function will throw an error.

Returns:

CSeries of element-wise square roots

Return type:

CSeriesFun

strip()

Returns stripped string values

substitute(sub_dict, output_size=None)

Performs string substitution in a string column. Inputs are provided in a dictionary of the form “a”: [“á”, “à”, “ä”] where the characters in the list [“á”, “à”, “ä”] will be substituted by the key “a”. Substitution of substrings of more than one character is not currently supported.

Parameters:
  • sub_dict (str Dictionary) – Dictionary where each key is the string (may be more than one character) to be added and each value is a list of characters to be substituted by the key

  • output_size (int, optional) – new max string length of the column, necessary when substituting a character for multiple ones, by default None

Returns:

CSeries with modified strings

Return type:

CSeries

Raises:
  • TypeError – Values of sub_dict in substitute need to be a list of characters

  • TypeError – Every character should match at most one substitution.
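The shape of sub_dict and the two error conditions can be modelled with str.translate. This is a sketch, not crandas code: substitute here is a hypothetical free function on a list of Python strings, and output_size is not modelled. The accented characters are written as escapes.

```python
def substitute(strings, sub_dict):
    """Replace each listed character by the key it is listed under."""
    mapping = {}
    for replacement, chars in sub_dict.items():
        if not isinstance(chars, list) or any(
            not isinstance(c, str) or len(c) != 1 for c in chars
        ):
            raise TypeError("Values of sub_dict need to be a list of characters")
        for c in chars:
            if c in mapping:
                raise TypeError("Every character should match at most one substitution")
            mapping[c] = replacement
    table = str.maketrans(mapping)
    return [s.translate(table) for s in strings]


# "a" replaces a-acute, a-grave and a-umlaut; "e" replaces e-acute
accents = {"a": ["\u00e1", "\u00e0", "\u00e4"], "e": ["\u00e9"]}
cleaned = substitute(["d\u00e9j\u00e0", "l\u00e0"], accents)

# A key may be longer than one character, which lengthens the string
expanded = substitute(["\u00e4"], {"ae": ["\u00e4"]})
```

The second call shows why a larger maximum string length can be needed: one character is replaced by two.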

sum(*, mode='open', **query_args)

Computes the sum of the elements of the series.

Parameters:
  • mode (str) – mode in which to perform queries that return objects (“open” / “defer” / “regular”), by default “open”

  • query_args – See queryargs

Returns:

Result of applicable type, depending on as_table and mode

Return type:

int/Deferred/DataFrame/CDataFrame

sum_squares(*, mode='open', **query_args)

Computes the sum of squares of the elements of the series.

See sum() for a description of the arguments.

upper(indices=None)

Returns string values in uppercase

Parameters:

indices (int list, optional) – Represents the letters to be modified, if None then the whole string is modified

Returns:

CSeries with uppercase strings

Return type:

CSeries

Raises:

ValueError – Invalid index for string length

var(*, mode='open', **query_args)

Computes the variance of the series.

See sum() for a description of the arguments.

vsum()

Sum the elements of a vector

weekday()

Returns the day of the week, where Monday is 0

Returns:

Number representing the day of the week

Return type:

int
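The weekday numbering with Monday as 0 is the same convention as Python's date.weekday(). The sketch below shows it on a single standard-library date; it assumes that the day of the year is counted from 1, as in pandas, which this reference does not state.

```python
from datetime import date

d = date(2024, 3, 1)              # a Friday in a leap year

weekday = d.weekday()             # Monday is 0, so Friday is 4
day_of_year = d.timetuple().tm_yday   # 31 (Jan) + 29 (Feb) + 1
```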

with_threshold(threshold: MaybePlaceholder[Annotated[int, Ge(ge=0)]] | None)

Adds a threshold to the CSeries. When the column is used as a filtering column or in an aggregation operation, this threshold indicates the minimum number of items that need to be in the filtering result or have the aggregation taken over.

Parameters:

threshold (int, non-negative, optional) – minimum number of elements for operation to be allowed

year()

Returns the year

Returns:

year in 4 digits

Return type:

int

class crandas.crandas.CSeriesColRef(table, name, **kwargs)

Bases: CSeries

Column of CDataFrame

Subclass of CSeries. Represents a column of a CDataFrame df as accessed via df["colname"] or lambda x: x.colname.

all(*, as_table=False, threshold: MaybePlaceholder[Annotated[int, Ge(ge=0)]] | None = None, **query_args)

Computes whether all values of the boolean series are true

See sum() for a description of the arguments.

any(*, as_table=False, threshold: MaybePlaceholder[Annotated[int, Ge(ge=0)]] | None = None, **query_args)

Computes whether any value of the boolean series is true

See sum() for a description of the arguments.

as_table(*, column_name='', **query_args)

Outputs a CDataFrame that has the CSeries as its only column

Parameters:
  • column_name (str, optional) – name for the column in the resulting CDataFrame

  • query_args – See queryargs

Returns:

CDataFrame having the expected CSeries as its only column

Return type:

CDataFrame

count(*, as_table=False, threshold: MaybePlaceholder[Annotated[int, Ge(ge=0)]] | None = None, **query_args)

Computes the number of not-NULL elements of the series

See sum() for a description of the arguments.

property ctype

Ctype for the column; see ctypes_schemas

get(*, name='', **query_args)

Deprecated. Use CSeriesColRef.as_table() instead.

in_range(minval, maxval)

Validation that column values lie in specified range

Applies to numerical/numerical vector columns only

Parameters:
  • minval (int) – minimum (inclusive)

  • maxval (int) – maximum (inclusive)

Returns:

Validator for use in CDataFrame.validate

Return type:

Validation
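The inclusive range check can be modelled in plain Python. This is a sketch, not crandas code: in_range_check is a hypothetical helper on a list of integers, and the returned dict only loosely follows the bounds key of json_desc described under crandas.crandas.Validation.

```python
def in_range_check(values, minval, maxval):
    """Return a bounds description if every value lies in [minval, maxval]."""
    out_of_range = [v for v in values if not (minval <= v <= maxval)]
    if out_of_range:
        raise ValueError(f"values outside [{minval}, {maxval}]: {out_of_range}")
    # Bounds are kept as strings, as described for json_desc
    return {"bounds": [str(minval), str(maxval)]}


constraint = in_range_check([0, 1, 2], 0, 2)
```

Both endpoints are accepted, since the minimum and maximum are inclusive.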

max(*, as_table=False, threshold: MaybePlaceholder[Annotated[int, Ge(ge=0)]] | None = None, **query_args)

Computes the maximum of the series

See sum() for a description of the arguments.

mean(*, as_table=False, threshold: MaybePlaceholder[Annotated[int, Ge(ge=0)]] | None = None, **query_args)

Computes the mean of the elements of the series.

See sum() for a description of the arguments.

min(*, as_table=False, threshold: MaybePlaceholder[Annotated[int, Ge(ge=0)]] | None = None, **query_args)

Computes the minimum of the series

See sum() for a description of the arguments.

property schema

Schema for the column; see ctypes_schemas

sum(*, as_table=False, threshold: MaybePlaceholder[Annotated[int, Ge(ge=0)]] | None = None, **query_args)

Computes the sum of the elements of the series

Parameters:
  • as_table (boolean, default: False) – if True, result is returned as DataFrame instead of value

  • threshold (int, non-negative, optional) – if given, only return value as long as the number of not-NULL elements is above the minimum threshold of elements for the operation

  • query_args – See queryargs

Returns:

Result of applicable type, depending on as_table and mode

Return type:

int/Deferred/DataFrame/CDataFrame
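The effect of threshold on an aggregation can be modelled in plain Python. This is a sketch, not crandas code: thresholded_sum is a hypothetical helper, None stands in for NULL, and RuntimeError stands in for the engine's error. It assumes that NULLs are skipped by the sum and that exactly threshold not-NULL elements are enough.

```python
def thresholded_sum(values, threshold=None):
    """Sum the not-NULL values, refusing if too few of them are present."""
    present = [v for v in values if v is not None]
    if threshold is not None and len(present) < threshold:
        # Stand-in for the engine error; no partial result is returned
        raise RuntimeError("too few not-NULL elements for this aggregation")
    return sum(present)


total = thresholded_sum([1, None, 3], threshold=2)
```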

sum_in_range(minval, maxval)

Validation that sum of column values lies in specified range

Applies to integer/integer vector columns only

Parameters:
  • minval (int) – minimum (inclusive)

  • maxval (int) – maximum (inclusive)

Returns:

Validator for use in CDataFrame.validate

Return type:

Validation

sum_squares(*, as_table=False, threshold: MaybePlaceholder[Annotated[int, Ge(ge=0)]] | None = None, **query_args)

Computes the sum of squares of the elements of the series.

See sum() for a description of the arguments.

var(*, as_table=False, threshold: MaybePlaceholder[Annotated[int, Ge(ge=0)]] | None = None, **query_args)

Computes the variance of the series.

See sum() for a description of the arguments.

class crandas.crandas.CSeriesFun(op, vals, args={}, **kwargs)

Bases: CSeries

Subclass of CSeries representing the result of applying a function to one or more CSeries

class crandas.crandas.Col(name, type='?', elperv=-1, nullable=False, constraints=None, modulus=None, _ctype=None, _schema=None, **kwargs)

Bases: object

Represents the type of a column.

The type and elperv fields can be equal to “?” and -1, respectively, to indicate that these are not known (e.g., for columns in an expected specification to vdl_query).

__eq__(other)

Checks structural equality between columns

__repr__()

Returns printable representation

property ctype

Ctype for the column; see ctypes_schemas

renamed(name)

Return copy of the Col with a different name

property schema

Schema for the column; see ctypes_schemas

exception crandas.crandas.CtypeSpuriousColumnsWarning(*, columns, **kwargs)

Bases: UserWarning

Warning that ctype contains spurious columns

For example, below, the ctype has a spurious column b:

cd.DataFrame({"a": [1]}, ctype={"a": "int", "b": "varchar"})

crandas.crandas.DataFrame(data=None, *args, ctype=None, schema=None, auto_bounds=None, **kwargs)

Creates a crandas dataframe.

This function calls the pandas DataFrame constructor pd.DataFrame(data, *args, ...), and uploads the resulting table using upload_pandas_dataframe().

Further arguments apart from data are passed to pd.DataFrame or upload_pandas_dataframe as appropriate. In particular, name can be used to specify a name for the table, and auto_bounds can be used to disable warnings about automatically derived column bounds; see queryargs. Further, the ctype and schema arguments can be used to define columns and their types; see upload_pandas_dataframe().

When specifying data as a dict of Series arguments, it is possible to use crandas.Series() instead of pd.Series. This makes it possible to specify the crandas ctype directly with the data.

Parameters:
  • data (any, default: None) –

    Contents of the dataframe, passed on to pd.DataFrame.

    If data is None and schema is given, an empty dataframe according to the given schema is created. Otherwise, an empty dataframe with no columns is created.

  • ctype (dict or str, default {}) – ctypes for columns; see ctypes_schemas

  • schema (dict, default None) – schema for uploaded data; see ctypes_schemas

  • auto_bounds (bool, optional) – See queryargs

Returns:

uploaded table

Return type:

CDataFrame

exception crandas.crandas.DataSpuriousColumnsWarning(*, columns, **kwargs)

Bases: UserWarning

Warning that data contains spurious columns

For example, below, the data has a spurious column b:

cd.DataFrame({"a": [1], "b": [2]}, schema={"a": "int"})

class crandas.crandas.ExpectDropResult(*, expected_len, **kwargs)

Bases: ResponseHandler

Response handler for drop response with given expected length

get_deferred(json_query, *, session)

Called when query is added to a transaction

Parameters:
  • json_q (JSON struct) – Query to be performed (as passed to vdl_query; in particular, with placeholders in place and without the signature used for authorization)

  • session (crandas.base.Session) – Session in which query is executed

Returns:

Return value to be provided to caller of vdl_query

Return type:

Deferred

get_dry_run_result(json_query, *, session)

Called when executing query in dry-run mode

Parameters:
  • json_q (JSON struct) – Query to be performed (as passed to vdl_query; in particular, with placeholders in place and without the signature used for authorization)

  • session (crandas.base.Session) – Session in which query is executed

Returns:

Return value to be provided to caller of vdl_query

Return type:

object

parse_response(json_query, json_answer, binary_data, prss_nonce, ix, *, session)

Called upon receiving a response to the query from the server

Parameters:
  • json_q (JSON struct) – Query to be performed (as passed to vdl_query; in particular, with placeholders in place and without the signature used for authorization)

  • json_a (JSON struct) – Answer received from server

  • binary_data (binary data stream, see crandas.queries.Query.getdata()) – Stream of binary data for answer

  • prss_nonce (str) – Server-supplied nonce for streaming uploads/downloads

  • ix (int) – Transaction index for masking (0 if not in transaction; otherwise: 1, 2, …)

Returns:

Return value to be provided to caller of vdl_query

Return type:

object

class crandas.crandas.ReturnValue(type, elperv, is_series, num_rows=None, **kwargs)

Bases: StateObject, CSeries

Represent a value or series of values computed by the engine

Various engine commands, e.g., CSeries.sum(), return values or series of values, as opposed to returning a DataFrame. This class is the analogue of CDataFrame that is used to represent such remote values.

A ReturnValue can be used as a CSeries, making it possible e.g. to filter on a value computed by the engine without having to open it. For example, the following filters all maximum elements without revealing the maximum: tab[tab["col"]==tab["col"].max(mode="regular")].

To obtain the value/series in the clear, call ReturnValue.open(). This returns a single value, unless .is_series is set, in which case it returns a Pandas series, which needs to have .num_rows rows if set.

clone()

Creates a clone of the StateObject. Only needed for objects that can be opened

The clone is used to open the object: when the object is opened, essentially, self.clone().json_to_opened(…) is called.

open_dry_run_result()

Return opened version of the object when run in dry-run mode

Should return an object of the same type as json_to_opened.

crandas.crandas.Series(*args, ctype=None, **kwargs)

Similar to pandas.Series, but additionally allows the user to specify a ctype

For example:

cd.DataFrame({"a": cd.Series([1,2,3], ctype="int16?")})

class crandas.crandas.Validation(table, col, json_desc)

Bases: object

Represents a validation that can be applied to a column.

Returned by functions like crandas.CSeriesColRef.in_range(), etc. Used as an argument to crandas.CDataFrame.validate().

json_desc can contain a combination of the following keys and values:
bounds ([int string, int string])

the lower and upper bounds of the data, represented as strings for arbitrary precision

sum_bounds ([int string, int string])

the lower and upper bounds of the sum of an array entry, represented as strings for arbitrary precision

precision (int)

fixed-point precision

is_array (bool)

boolean representing whether the column is an array

  • If the column is of type int, it can have the following keys: bounds, sum_bounds, is_array

  • If the column is of type fixed point, it can have the following keys: bounds, is_array, precision
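The key rules listed above can be written down as a small lookup. The following is a plain-Python sketch, not crandas code: the helper unexpected_keys, the table ALLOWED_KEYS, and the example descriptions (including the precision value 20) are hypothetical and only illustrate which keys belong to which column type.

```python
# Keys allowed in json_desc per column type, as listed in the description.
ALLOWED_KEYS = {
    "int": {"bounds", "sum_bounds", "is_array"},
    "fixed point": {"bounds", "is_array", "precision"},
}


def unexpected_keys(column_type, json_desc):
    """Return the keys of json_desc that are not listed for column_type.

    Hypothetical helper for illustration; it is not part of crandas.
    """
    return sorted(set(json_desc) - ALLOWED_KEYS[column_type])


# Bounds are given as strings, as described, for arbitrary precision.
int_desc = {"bounds": ["0", "100"], "is_array": False}
# Illustrative values only; sum_bounds is not listed for fixed-point columns.
fixed_desc = {"bounds": ["-1", "1"], "precision": 20, "sum_bounds": ["0", "5"]}

print(unexpected_keys("int", int_desc))            # []
print(unexpected_keys("fixed point", fixed_desc))  # ['sum_bounds']
```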

crandas.crandas.choose_threshold(obj_threshold, arg_threshold)

Choose the threshold, where obj_threshold is given using the with_threshold() function (and hence set in the CDataFrame or CSeries itself), and arg_threshold is specified as an argument to the aggregation function or filter() function.

Returns the threshold or None.

crandas.crandas.concat(tables_, *, ignore_index=True, axis=0, join='outer', **query_args)

Table concatenation

Performs horizontal or vertical concatenation of tables, modelled on pandas pd.concat. Currently, only inner joins are supported for vertical concatenation. The first table defines the set of columns of the resulting table. If join="inner", only columns common to all tables are included. Otherwise, the remaining tables must have the same set of columns as the first table (up to ordering), or an error is raised.

Parameters:
  • tables (list of CDataFrames) – One or more DataFrames to be concatenated

  • ignore_index (bool, optional) – does nothing, but is used in crandas.append, by default True

  • axis (int, optional) – Concatenation axis, 0=vertical, 1=horizontal, by default 0

  • join (str, optional) – type of join (currently only inner join is supported for vertical join), by default “outer”

  • query_args – See queryargs

Returns:

mode-dependent return table representing vertical/horizontal join

Return type:

CDataFrame

Raises:
  • RuntimeError – Received wrong inputs

  • NotImplementedError – Only limited vertical concatenation is supported: the tables to be concatenated must have a matching column

  • ValueError – Only limited vertical concatenation is supported: the number of columns must be the same in all tables

  • RuntimeError – Horizontal join would create table with duplicate column names
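The column rules for vertical concatenation described above can be modelled on clear data. The following is a plain-Python sketch, not crandas code: the helper concat_vertical is hypothetical, tables are represented as dicts mapping column names to lists, and a ValueError stands in for whichever error the engine raises.

```python
def concat_vertical(tables, join="outer"):
    """Plain-Python model of vertical concatenation (hypothetical helper).

    Each table is a dict mapping a column name to a list of values.
    """
    first = tables[0]
    if join == "inner":
        # Keep the columns of the first table that occur in every table.
        columns = [c for c in first if all(c in t for t in tables)]
    else:
        # All tables must have the same set of columns, up to ordering.
        for table in tables[1:]:
            if set(table) != set(first):
                raise ValueError("tables must have the same set of columns")
        columns = list(first)
    # Stack the rows of all tables, column by column, in first-table order.
    return {c: [v for t in tables for v in t[c]] for c in columns}


a = {"x": [1, 2], "y": [3, 4]}
b = {"y": [5], "x": [6]}
c = {"x": [7], "z": [8]}

print(concat_vertical([a, b]))                # {'x': [1, 2, 6], 'y': [3, 4, 5]}
print(concat_vertical([a, c], join="inner"))  # {'x': [1, 2, 7]}
```

Calling concat_vertical([a, c]) without join="inner" raises the error in this sketch, because the column sets differ.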

crandas.crandas.cut(series, bins, *, labels, right=True, add_inf=False)

Bin values into discrete intervals (aka quantization)

Bins values into discrete intervals, a la pandas.cut. Quantizes series into bins [bins[0],bins[1]), [bins[1],bins[2]), etc., and returns the corresponding bin labels (so labels[0] for bin [bins[0],bins[1]), labels[1] for bin [bins[1],bins[2]), etc.). The bins include the left edge and exclude the right edge.

The first bin should have -np.inf as left edge and the last bin should have np.inf as its right edge. If the argument add_inf is set to true, these edges are automatically added and do not need to be given as arguments.

The bins and labels can be given in the clear (e.g., cd.cut(cdf["col"], [-np.inf, 0, 10, np.inf], labels=[1, 2, 3])), or as columns providing the bins and labels for the respective input rows (e.g., cd.cut(cdf["col"], cdf["bins"], labels=cdf["labels"], add_inf=True)). In the latter case, the argument add_inf=True must be given.

Parameters:
  • series (CSeries) – series to apply quantization to

  • bins (int list, CSeries) – list of integers or int_vec column defining the bin edges

  • labels (int list) – list of integers or int_vec column defining the bin labels (one more element than bins if add_inf is True or one less otherwise)

  • right (bool) – specifies whether bins include their right edges

  • add_inf (bool) – when set to False, bins should include -np.inf and np.inf; when set to True they are automatically added. Can only be set to True when bins is a CSeries

Return type:

  • CSeriesFun representing the result of the quantization
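The binning rule for cut (left edge included, right edge excluded) can be modelled on clear values. The following is a plain-Python sketch, not crandas code: the helper cut_plain is hypothetical, it assumes finite input values and sorted edges, and it does not model the right argument.

```python
import bisect
import math


def cut_plain(values, bins, labels, add_inf=False):
    """Plain-Python model of left-inclusive binning (hypothetical helper)."""
    edges = list(bins)
    if add_inf:
        # Mirror add_inf=True: the infinite outer edges are added automatically.
        edges = [-math.inf] + edges + [math.inf]
    if len(labels) != len(edges) - 1:
        raise ValueError("need exactly one label per bin")
    # bisect_right(edges, v) - 1 is the index i with edges[i] <= v < edges[i + 1].
    return [labels[bisect.bisect_right(edges, v) - 1] for v in values]


# Explicit infinite edges: three bins, three labels.
print(cut_plain([-5, 0, 9, 10], [-math.inf, 0, 10, math.inf], [1, 2, 3]))  # [1, 2, 2, 3]
# Same bins with the outer edges added automatically.
print(cut_plain([-5, 0, 9, 10], [0, 10], [1, 2, 3], add_inf=True))         # [1, 2, 2, 3]
```

The value 0 falls into the second bin and 10 into the third, which reflects that each bin includes its left edge and excludes its right edge.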

crandas.crandas.dataframe_to_command(df, ctype, *, auto_bounds, schema)

Turns DataFrame into an engine “new” command. Wrapper for table_to_command.

crandas.crandas.demo_table(number_of_rows=1, number_of_columns=1, **query_args)

Create demo table.

Creates a demo table with the given number of rows and columns. The columns are respectively named “col1”, “col2”, … and have sequential integer values 1, 2, …

A nonce is included in the command so that every time this command is called, it receives a fresh table handle.

Parameters:
  • number_of_rows (int, optional) – Number of rows of resulting table, by default 1

  • number_of_columns (int, optional) – Number of columns of resulting table, by default 1

  • query_args – See queryargs

Returns:

A demo table with a fresh name

Return type:

CDataFrame

crandas.crandas.get_table(handle_or_name, /, *, schema=None, check=True, map_dummy_handles=None, session=None, **query_args)

Access a previously uploaded table by its handle or name.

The previously uploaded table is specified using the handle_or_name argument.

When get_table is called with a schema argument and/or from a recorded script, the retrieved table is checked against the schema that was specified or that was used when recording. If this schema check fails, a ValueError is raised. This check can be disabled by passing check=False.

Note that a name argument, if given, is interpreted as being part of the standard query arguments, and is thus interpreted as a target name for the result of the get query. Accordingly, get_table("a", name="b") can be used to assign the (additional) symbolic name "b" to the table with name or handle "a".

Parameters:
  • handle_or_name (str) – Handle (hex-encoded string) or name. It is interpreted as a handle if it is a string of 64 uppercase hexadecimal characters, and as a name otherwise.

  • schema (optional, default None) –

    Represents the structure of the table to be added. Needed if get_table is called from a Transaction, or if it is desired to check that the table corresponds to the given schema, by default None. The schema can be specified as:

    • schema dict (e.g., cdf.schema; see ctypes_schemas)

    • list of column names (e.g., ["col1", "col2"])

    • pandas DataFrame

    • any valid argument to pandas.read_csv

    • CIndex (e.g., cdf.columns)

  • check (bool (default: True)) –

    Enables server-side validation of the object type and schema (if given).

    In scripts, if check==True, then the schema of the table is checked against the table that the script was recorded with. To disable this check, use check=False during the script recording.

  • map_dummy_handles (bool, optional) –

    Whenever a script is being recorded (see crandas.script), the default behavior is to interpret all calls to get_table(handle) as dummy_for:<handle> table names. This allows the user to use the same handle in both script recording and execution, even though the script recording takes place in a different environment where the real table handle does not exist.

    This behavior can be overridden in two levels: for the entire script or for a single call to get_table. For the entire script, mapping dummy handles can be disabled by supplying map_dummy_handles as False in the call to crandas.script.record(). For the call to get_table, by specifying this argument as either True or False, the mapping behavior is forced to be either enabled or disabled, regardless of the current script mode.

  • query_args – See queryargs

Returns:

The table with handle handle_or_name

Return type:

CDataFrame

Raises:

ValueError

  • Schema validation failed

  • Schema not specified (when performed in a transaction)
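The rule for telling a handle from a name, given in the description of handle_or_name, can be sketched in plain Python. This is not crandas code: the helper classify_handle_or_name is hypothetical, and the exact pattern (64 characters from 0-9 and A-F) is an assumption based on that description.

```python
import re

# Assumption: a handle is exactly 64 characters drawn from 0-9 and A-F.
_HANDLE_PATTERN = re.compile(r"[0-9A-F]{64}")


def classify_handle_or_name(handle_or_name):
    """Return "handle" or "name" (hypothetical helper, not part of crandas)."""
    if _HANDLE_PATTERN.fullmatch(handle_or_name):
        return "handle"
    return "name"


print(classify_handle_or_name("AB" * 32))   # handle
print(classify_handle_or_name("ab" * 32))   # name (lowercase hex)
print(classify_handle_or_name("my_table"))  # name
```

Under this rule a lowercase hexadecimal string of the right length is treated as a name, so handles should be passed in exactly the form in which they were obtained.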

crandas.crandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, validate=None, suffixes=('_x', '_y'), session=None, **query_args)

Merge tables using a database-style join. Implements pandas.merge.

The following types of merge are supported:

  • inner join: returns only the rows where the join columns match

  • outer join: returns rows from both tables, matched where possible, starting with rows of left table in their original order

  • left join: return rows of left table in original order, matched with a row of the right table where possible

  • right join: return rows of right table, matched with a row of the left table where possible

Columns to join on are given either by a common on argument, or separate left_on and right_on arguments for the left and right tables. Depending on the arguments passed, the join columns can contain duplicates:

  • if on/left_on/right_on are column name(s), a one-to-one merge is performed, and an error is given if the merge columns contain duplicates;

  • if a CDataFrame.groupby() result is given as left_on argument, a many-to-one merge is performed where the left table can have duplicates. In this case, the servers learn the number of unique matching keys.

  • if a CDataFrame.groupby() result is given as right_on argument, a one-to-many merge is performed where the right table can have duplicates. In this case, the servers learn the number of unique matching keys.

  • many-to-many merges (where both tables contain duplicate values for the merge columns) are only supported with some leakage of information about the underlying data; see crandas.unsafe.merge_m2m().

Parameters:
  • left (CDataFrame) – Left table to be joined

  • right (CDataFrame) – Right table to be joined

  • how ("inner" (default), "outer", "left", or "right", optional) – Type of join

  • on (str or list of str, optional) – Column(s) to join on; must be common to both tables

  • left_on (str, list of str, or CDataFrameGroupBy, optional) – Column(s) of the left table to join on

  • right_on (str, list of str, or CDataFrameGroupBy, optional) – Column(s) of the right table to join on, by default None

  • validate (str, optional) – Can be "one_to_one", "1:1", "one_to_many", "1:m", "many_to_one" or "m:1". If given, it is compared against the type of merge derived from the left_on and right_on arguments as discussed above, and an exception is raised if it does not match

  • query_args – engine query arguments

Returns:

Result of the merging operation

Return type:

CDataFrame

Raises:
  • MergeError – Values of the join columns are not unique

  • ValueError – Incorrect combination of arguments
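The four join types and the one-to-one duplicate rule described above can be modelled on clear data. The following is a plain-Python sketch, not crandas code: the helper merge_one_to_one is hypothetical, rows are plain dicts, unmatched rows simply lack the other table's columns, and a ValueError stands in for the documented MergeError.

```python
def merge_one_to_one(left, right, on, how="inner"):
    """Plain-Python model of a one-to-one join on one key column (hypothetical)."""

    def index_by_key(rows):
        index = {}
        for row in rows:
            if row[on] in index:
                # crandas documents a MergeError here; ValueError keeps the
                # sketch self-contained.
                raise ValueError(f"duplicate join key: {row[on]!r}")
            index[row[on]] = row
        return index

    left_index, right_index = index_by_key(left), index_by_key(right)
    result = []
    if how in ("inner", "left", "outer"):
        # Walk the left table in its original order.
        for row in left:
            match = right_index.get(row[on])
            if match is not None:
                result.append({**match, **row})
            elif how != "inner":
                result.append(dict(row))
        if how == "outer":
            # Append the right rows that found no partner.
            result.extend(dict(r) for r in right if r[on] not in left_index)
    elif how == "right":
        # Walk the right table, matching left rows where possible.
        for row in right:
            match = left_index.get(row[on])
            result.append({**match, **row} if match is not None else dict(row))
    else:
        raise ValueError(f"unsupported join type: {how!r}")
    return result


left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}]
for how in ("inner", "left", "outer", "right"):
    print(how, merge_one_to_one(left, right, "id", how=how))
```

In this sketch the inner join yields only the row with id 2, while the outer join yields the left rows in their original order followed by the unmatched right row.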

crandas.crandas.pandas_dataframe_schema(df, ctype=None, auto_bounds=None, schema=None)

Determine schema for pandas DataFrame

Tries to encode the given data, and returns schema of crandas DataFrame that would result from calling crandas.crandas.upload_pandas_dataframe() with the passed ctype and schema. See ctypes_schemas.

Returns:

schema; see ctypes_schemas

Return type:

dict

crandas.crandas.parquet_to_command(pq_table, ctype, *, auto_bounds, schema)

Turns Parquet object into an engine “new” command. Wrapper for table_to_command.

crandas.crandas.read_csv(file_name, **kwargs)

Upload the given CSV file to the engine

Internally calls pd.read_csv(file_name, ...) to read the CSV file and calls upload_pandas_dataframe(...) on the resulting DataFrame.

Parameters:
  • file_name (str | PathLike) – name of the file

  • **kwargs (any) –

    • any keyword arguments to pd.read_csv

    • any arguments to upload_pandas_dataframe(); in particular, this includes query arguments (see queryargs) such as dummy_for for use during query design; and name to assign a name to the table

Returns:

uploaded table

Return type:

CDataFrame

crandas.crandas.read_csv_schema(file_name, ctype=None, auto_bounds=None, schema=None, **kwargs)

Determine schema for CSV file

Tries to load the CSV and encode it for use by crandas. Returns the schema of the crandas DataFrame that would result from calling crandas.crandas.read_csv(). See ctypes_schemas.

Returns:

schema; see ctypes_schemas

Return type:

dict

crandas.crandas.read_parquet(file_name, **kwargs)

Upload the given Apache Parquet file to the engine

If file_name is a pyarrow.Table, calls file_name.to_pandas(...) and then applies upload_pandas_dataframe() to the result. Otherwise, calls ParquetFile(file_name, ...) and then applies upload_streaming_file() to the result. Keyword arguments **kwargs can be provided for any of these called functions.

Parameters:
  • file_name (str | Path-like object | pyarrow.Table) – name of the file or path to the file to be read

  • **kwargs (any) –

    • any keyword arguments to pyarrow.Table.to_pandas, ParquetFile, upload_pandas_dataframe(), or upload_streaming_file(). In particular, this includes query arguments (see queryargs) such as dummy_for for use during query design; and name to assign a name to the table

Returns:

uploaded table

Return type:

CDataFrame

crandas.crandas.remove_objects(objects, **query_args)

Remove objects from server

If the list of objects contains a non-existent object, an error is raised. All existing objects in the list are still removed.

Raises:

EngineError – Some of the given objects did not exist

crandas.crandas.series_max(col1, col2)

Compute the maximum of two CSeries

Parameters:
  • col1 (CSeries) – integer series

  • col2 (CSeries) – integer series

Returns:

the maximum of the values of col1 and col2

Return type:

integer CSeries

crandas.crandas.series_min(col1, col2)

Compute the minimum of two CSeries

Parameters:
  • col1 (CSeries) – integer series

  • col2 (CSeries) – integer series

Returns:

the minimum of the values of col1 and col2

Return type:

integer CSeries

crandas.crandas.upload_pandas_dataframe(df, ctype=None, session=None, _keep=True, *, auto_bounds=None, schema=None, **query_args)

Uploads an existing pandas DataFrame into the engine

Parameters:
  • df (pandas.DataFrame) – DataFrame to upload

  • ctype (dict, default None) – ctypes for columns; see ctypes_schemas

  • session (cd.base.Session) – see queryargs

  • _keep (bool/None) – if False, table is not saved

  • auto_bounds (bool, optional) – See queryargs

  • schema (dict, default None) – schema for uploaded data; see ctypes_schemas

  • **query_args (query arguments) – see queryargs

Returns:

the uploaded DataFrame

Return type:

CDataFrame

crandas.crandas.upload_streaming_file(file_reader, file_type, ctype=None, session=None, _keep=True, *, auto_bounds=None, schema=None, **query_args)

Uploads a file into the engine. Instead of loading the entire file at once, a file reader loads the table column by column during the upload. Each column is deleted afterwards, which uses considerably less memory. Currently only works for Parquet files.

Parameters:
  • file_reader (pq.ParquetFile) – file to upload

  • file_type (str) – a string describing the type of file of file_reader

  • ctype (dict, default: None) – explicitly given types for columns

  • name (str, optional) – name for the table; passed on to upload_pandas_dataframe() if given, by default None

  • auto_bounds (bool, optional) – See queryargs

  • schema (dict, default None) – schema for uploaded data; see ctypes_schemas

Returns:

the uploaded DataFrame

Return type:

CDataFrame