crandas.crandas
Main crandas functionality: dataframes (DataFrame), series (CSeries), and
analysis operations (e.g., merge())
CDataFrame = DataFrame
module-attribute
Deprecated
Use of this alias is deprecated
Alias for DataFrame
CIndex(cols, **kwargs)
Index (set of columns) of a crandas DataFrame
For a regular DataFrame, this represents the columns (name and type) of the DataFrame.
For a deferred DataFrame (in a transaction, or resulting from a dry run), this represents the columns (name and type) that the result of an operation is expected to have based on its inputs. For such an expected column, the name is set, but the type and size ("elements per value") may be undefined.
Constructor
| PARAMETER | DESCRIPTION |
|---|---|
| cols | List of columns (Col) |
ctype
property
Ctype corresponding to the CIndex; see Ctype Schemas
schema
property
Schema corresponding to the CIndex; see Ctype Schemas
__eq__(other)
Checks equality with input
__getitem__(ix)
Returns name of column ix
__len__()
Returns number of columns
get_loc(name)
Get integer location for requested label
| PARAMETER | DESCRIPTION |
|---|---|
| name | Column name label |
| RETURNS | DESCRIPTION |
|---|---|
| int | Index of the column with the given name |
| RAISES | DESCRIPTION |
|---|---|
| KeyError | Value not found |
matches_template(expected)
Checks whether the number and names of columns fit a template
to_dict()
Returns column names in dictionary form
CSeries(**kwargs)
Bases: Summable
One-dimensional array which represents either a column of a DataFrame
or the result of applying a row-wise function to one or more columns of a cd.DataFrame
notnull = notna
class-attribute
instance-attribute
Alias for notna
DT(outer_instance)
Used to retrieve date units in the pandas way
all(*, mode='open', **query_args)
Computes whether all values of the boolean series are true
See here for a description of the arguments.
any(*, mode='open', **query_args)
Computes whether any value of the boolean series is true
See here for a description of the arguments.
as_series(**query_args)
Return crandas.crandas.ReturnValue representing the series
as_table(*, column_name='', **query_args)
Outputs a DataFrame that has the CSeries as its only column
| PARAMETER | DESCRIPTION |
|---|---|
| column_name | Name for the column in the resulting DataFrame |
| query_args | See Query Arguments |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | |
as_value(**query_args)
Interpret single-row CSeries as value
Interpret single-row CSeries as a constant. For example, without using
as_value, a["col2"]+b["col2"] performs row-wise addition. With
as_value, a single-row CSeries is interpreted as a single value
instead of as a column, e.g., a["col2"]+b["col2"].as_value() adds
the single value in column col2 of table b to each row of
a["col2"].
This function can be used to work with values that remain secret to the servers that perform the computation, e.g.:
import crandas as cd
data = cd.DataFrame({"a": [1, 2, 3]})
# The value "1" is a part of the function definition and so becomes known
# to the servers
data[data["a"] == 1]
# The value "1" is derived from a single-row column of a private table and
# so remains hidden to the servers
filtervalue = cd.DataFrame({"filtervalue": [1]})["filtervalue"].as_value()
data[data["a"] == filtervalue]
| PARAMETER | DESCRIPTION |
|---|---|
| query_args | See Query Arguments |
| RETURNS | DESCRIPTION |
|---|---|
| ReturnValue | ReturnValue representing the value of the (only row of the) CSeries |
astype(ctype, validate=False)
Converts output to a specific type
| PARAMETER | DESCRIPTION |
|---|---|
| ctype | Type to convert to. See data-types |
| validate | If set, validate that the resulting column is of the correct type, e.g., that it is an 8-bit integer when converting to an 8-bit integer type |
| RETURNS | DESCRIPTION |
|---|---|
| CSeries | CSeries converted to given type |
| RAISES | DESCRIPTION |
|---|---|
| EngineError | Conversion failed or not supported |
b64decode()
Decodes a Base64-encoded string as a bytes column
b64encode()
Encodes a bytes column as a Base64-encoded string
bytes_to_hex()
Converts a bytes column to a lowercase hex string (like "3ec9")
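The encodings handled by b64encode, b64decode, bytes_to_hex and hex_to_bytes can be illustrated on plain Python bytes. The sketch below uses only the Python standard library and does not call crandas; that the engine produces the same text for the same bytes is an assumption.

```python
import base64

# Two raw bytes, matching the "3ec9" example used in this reference.
raw = bytes([0x3E, 0xC9])

# bytes -> lowercase hex string, and back again.
hex_text = raw.hex()
from_hex = bytes.fromhex(hex_text)

# bytes -> Base64 text, and back again.
b64_text = base64.b64encode(raw).decode("ascii")
from_b64 = base64.b64decode(b64_text)
```

Both round trips return the original bytes, which is the property the paired column methods are expected to preserve.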
capitalize()
Returns string with the first character converted to uppercase and all other ones to lowercase
contains(other)
Substring search
Searches for substring in the column
| PARAMETER | DESCRIPTION |
|---|---|
| other | Substring to search for |
| RETURNS | DESCRIPTION |
|---|---|
| CSeriesFun | Result of the search |
corr(other, method='pearson', min_periods=None)
Compute correlation with other CSeries. This process computes the Pearson
correlation coefficient, which is a measure of linear correlation. For more
information, see https://en.wikipedia.org/wiki/Pearson_correlation_coefficient.
| PARAMETER | DESCRIPTION |
|---|---|
| other | CSeries with which to compute the correlation. |
| method | Method used to compute the correlation. Currently only "pearson" is supported. |
| min_periods | Minimum number of observations needed to have a valid result. |
| RETURNS | DESCRIPTION |
|---|---|
| CSeries | Correlation with other. |
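For reference, the Pearson coefficient itself can be computed on cleartext data as follows. This is a minimal plain-Python sketch of the formula, not the crandas implementation; the helper name pearson is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) when either sequence is constant.
    return cov / math.sqrt(var_x * var_y)

perfect = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly linear, increasing
inverse = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly linear, decreasing
```

The result lies between -1 and 1, with the extremes reached for exact linear relationships.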
count(*, mode='open', **query_args)
Computes the number of not-NULL elements of the series
See here for a description of the arguments.
day()
Returns the day of the month
| RETURNS | DESCRIPTION |
|---|---|
| int | Day of the month |
day_of_year()
Returns the day of the year
| RETURNS | DESCRIPTION |
|---|---|
| int | Number representing the day of the year |
dayofyear()
Returns the day of the year
| RETURNS | DESCRIPTION |
|---|---|
| int | Number representing the day of the year |
drop_duplicates(*, keep='first', inplace=False, ignore_index=True)
Remove duplicate rows from CSeries.
| PARAMETER | DESCRIPTION |
|---|---|
| keep | Determines which duplicates to keep. 'first': only keep the first occurrence. 'last': only keep the last occurrence. 'any': keep a single random occurrence (this option is more efficient). False: remove all occurrences. |
| RETURNS | DESCRIPTION |
|---|---|
| CSeries | CSeries with the duplicates removed |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | keep must be either "first", "last", "any" or False |
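The four keep options can be illustrated on a cleartext list. This is a plain-Python sketch of the semantics described above, not crandas code; the function below is a hypothetical stand-in, and the assumption that surviving values keep their original relative order follows pandas rather than anything stated here.

```python
import random

def drop_duplicates(values, keep="first"):
    """Cleartext illustration of the keep options."""
    if keep not in ("first", "last", "any", False):
        raise ValueError('keep must be either "first", "last", "any" or False')
    # Record every position at which each distinct value occurs.
    positions = {}
    for i, v in enumerate(values):
        positions.setdefault(v, []).append(i)
    if keep == "first":
        kept = [idx[0] for idx in positions.values()]
    elif keep == "last":
        kept = [idx[-1] for idx in positions.values()]
    elif keep == "any":
        kept = [random.choice(idx) for idx in positions.values()]
    else:  # keep is False: drop every value that occurs more than once
        kept = [idx[0] for idx in positions.values() if len(idx) == 1]
    return [values[i] for i in sorted(kept)]

data = ["a", "b", "a", "c", "b"]
first = drop_duplicates(data, keep="first")
last = drop_duplicates(data, keep="last")
any_ = drop_duplicates(data, keep="any")
none = drop_duplicates(data, keep=False)
```

With keep="any" the set of surviving values is fixed, but which occurrence survives (and therefore its position) is not.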
encode(encoding='utf-8', *, output_size=None)
Encodes an ASCII varchar column as a bytes column
| PARAMETER | DESCRIPTION |
|---|---|
| encoding | The encoding in which to encode the string column. Currently, only "utf-8" is supported. |
| output_size | Maximum bytes length of the output column. By default equal to twice the maximum string length of the input column. Needs to be set when this default is not large enough, which can happen when the input contains many complex Unicode characters. Can also be set to a lower value to improve the performance of later operations. |
| RETURNS | DESCRIPTION |
|---|---|
| CSeries | CSeries with the encoded bytes column |
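To see when the default output_size (twice the string length) can be too small, compare character counts with UTF-8 byte counts in plain Python. This sketch does not call crandas; it assumes that string length is counted in Unicode code points, as Python's len does, and it compares per value rather than per column.

```python
# Sample strings written with escapes so the code points are unambiguous.
samples = [
    "abc",                      # ASCII: 1 byte per character
    "caf\u00e9",                # e with acute accent: 2 bytes
    "\u20acuro",                # euro sign: 3 bytes
    "\u65e5\u672c\u8a9e",       # three CJK characters: 3 bytes each
    "\U0001F642\U0001F642",     # two emoji: 4 bytes each
]

# Map each string to (number of characters, number of UTF-8 bytes).
sizes = {s: (len(s), len(s.encode("utf-8"))) for s in samples}

# Strings whose encoding needs more than twice their character count.
default_too_small = [
    s for s, (n_chars, n_bytes) in sizes.items() if n_bytes > 2 * n_chars
]
```

Strings made up mostly of 3-byte or 4-byte characters exceed the factor of two, which is the situation in which output_size has to be set explicitly.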
fillna(nullval)
Replaces NULL values in the column by nullval
| PARAMETER | DESCRIPTION |
|---|---|
| nullval | Value to replace NULLs by |
filter_chars(chars)
Filters strings according to a set of permitted characters
| PARAMETER | DESCRIPTION |
|---|---|
| chars | Permitted characters |
| RETURNS | DESCRIPTION |
|---|---|
| CSeries | CSeries with filtered strings |
fullmatch(pattern, *args)
Regular expression matching
Matches column to a regular expression.
| PARAMETER | DESCRIPTION |
|---|---|
| pattern | Regular expression to match |
| args | Additional columns for the match (can be referred to by |
| RETURNS | DESCRIPTION |
|---|---|
| CSeriesFun | Column containing the result of the matching |
get(*, name='', **query_args)
Deprecated. Use .as_table() instead.
hex_to_bytes()
Converts a lowercase hex string (like "3ec9") to a bytes column
if_else(ifval, elseval, **query_args)
Select values from ifval or elseval depending on self
Acts as an if-else statement by selecting the value from ifval for rows where self
is equal to one, and selecting the value from elseval for rows where self is equal
to zero.
For example, (cdf["a"]>0).if_else(cdf["a"], 0) returns a CSeries that has the
value cdf["a"] if the condition (cdf["a"]>0) is satisfied, and the value 0 otherwise.
Use for example as cdf.assign(a=(cdf["a"]>0).if_else(cdf["a"], 0)).
This function can also be applied to two instances of DataFrame. This is useful
for assigning values to multiple columns at the same time based on the same condition.
In this case, a DataFrame is returned that, for
each row, contains the values from ifval for rows where self is equal to one, and
the values from elseval for rows where self is equal to zero. The arguments need to
have the same set of columns.
In this latter variant, if the order of the columns of ifval and elseval is
different, the columns are returned in the order of ifval. For example,
(cdf["a"]>0).if_else(cdf, cdf2) selects values from the dataframe cdf whenever
(cdf["a"]>0), and from cdf2 otherwise. Since if_else is applied rowwise, it is
important to ensure that the order of the rows in self, ifval, and elseval is
the same, e.g., elseval should not be obtained from ifval by a filtering, by an
inner join, or similar.
Tip
cond.if_else(ifval, elseval) is equivalent to ifval.where(cond, elseval).
| PARAMETER | DESCRIPTION |
|---|---|
| ifval | Value(s) if true |
| elseval | Value(s) otherwise |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame (if both arguments are DataFrame) or CSeries (otherwise) | Values from ifval or elseval, depending on self |
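The relation between if_else, where and mask stated in the tips can be checked on cleartext lists. The three helpers below are hypothetical plain-Python stand-ins that mirror the row-wise semantics; they are not crandas functions.

```python
def if_else(cond, ifval, elseval):
    # Row-wise: take ifval where cond is 1, elseval where cond is 0.
    return [i if c else e for c, i, e in zip(cond, ifval, elseval)]

def where(values, cond, other):
    # Keep the value where cond holds, replace it by other where it does not.
    return [v if c else o for v, c, o in zip(values, cond, other)]

def mask(values, cond, other):
    # Replace the value by other where cond holds, keep it where it does not.
    return [o if c else v for v, c, o in zip(values, cond, other)]

a = [3, -1, 0, 7]
cond = [1 if x > 0 else 0 for x in a]   # plays the role of cdf["a"] > 0
zeros = [0] * len(a)

clipped = if_else(cond, a, zeros)       # negative values replaced by 0
via_where = where(a, cond, zeros)       # ifval.where(cond, elseval)
via_mask = mask(zeros, cond, a)         # elseval.mask(cond, ifval)
```

All three expressions yield the same list, and all three rely on the rows of cond, ifval and elseval being aligned.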
inner(other)
Inner product of two vectors
isin(values)
Check per row whether value is contained in the given list of values.
For example cdf["a"].isin([1,2,3]) checks, for each row, whether
the value of the column a is equal to 1, 2, or 3. The provided
values can also be functions, that are evaluated row-wise. For example,
cdf["a"].isin([cdf["b"], cdf["c"]]) checks for each row whether the
value of column a in that row is equal to the value of column b
in that row or the value of column c in that row.
It is possible to use placeholders for individual values of the list
(e.g., cdf["a"].isin([1, Any(2)]) is allowed) but not for the list
as a whole (e.g., cdf["a"].isin(Any([1,2])) is not allowed).
| PARAMETER | DESCRIPTION |
|---|---|
| values | List of values to check against. Each value needs to be a valid rowwise function (e.g., a constant or a column) |
| RETURNS | DESCRIPTION |
|---|---|
| CSeries | Rowwise function (for use in filter, assign, ...) representing whether the given value is in the given list |
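The difference between constant candidates and column candidates can be sketched on cleartext lists. The helper isin_rowwise is hypothetical and only mirrors the row-wise semantics described above, with Python lists standing in for columns.

```python
def isin_rowwise(column, candidates):
    """Per row, 1 if the column value equals any candidate, else 0.

    A candidate that is a list is treated as a column and read at the
    current row; anything else is treated as a constant.
    """
    result = []
    for row, value in enumerate(column):
        row_candidates = [c[row] if isinstance(c, list) else c for c in candidates]
        result.append(1 if value in row_candidates else 0)
    return result

a = [1, 5, 3]
b = [1, 2, 9]
c = [0, 5, 4]

in_constants = isin_rowwise(a, [1, 2, 3])   # like cdf["a"].isin([1, 2, 3])
in_columns = isin_rowwise(a, [b, c])        # like cdf["a"].isin([cdf["b"], cdf["c"]])
```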
isna()
Returns whether respective values are NULL, boolean inverse of notna
len()
Returns the character length of each element of the CSeries (only works for a CSeries of type string or bytes)
| RETURNS | DESCRIPTION |
|---|---|
| CSeriesFun | CSeries of character lengths (for string) or number of bytes (for bytes) |
lower(indices=None)
Returns string values in lowercase
| PARAMETER | DESCRIPTION |
|---|---|
| indices | Represents the letters to be modified |
| RETURNS | DESCRIPTION |
|---|---|
| CSeries | CSeries with lowercase strings |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | Invalid index for string length |
mask(cond, other=pd.NA, *, inplace=False, axis=None, level=None)
Replace value where the condition is True.
Replacement is performed row-wise. Returns a CSeries that can be used
with .assign(), etc.
Tip
elseval.mask(cond, ifval) is equivalent to cond.if_else(ifval, elseval).
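| PARAMETER | DESCRIPTION |
|---|---|
| cond | Where cond is False, keep the original value; where cond is True, replace it with the corresponding value from other |
| other | Entries where cond is True are replaced with the corresponding value from other |
| RETURNS | DESCRIPTION |
|---|---|
| CSeries | Values from self or other, depending on cond |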
max(*, mode='open', **query_args)
Computes the maximum of the series
See here for a description of the arguments.
mean(*, mode='open', **query_args)
Computes the mean of the elements of the series.
See here for a description of the arguments.
min(*, mode='open', **query_args)
Computes the minimum of the series
See here for a description of the arguments.
month()
Returns the month in number format
| RETURNS | DESCRIPTION |
|---|---|
| int | Month |
notna()
Returns whether respective values are not NULL, boolean inverse of isna
open(*, limit=None, offset=None)
Returns column in opened form
| PARAMETER | DESCRIPTION |
|---|---|
| limit | Limit to the number of opened rows. The number of returned rows will be the minimum of the number of rows remaining and the provided limit. |
| offset | Offset of the opened rows. Start returning rows from the provided 0-based index. |
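The interaction of limit and offset corresponds to ordinary slicing. The helper open_rows below is a hypothetical plain-Python sketch of that behaviour on a cleartext list; it does not call crandas.

```python
def open_rows(rows, *, limit=None, offset=None):
    """Return at most `limit` rows, starting at the 0-based index `offset`."""
    start = 0 if offset is None else offset
    remaining = rows[start:]
    # With a limit, return min(len(remaining), limit) rows.
    return remaining if limit is None else remaining[:limit]

rows = list(range(10))

first_three = open_rows(rows, limit=3)
tail = open_rows(rows, limit=3, offset=8)   # only two rows remain
from_four = open_rows(rows, offset=4)
```

Stepping offset by limit in a loop pages through a column without opening all rows at once.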
sqrt()
Returns the element-wise square root of this column, which should be numeric.
Note: in case there are negative numbers in the column, this function will throw an error.
| RETURNS | DESCRIPTION |
|---|---|
| CSeries | CSeries containing the square roots of the inputs |
std(*, mode='open', ddof=1, **query_args)
Computes the standard deviation of the series.
See here for a description of the arguments.
| PARAMETER | DESCRIPTION |
|---|---|
| ddof | Delta Degrees of Freedom. The divisor used in calculations is N - ddof, with N the number of rows. A ddof of 1 corresponds to the sample std and a ddof of 0 to the population std. |
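The effect of ddof can be reproduced on cleartext data. This plain-Python sketch implements the N - ddof divisor directly and compares it with the standard library's statistics module; the helper name std is hypothetical and the code does not call crandas.

```python
import math
import statistics

def std(values, ddof=1):
    """Standard deviation with divisor N - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))

data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_std = std(data, ddof=1)        # divisor N - 1
population_std = std(data, ddof=0)    # divisor N

# Reference values from the standard library.
reference_sample = statistics.stdev(data)
reference_population = statistics.pstdev(data)
```

The variance is the square of the same quantity, so the ddof argument of var behaves in the same way.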
strip()
Returns stripped string values
substitute(sub_dict, output_size=None)
Performs string substitution in a string column.
Inputs are provided in a dictionary of the form "a": ["á", "à", "ä"]
where the characters in the list ["á", "à", "ä"] will be substituted by the key "a".
Substitution of substrings of more than one character is not currently supported.
| PARAMETER | DESCRIPTION |
|---|---|
| sub_dict | Dictionary where each key is the string (may be more than one character) to be added and each value is a list of characters to be substituted by the key |
| output_size | Maximum string length of the output column. By default equal to the maximum string length of the input column. Needs to be set when this default is not large enough, which can happen when substituting a character for multiple ones. |
| RETURNS | DESCRIPTION |
|---|---|
| CSeries | CSeries with modified strings |
| RAISES | DESCRIPTION |
|---|---|
| TypeError | Values of sub_dict in |
| TypeError | Every character should match at most one substitution. |
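The shape of sub_dict (replacement as key, list of characters to replace as value) is the inverse of a character translation table. The sketch below builds such a table in plain Python with str.translate; the helper substitute and its error messages are hypothetical and only mirror the rules described above.

```python
def substitute(text, sub_dict):
    """Replace every character listed in a value of sub_dict by its key."""
    table = {}
    for replacement, chars in sub_dict.items():
        if not isinstance(chars, list):
            raise TypeError("values of sub_dict must be lists of characters")
        for ch in chars:
            if len(ch) != 1:
                raise TypeError("only single characters can be substituted")
            if ord(ch) in table:
                raise TypeError("every character should match at most one substitution")
            # str.translate maps code points to replacement strings.
            table[ord(ch)] = replacement
    return text.translate(table)

# Accented variants of "a" are all mapped to plain "a".
accents = {"a": ["\u00e1", "\u00e0", "\u00e4"]}
plain = substitute("\u00e1rbol m\u00e4s", accents)

# A multi-character key makes the output longer than the input.
longer = substitute("stra\u00dfe", {"ss": ["\u00df"]})
```

The second call shows why output_size may need to be raised: one input character becomes two output characters.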
sum(*, mode='open', **query_args)
Computes the sum of the elements of the series.
| PARAMETER | DESCRIPTION |
|---|---|
| mode | Mode in which to perform queries that return objects ("open" / "defer" / "regular"), by default "open" |
| query_args | See Query Arguments |
| RETURNS | DESCRIPTION |
|---|---|
| int / float / Deferred / DataFrame / DataFrame | Result of applicable type, depending on as_table and mode. Note that the return value is a regular Python int/float, rather than a numpy int/float. |
sum_squares(*, mode='open', **query_args)
Computes the sum of squares of the elements of the series.
See here for a description of the arguments.
upper(indices=None)
Returns string values in uppercase
| PARAMETER | DESCRIPTION |
|---|---|
| indices | Represents the letters to be modified |
| RETURNS | DESCRIPTION |
|---|---|
| CSeries | CSeries with uppercase strings |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | Invalid index for string length |
var(*, mode='open', ddof=1, **query_args)
Computes the variance of the series.
See here for a description of the arguments.
| PARAMETER | DESCRIPTION |
|---|---|
| ddof | Delta Degrees of Freedom. The divisor used in calculations is N - ddof, with N the number of rows. A ddof of 1 corresponds to the sample variance and a ddof of 0 to the population variance. |
vsum()
Sum the elements of a vector
weekday()
Returns the day of the week, where Monday is 0
| RETURNS | DESCRIPTION |
|---|---|
| int | Number representing the day of the week |
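The Monday-is-0 convention matches Python's own datetime.date.weekday(). The cleartext sketch below extracts the same date units with the standard library; that day_of_year is 1-based in crandas, as it is here and in pandas, is an assumption.

```python
from datetime import date

d = date(2024, 3, 1)   # a Friday in a leap year

parts = {
    "year": d.year,
    "month": d.month,
    "day": d.day,                          # day of the month
    "weekday": d.weekday(),                # Monday is 0, Sunday is 6
    "day_of_year": d.timetuple().tm_yday,  # 1 January is day 1
}
```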
where(cond, other=pd.NA, *, inplace=False, axis=None, level=None)
Replace value where the condition is False.
Replacement is performed row-wise. Returns a CSeries that can be used
with .assign(), etc.
Tip
ifval.where(cond, elseval) is equivalent to cond.if_else(ifval, elseval).
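| PARAMETER | DESCRIPTION |
|---|---|
| cond | Where cond is True, keep the original value; where cond is False, replace it with the corresponding value from other |
| other | Entries where cond is False are replaced with the corresponding value from other |
| RETURNS | DESCRIPTION |
|---|---|
| CSeries | Values from self or other, depending on cond |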
with_threshold(threshold)
Adds a threshold to the CSeries. When the column is used as a filtering column or in an aggregation operation, this threshold indicates the minimum number of items that need to be in the filtering result or have the aggregation taken over.
| PARAMETER | DESCRIPTION |
|---|---|
| threshold | Minimum number of elements for the operation to be allowed |
withnull(nulls)
Set values indicated by nulls to NULL
| PARAMETER | DESCRIPTION |
|---|---|
| nulls | Indicates, with True values, which rows to replace by NULL |
year()
Returns the year
| RETURNS | DESCRIPTION |
|---|---|
| int | Year in 4 digits |
CSeriesColRef(table, name, **kwargs)
Bases: CSeries
Column of DataFrame
Subclass of CSeries. Represents a column of a DataFrame df as accessed via
df["colname"] or lambda x: x.colname.
ctype
property
Ctype for the column; see Ctype Schemas
schema
property
Schema for the column; see Ctype Schemas
all(*, as_table=False, threshold=None, **query_args)
Computes whether all values of the boolean series are true
See here for a description of the arguments.
any(*, as_table=False, threshold=None, **query_args)
Computes whether any value of the boolean series is true
See here for a description of the arguments.
count(*, as_table=False, threshold=None, **query_args)
Computes the number of not-NULL elements of the series
See here for a description of the arguments.
get(*, name='', **query_args)
Deprecated. Use .as_table() instead.
See documentation of CSeries.get for details about deprecation.
in_range(minval, maxval)
Validation that column values lie in specified range
Applies to numerical/numerical vector columns only
| PARAMETER | DESCRIPTION |
|---|---|
| minval | Minimum (inclusive) |
| maxval | Maximum (inclusive) |
| RETURNS | DESCRIPTION |
|---|---|
| Validation | Validator for use in crandas.crandas.DataFrame.validate |
ix()
Returns the index of the column
max(*, as_table=False, threshold=None, **query_args)
Computes the maximum of the series
See here for a description of the arguments.
mean(*, as_table=False, threshold=None, **query_args)
Computes the mean of the elements of the series.
See here for a description of the arguments.
min(*, as_table=False, threshold=None, **query_args)
Computes the minimum of the series
See here for a description of the arguments.
std(*, ddof=1, as_table=False, threshold=None, **query_args)
Computes the standard deviation of the series.
See here for a description of the arguments.
| PARAMETER | DESCRIPTION |
|---|---|
| ddof | Delta Degrees of Freedom. The divisor used in calculations is N - ddof, with N the number of rows. A ddof of 1 corresponds to the sample std and a ddof of 0 to the population std. |
sum(*, as_table=False, threshold=None, **query_args)
Computes the sum of the elements of the series
| PARAMETER | DESCRIPTION |
|---|---|
| as_table | If True, the result is returned as pd.DataFrame instead of value |
| threshold | If given, only return the value as long as the number of not-NULL elements is above the minimum threshold of elements for the operation |
| query_args | See Query Arguments |
| RETURNS | DESCRIPTION |
|---|---|
| int / float / Deferred / DataFrame / DataFrame | Result of applicable type, depending on as_table and mode |
sum_in_range(minval, maxval)
Validation that sum of column values lies in specified range
Applies to integer/integer vector columns only
| PARAMETER | DESCRIPTION |
|---|---|
| minval | Minimum (inclusive) |
| maxval | Maximum (inclusive) |
| RETURNS | DESCRIPTION |
|---|---|
| Validation | Validator for use in crandas.crandas.DataFrame.validate |
sum_squares(*, as_table=False, threshold=None, **query_args)
Computes the sum of squares of the elements of the series.
See here for a description of the arguments.
type()
Returns the type of a column
var(*, ddof=1, as_table=False, threshold=None, **query_args)
Computes the variance of the series.
See here for a description of the arguments.
| PARAMETER | DESCRIPTION |
|---|---|
| ddof | Delta Degrees of Freedom. The divisor used in calculations is N - ddof, with N the number of rows. A ddof of 1 corresponds to the sample variance and a ddof of 0 to the population variance. |
CSeriesFun(op, vals, args={}, **kwargs)
Col(name, type='?', elperv=-1, nullable=False, constraints=None, modulus=None, _ctype=None, _schema=None, **kwargs)
Represents the type of a column.
The type and elperv fields can be equal to "?" and -1, respectively,
to indicate that these are not known (e.g., for columns in an expected
specification to vdl_query).
| PARAMETER | DESCRIPTION |
|---|---|
| name | Column name |
| type | Column type. This can be "?" if the column type is not known (for example, a function return value of a Transaction). |
| elperv | Number of secret-shared elements that represent a single column value. This can be -1 if not known. |
ctype
property
Ctype for the column; see Ctype Schemas
schema
property
Schema for the column; see Ctype Schemas
__eq__(other)
Checks structural equality between columns
__repr__()
Returns printable representation
renamed(name)
Return copy of the Col with a different name
CtypeSpuriousColumnsWarning(*, columns, **kwargs)
DataFrame(data=None, *args, ctype=None, schema=None, auto_bounds=None, **kwargs)
Bases: StateObject
Dataframe stored in the engine
The DataFrame class provides access to tables stored in the engine using an API
modeled upon pandas DataFrame.
The constructor creates and uploads a DataFrame similarly to Pandas. A DataFrame may
alternatively be obtained:
- by uploading data into the engine using read_csv/read_parquet or upload_pandas_dataframe
- by accessing an earlier uploaded table using get_table
To see detailed information about the columns of a DataFrame, use the .columns attribute.
The constructor calls the pandas DataFrame constructor
pd.DataFrame(data, *args, ...), and uploads the resulting table using
upload_pandas_dataframe.
Further arguments apart from data are passed to pd.DataFrame or
upload_pandas_dataframe as appropriate.
In particular, name can be used to specify a name for the table, and
auto_bounds can be used to disable warnings about automatically
derived column bounds; see Query Arguments.
Further, the ctype and schema arguments can be used to define
columns and their types; see upload_pandas_dataframe.
When specifying data as a dict of Series arguments, it is
possible to use crandas.crandas.Series instead of pd.Series. This
allows the crandas ctype to be specified directly with the data.
| PARAMETER | DESCRIPTION |
|---|---|
| data | Contents of the dataframe, passed on to pd.DataFrame |
| ctype | ctypes for columns; see Ctype Schemas |
| schema | Schema for uploaded data; see Ctype Schemas |
| auto_bounds | See Query Arguments |
columns = ObjectProperty('_columns')
class-attribute
instance-attribute
Columns of the DataFrame, call .columns.cols to iterate; see crandas.crandas.CIndex
ctype
property
Ctype for the table; see Ctype Schemas
nrows = ObjectProperty('_nrows')
class-attribute
instance-attribute
Number of rows of the DataFrame
schema
property
Schema for the table; see Ctype Schemas
shape
property
Returns a pair with the number of rows and number of columns
__getitem__(key)
Implements df[key]
- If key is a CSeries or a function, call .filter().
- If key is a list, call crandas.crandas.DataFrame.project.
- If key is a str, return a CSeries representing the column with the given name.
- If key is a slice, call DataFrame.slice().
| RAISES | DESCRIPTION |
|---|---|
| TypeError | The key must be one of the accepted types |
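The dispatch on the key type can be pictured as follows. This is a hypothetical plain-Python sketch: FakeCSeries stands in for a crandas CSeries, getitem_action is not a crandas function, and the returned strings merely name the operation that df[key] is described as delegating to.

```python
class FakeCSeries:
    """Hypothetical stand-in for a crandas CSeries; not part of crandas."""

def getitem_action(key):
    """Name the operation that df[key] delegates to, based on the key type."""
    if isinstance(key, FakeCSeries) or callable(key):
        return "filter"
    if isinstance(key, list):
        return "project"
    if isinstance(key, str):
        return "column"
    if isinstance(key, slice):
        return "slice"
    raise TypeError("the key must be one of the accepted types")

keys = (FakeCSeries(), lambda x: x.col1 == 1, ["a", "b"], "a", slice(0, 10))
actions = [getitem_action(k) for k in keys]

# Any other key type is rejected.
try:
    getitem_action(42)
    rejected = False
except TypeError:
    rejected = True
```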
__setitem__(key, value)
Implements self[key] = value. Updates the dataframe inplace with the new column.
| PARAMETER | DESCRIPTION |
|---|---|
| key | Column name to be assigned. |
| value | Value to assign to the specified column. |
add_prefix(prefix)
Implements pandas.DataFrame.add_prefix
| PARAMETER | DESCRIPTION |
|---|---|
| prefix | Prefix to be added |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Copy of dataframe where prefix is added to all column names |
add_suffix(suffix)
Implements pandas.DataFrame.add_suffix
| PARAMETER | DESCRIPTION |
|---|---|
| suffix | Suffix to be added |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Copy of dataframe where suffix is added to all column names |
append(other, ignore_index=True)
Implements pandas.DataFrame.append by calling crandas.concat accordingly
| PARAMETER | DESCRIPTION |
|---|---|
| other | The data to append. |
| ignore_index | If True, the resulting axis will be labeled 0, 1, …, n - 1. Currently only True is allowed; by default True |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | The concatenated table |
assign(query_args=None, **assignments)
Implements pandas.DataFrame.assign. Assigns new columns to a DataFrame,
and outputs a new DataFrame with the new columns.
Assigned values need to be CSeries (or a callable providing a CSeries); assignment of clear Series/arrays is not supported.
To pass query arguments such as the target table name, specify them as values of the query_args dict, e.g.:
cdf.assign(newcol=1, query_args={"name": "tablename"})
To create a column whose name could be an engine query argument, add a query_args argument to disambiguate, e.g.:
cdf.assign(name="Column value", query_args={})
Avoid passing a generic dictionary like cdf.assign(**assignments) as
this is ambiguous when assignments contains a column that is also a
query_arg. Use crandas.crandas.DataFrame.assign_dict for this instead.
assign_dict(assignments, *, query_args=None, inplace=False)
Assigns new columns to a DataFrame, and outputs a new DataFrame with the
new columns or updates the current DataFrame inplace. Similar to .assign(),
but takes the assignments as a dict instead of as keywords and allows for inplace operation.
| PARAMETER | DESCRIPTION |
|---|---|
| assignments | Dictionary containing the assignments. |
| query_args | See Query Arguments. Note that this is taken as a dictionary. |
| inplace | Boolean indicating whether the operation should be performed inplace. When False, a new DataFrame is returned |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | The updated dataframe. |
astype(ctype=None, *, schema=None, validate=False, **query_args)
Converts DataFrame to specified ctype or schema
If a ctype is given, convert the respective columns to the respective types, keeping other columns intact. If a schema is given, ensure that the resulting table conforms to the schema in terms of columns, orders, and types (dropping columns from the table if they do not occur in the schema).
| PARAMETER | DESCRIPTION |
|---|---|
| ctype | ctypes for columns; see Ctype Schemas, by default None |
| schema | Schema for uploaded data; see Ctype Schemas, by default None |
| validate | Whether to validate the conversion, by default False |
| query_args | See Query Arguments |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Dataframe with the set ctypes and columns |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | Provide ctype or schema, but not both |
| ValueError | Need to specify ctype or schema |
corr(method='pearson', min_periods=None, numeric_only=False, dropna=False)
Compute pairwise correlation of columns. This process computes the Pearson correlation coefficient, which is a measure of linear correlation. For more information, see https://en.wikipedia.org/wiki/Pearson_correlation_coefficient.
| PARAMETER | DESCRIPTION |
|---|---|
| method | Method used to compute the correlation. Currently only "pearson" is supported. |
| min_periods | Minimum number of observations needed to have a valid result. |
| numeric_only | if |
| dropna | if |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Dataframe of the correlation matrix of numeric columns. |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | At least two numeric columns are needed to compute correlation. |
describe()
Generate descriptive statistics
drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=True)
Remove duplicate rows from DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
| subset | If a string iterable is given, use these as the names of the columns for identifying duplicates. If a string is given, use this as the name of the single column to identify duplicates. If no subset is given, all columns are used to identify duplicates. |
| keep | Determines which duplicates to keep. 'first': only keep the first occurrence. 'last': only keep the last occurrence. 'any': keep a single random occurrence (this option is more efficient). False: remove all occurrences. |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Dataframe with the duplicates removed |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | keep must be either "first", "last", "any" or False |
| ValueError | Subset should be either None, or consist of column names of the table. |
dropna(axis=0, subset=None, **query_args)
Remove null values from a DataFrame.
The axis determines whether rows (0) or columns (1) are removed. Only row removal (axis 0) is implemented.
Returns a DataFrame with no nullable columns or values.
| PARAMETER | DESCRIPTION |
|---|---|
| axis | Determines whether rows (0) or columns (1) are deleted, by default 0. Only row removal is allowed |
| subset | When present, remove the null values only from the provided columns. |
| query_args | See Query Arguments |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | The original dataframe with all rows or columns (depending on axis) that contain null values removed |
| RAISES | DESCRIPTION |
|---|---|
| NotImplementedError | Dropping columns is not implemented for structural reasons |
| ValueError | Error whenever a wrong axis is entered |
fillna(value, **query_args)
Fill NULL values
Replace NULL values of nullable columns as specified by the value
argument. The resulting columns are not nullable anymore.
If value is a valid argument to crandas.crandas.CSeries.fillna (e.g., an
integer, a column, or a function), then this argument is applied to fill
in NULLs in all nullable columns of the table (and so value needs to
be of the correct type for all nullable columns).
If value is a dict, then the respective dictionary values are
provided (as above) to fill in NULLs for the respective column.
If value is a DataFrame, then the respective columns of value
are used to fill in NULLs for the corresponding columns of self.
This is done only for columns that occur in both dataframes.
| PARAMETER | DESCRIPTION |
|---|---|
| value | Values to fill in for NULLs (see above) |
| query_args | See Query Arguments |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Copy of table with NULLs replaced as indicated |
filter(key, threshold=None, **query_args)
Filter table
Returns table with all rows of the original table satisfying the criterion
represented by key.
key can be a CSeries representing a table column or a computation on table
row(s). In this case, the CSeries values need to be 1 (indicating that the
corresponding row will be selected) or 0 (indicating that the row will not
be selected).
If the CSeries used for indexing has a threshold (see CSeries.with_threshold),
the filtered result is only returned if it has the minimum number of rows as
indicated by the threshold.
Alternatively, key can be a function to be applied to the table columns.
The function is called with one argument representing the table, of which
the fields correspond to the columns. E.g., the key lambda x: x.col1==1
represents the function that checks whether the value of column with name
col1 equals one.
See function_to_json for more information.
| PARAMETER | DESCRIPTION |
|---|---|
| key | Filter criterion |
| threshold | If given, sets a minimum number of rows that the resulting table needs to have; otherwise, the server returns an error. Equivalent to filtering with a key that carries this threshold (see CSeries.with_threshold) |
| query_args | See Query Arguments |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | The filtered table |
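The effect of a threshold on filtering can be sketched on cleartext rows. The helper below is a hypothetical plain-Python illustration, not the engine's implementation; in particular, the ValueError stands in for the error returned by the server.

```python
def filter_with_threshold(rows, key, threshold=None):
    """Select rows whose key value is 1; refuse results below the threshold."""
    selected = [row for row, keep in zip(rows, key) if keep == 1]
    if threshold is not None and len(selected) < threshold:
        # Stand-in for the server-side error described above.
        raise ValueError("filtered result has fewer rows than the threshold")
    return selected

ages = [17, 34, 52, 29, 61]
over_30 = [1 if age > 30 else 0 for age in ages]

allowed = filter_with_threshold(ages, over_30, threshold=3)

# Asking for at least 4 matching rows fails, so nothing is revealed.
try:
    filter_with_threshold(ages, over_30, threshold=4)
    refused = False
except ValueError:
    refused = True
```

The point of the threshold is that a result covering too few rows is never returned, which limits what can be learned about individual records.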
groupby(cols, *, dropna=True, bypass_same_value_check=False, threshold=None, **query_args)
Computes a grouping of the table by the values of (a) given column(s). Returns a grouping object that can be used in aggregation (see CSeriesGroupBy) or as an argument to crandas.merge().
| PARAMETER | DESCRIPTION |
|---|---|
cols
|
If a string iterable is given, use these as the name of the columns
to group by. If a string is given, use this as the name of the
single column to group by. If a
TYPE:
|
dropna
|
If True, any rows with null values will be dropped. If False, null values will be treated as a separate key in groups. Currently, only False is supported when using nullable columns.
DEFAULT:
|
bypass_same_value_check
|
Unless set, if the given cols is a
DEFAULT:
|
threshold
|
If given, only succeed as long as all groupings have at least this many elements.
TYPE:
|
query_args
|
See Query Arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrameGroupBy
|
Grouping object |
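As a rough illustration of the dropna and threshold arguments, the following plain-Python sketch groups rows locally. The helper group_rows and the RuntimeError it raises are assumptions for illustration only; the actual grouping is computed by the engine and returns a grouping object rather than a dictionary.

```python
from collections import defaultdict


def group_rows(rows, col, dropna=True, threshold=None):
    """Local model of grouping with dropna and threshold (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        key = row[col]
        if key is None and dropna:
            continue  # dropna=True: rows with a null key are dropped
        groups[key].append(row)  # dropna=False: None acts as a separate key
    if threshold is not None:
        too_small = [k for k, members in groups.items() if len(members) < threshold]
        if too_small:
            raise RuntimeError("some groups have fewer elements than the threshold")
    return dict(groups)


rows = [
    {"city": "A", "v": 1},
    {"city": "A", "v": 2},
    {"city": None, "v": 3},
    {"city": "B", "v": 4},
]
dropped = group_rows(rows, "city")
kept = group_rows(rows, "city", dropna=False)
```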
max(axis=0)
Computes the maximum of each (numeric) column
| PARAMETER | DESCRIPTION |
|---|---|
axis
|
Axis of the dataframe to compute over; only 0 is implemented (default 0)
TYPE:
|
min(axis=0)
Computes the minimum of each (numeric) column
| PARAMETER | DESCRIPTION |
|---|---|
axis
|
Axis of the dataframe to compute over; only 0 is implemented (default 0)
TYPE:
|
open(*, limit=None, offset=None, **query_args)
Returns table in opened form
| PARAMETER | DESCRIPTION |
|---|---|
limit
|
Limit to the number of opened rows. The number of returned rows will be the minimum of the number of rows remaining and the provided limit.
TYPE:
|
offset
|
Offset of the opened rows. Start returning rows from the provided 0-based index.
TYPE:
|
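The interplay of limit and offset can be pictured with ordinary list slicing. This is a local sketch under the assumption that rows are held in a Python list; open_rows is a hypothetical helper, not part of crandas.

```python
def open_rows(rows, limit=None, offset=None):
    """Local model of limit/offset paging (hypothetical helper)."""
    start = 0 if offset is None else offset  # 0-based index of the first row
    remaining = rows[start:]
    if limit is None:
        return remaining
    # Returns min(len(remaining), limit) rows
    return remaining[:limit]


rows = list(range(10))
page = open_rows(rows, limit=3, offset=2)
last_page = open_rows(rows, limit=3, offset=8)
```

Here last_page contains only two rows, because only two rows remain after the offset.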
project(cols, **query_args)
Project table
Returns table with same rows but a selection of columns
| PARAMETER | DESCRIPTION |
|---|---|
cols
|
Columns to select. Can be empty. Columns can occur multiple times.
TYPE:
|
query_args
|
See Query Arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
The projected table |
rename(columns, **query_args)
Implements pandas.DataFrame.rename
Only renaming of columns via columns argument is supported.
| PARAMETER | DESCRIPTION |
|---|---|
columns
|
dictionary of columns to be renamed of the form {"oldname": "newname"}
TYPE:
|
query_args
|
See Query Arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
Dataframe with updated column names |
sample(*, n=None, frac=None, random_state=None, **query_args)
Samples rows from the dataframe.
The number of rows can be specified either as an integer n or a
fraction frac. The case frac==1 corresponds to returning a shuffling
of the table and is equivalent to crandas.crandas.DataFrame.shuffle.
If a random_state is given, the sampling is deterministic and follows
a public selection (i.e., known to the servers and predictable to the
client); otherwise, the sampling is non-deterministic and private (known
to neither the client nor the servers).
See also crandas.crandas.DataFrame.shuffle.
| PARAMETER | DESCRIPTION |
|---|---|
n
|
Number of rows to sample
DEFAULT:
|
frac
|
Proportion of rows (between 0.0 and 1.0, inclusive) to sample
DEFAULT:
|
random_state
|
Seed for deterministic sampling (otherwise is non-deterministic)
TYPE:
|
query_args
|
See Query Arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
Copy of the table with rows sampled |
set_axis(labels, *, axis=0, copy=True, **query_args)
Implements pandas.DataFrame.set_axis
Set column names to specified list.
For consistency with pandas, axis=1 or axis='columns' needs to be
explicitly specified.
The 'copy' argument is ignored.
| PARAMETER | DESCRIPTION |
|---|---|
labels
|
List of column names
TYPE:
|
axis
|
Axis to update; needs to be set to 1 or
TYPE:
|
query_args
|
See Query Arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
Dataframe with updated column names |
shuffle(*, random_state=None, **query_args)
Returns the table with its rows shuffled. If a random_state is given, the
shuffle is deterministic and performed according to a public permutation
(i.e., known to the servers and predictable to the client); otherwise,
the shuffle is non-deterministic and private (known to neither the client
nor the servers).
| PARAMETER | DESCRIPTION |
|---|---|
random_state
|
Seed for deterministic shuffle (otherwise is non-deterministic)
TYPE:
|
query_args
|
See Query Arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
Copy of the table with rows shuffled |
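The difference between a seeded and an unseeded shuffle can be sketched locally with the Python standard library. This is only an analogy: shuffle_rows is a hypothetical helper, and the permutation the engine derives from a given random_state is not claimed to match the one Python's random module produces.

```python
import random


def shuffle_rows(rows, random_state=None):
    """Local analogy for seeded vs. unseeded shuffling (hypothetical helper)."""
    shuffled = list(rows)  # work on a copy, as the method returns a copy
    if random_state is None:
        # No seed: the permutation is not reproducible
        random.SystemRandom().shuffle(shuffled)
    else:
        # Seed given: the same seed always yields the same permutation
        random.Random(random_state).shuffle(shuffled)
    return shuffled


rows = list(range(10))
first = shuffle_rows(rows, random_state=42)
second = shuffle_rows(rows, random_state=42)
unseeded = shuffle_rows(rows)
```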
slice(key, allow_fractions=False, **query_args)
Slice table
Returns table with same columns but a selection of rows
| PARAMETER | DESCRIPTION |
|---|---|
key
|
Python slice object representing rows to select
TYPE:
|
allow_fractions
|
If set to True, the
TYPE:
|
query_args
|
See Query Arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
The sliced table |
sort_values(by, *, ascending=True, **query_args)
Sorts the dataframe according to the values in the column by.
Currently, sorting on strings is not supported.
| PARAMETER | DESCRIPTION |
|---|---|
by
|
The column to sort on
TYPE:
|
ascending
|
Sort in ascending order if True, in descending order if False.
TYPE:
|
query_args
|
See Query Arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
Copy of the sorted table |
validate(*validations, **query_args)
Applies input validation to the table.
Input validation leads to a table that has the validations as constraints on the respective columns (e.g., checking that a column contains values in [0,2] leads to a column with values constrained to that domain). These constraints can be inspected by accessing tab.columns.cols[i].constraints.
Validations are instances of the crandas.crandas.Validation class and can be set by
calling validation functions such as in_range() and
sum_in_range().
| PARAMETER | DESCRIPTION |
|---|---|
*validations
|
Validations to apply to the table
TYPE:
|
query_args
|
See Query Arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
If all validations have succeeded: copy of the table having the validations as constraints |
vdl_query_inplace(cmd, expected, *, inplace=False, session=None, **query_args)
Wrapper around vdl_query allowing for inplace operations
| PARAMETER | DESCRIPTION |
|---|---|
cmd
|
See
TYPE:
|
expected
|
Expected columns of the result.
TYPE:
|
inplace
|
Boolean indicating whether an inplace update should be performed.
When True, self will contain the result. When False, a new
TYPE:
|
session
|
Session object to use. If
TYPE:
|
**query_args
|
See Query Arguments.
TYPE:
|
DataSpuriousColumnsWarning(*, columns, **kwargs)
ExpectDropResult(*, expected_len, **kwargs)
Bases: ResponseHandler
Response handler for drop response with given expected length
ReturnValue()
Bases: StateObject, CSeries
Represent a value or series of values computed by the engine
Various engine commands, e.g., .sum(), return values or series
of values, as opposed to returning a DataFrame. This class is the analogue
of DataFrame that is used to represent such remote values.
A ReturnValue can be used as a CSeries, making it possible e.g.
to filter on a value computed by the engine without having to open it. For
example, the following filters all maximum elements without revealing the
maximum: tab[tab["col"]==tab["col"].max(mode="regular")].
To obtain the value/series in the clear, call: crandas.crandas.ReturnValue.open.
This returns a single value, unless .is_series is set, in which case it
returns a Pandas series, which needs to have .num_rows rows if set.
Validation(table, col, json_desc)
Represents a validation that can be applied to a column.
Returned by functions like in_range(), etc. Used as an argument
to .validate.
json_desc can contain a combination of the following keys and values:
- bounds ([int string, int string]): the lower and upper bounds of the data, represented as strings for arbitrary precision
- sum_bounds ([int string, int string]): the lower and upper bounds of the sum of an array entry, represented as strings for arbitrary precision
- precision (int): fixed-point precision
- is_array (bool): boolean representing whether the column is an array
If the column is of type int, it can have the following keys: bounds, sum_bounds, is_array
If the column is of type fixed point, it can have the following keys: bounds, is_array, precision
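The key rules above can be checked mechanically. The sketch below is a local illustration only: the type labels "int" and "fixed_point", the helper check_json_desc, and the sample values are assumptions, and crandas itself builds these descriptions through functions such as in_range().

```python
# Keys permitted per column type, as listed above.
# The labels "int" and "fixed_point" are illustrative names, not crandas identifiers.
ALLOWED_KEYS = {
    "int": {"bounds", "sum_bounds", "is_array"},
    "fixed_point": {"bounds", "is_array", "precision"},
}


def check_json_desc(column_type, json_desc):
    """Return True if json_desc only uses keys allowed for column_type."""
    unexpected = set(json_desc) - ALLOWED_KEYS[column_type]
    if unexpected:
        raise ValueError(f"keys not allowed for {column_type}: {sorted(unexpected)}")
    return True


# Bounds are written as strings so that arbitrarily large integers can be represented
int_desc = {"bounds": ["0", "2"], "is_array": False}
ok = check_json_desc("int", int_desc)
```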
Series(*args, ctype=None, **kwargs)
choose_threshold(obj_threshold, arg_threshold)
Choose the threshold, where obj_threshold is given using the
with_threshold() function (and hence set in the DataFrame or CSeries
itself), and arg_threshold is specified as an argument to the
aggregation function or filter() function.
Returns the threshold or None.
col(name)
Expression representing a column of a DataFrame, similar to the polars function
concat(tables_, *, ignore_index=True, axis=0, join='outer', **query_args)
Table concatenation.
Performs horizontal/vertical concatenation of tables, modelled on pandas
pd.concat. Currently, only inner joins are supported for vertical concatenation.
The first table defines the set of columns that the
resulting table has. If join="inner", only columns common to all tables
are included. Otherwise, the remaining tables need to have the same set of
columns as the first table (up to ordering); if they do not, an error is returned.
| PARAMETER | DESCRIPTION |
|---|---|
tables_
|
One or more DataFrames to be concatenated
TYPE:
|
ignore_index
|
does nothing, but is used in crandas.append, by default True
TYPE:
|
axis
|
Concatenation axis, 0=vertical, 1=horizontal, by default 0
TYPE:
|
join
|
type of join (currently only inner join is supported for vertical join), by default "outer"
TYPE:
|
query_args
|
See Query Arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
mode-dependent return table representing vertical/horizontal join |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
Received wrong inputs |
NotImplementedError
|
Limited vertical concatenation is allowed, there must be a matching column on both tables to be concatenated |
ValueError
|
Limited vertical concatenation is allowed, number of columns should be the same in all tables |
RuntimeError
|
Horizontal join would create table with duplicate column names |
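The column rules for vertical concatenation described above can be sketched in plain Python. This is a local model, assuming each table is a dictionary mapping column names to lists of values; concat_vertical is a hypothetical helper, and the error type is chosen for illustration.

```python
def concat_vertical(tables, join="outer"):
    """Local model of the column rules for vertical concatenation."""
    first = tables[0]
    if join == "inner":
        # Keep only the columns that occur in every table
        columns = [c for c in first if all(c in t for t in tables[1:])]
    else:
        # Every table needs the same set of columns as the first (order may differ)
        for table in tables[1:]:
            if set(table) != set(first):
                raise ValueError("all tables need the same columns as the first table")
        columns = list(first)
    # The first table determines the column order of the result
    return {c: [v for t in tables for v in t[c]] for c in columns}


a = {"x": [1], "y": [2]}
b = {"y": [4], "x": [3]}
c = {"x": [5], "z": [6]}
same_columns = concat_vertical([a, b])
common_only = concat_vertical([a, c], join="inner")
```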
cut(series, bins, *, labels, right=True, add_inf=False)
Bin values into discrete intervals (aka quantization)
Bins values into discrete intervals, à la pandas.cut. Quantizes the series
into bins [bins[0],bins[1]), [bins[1],bins[2]), etc., and returns the
corresponding bin labels (so labels[0] for bin [bins[0],bins[1]),
labels[1] for bin [bins[1],bins[2]), etc.). The bins include the left edge
and exclude the right edge.
The first bin should have -np.inf as its left edge and the last bin should
have np.inf as its right edge. If the argument add_inf is set to
True, these edges are automatically added and do not need to be given
as arguments.
The bins and labels can be given in the plain (e.g.,
cd.cut(cdf["col"], [-np.inf, 0, 10, np.inf], labels=[1, 2, 3])), or as columns
providing respective bins and labels for the respective input rows
(e.g., cd.cut(cdf["col"], cdf["bins"], labels=cdf["labels"])). In the latter
case, the argument add_inf=True should be given.
| PARAMETER | DESCRIPTION |
|---|---|
series
|
series to apply quantization to
TYPE:
|
bins
|
list of integers or int_vec column defining the bin edges
TYPE:
|
labels
|
list of integers or int_vec column defining the bin labels (one more element than
TYPE:
|
right
|
specifies whether bins include their right edges
TYPE:
|
add_inf
|
when set to
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
CSeriesFun
|
representing the result of the quantization |
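The left-inclusive binning rule described above can be reproduced locally with the standard library. This sketch models only plain (non-column) bins and labels on finite values; cut_values is a hypothetical helper and does not model the right argument.

```python
import math
from bisect import bisect_right


def cut_values(values, bins, labels, add_inf=False):
    """Local model of left-inclusive binning for finite values."""
    edges = list(bins)
    if add_inf:
        # Add the outer edges automatically
        edges = [-math.inf] + edges + [math.inf]
    if len(labels) != len(edges) - 1:
        raise ValueError("need exactly one label per bin")
    # bisect_right(edges, v) - 1 is the index i with edges[i] <= v < edges[i + 1]
    return [labels[bisect_right(edges, v) - 1] for v in values]


values = [-5, 0, 5, 10, 99]
explicit = cut_values(values, [-math.inf, 0, 10, math.inf], [1, 2, 3])
automatic = cut_values(values, [0, 10], [1, 2, 3], add_inf=True)
```

In this model the value 0 falls in the second bin and 10 in the third, since each bin includes its left edge and excludes its right edge.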
demo_table(number_of_rows=1, number_of_columns=1, **query_args)
Create demo table.
Creates a demo table with the given number of rows and columns. The columns are respectively named "col1", "col2", ... and have sequential integer values 1, 2, ...
A nonce is included in the command so that every time this command is called, it receives a fresh table handle.
| PARAMETER | DESCRIPTION |
|---|---|
number_of_rows
|
Number of rows of resulting table, by default 1
TYPE:
|
number_of_columns
|
Number of columns of resulting table, by default 1
TYPE:
|
query_args
|
See Query Arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
A demo table with a fresh name |
get_table(handle_or_name=None, /, *, dummy_handle=None, prod_handle=None, schema=None, check=True, map_dummy_handles=None, session=None, **query_args)
Access a previously uploaded table by its handle or name.
The previously uploaded table is specified using the handle_or_name
argument. Alternatively, both dummy_handle and prod_handle can be
specified. In that case, no changes have to be made between recording and
executing a script.
When get_table is called with a schema argument and/or from a
recorded script, the retrieved table is checked against the schema that
was specified or that was used when recording. If this schema check fails,
a ValueError is raised. This check can be disabled by passing
check=False.
Note that a name argument, if given, is interpreted as being part of the
standard query arguments, and is thus interpreted as a target name for
the result of the get query. Accordingly, get_table("a", name="b") can be
used to assign the (additional) symbolic name "b" to the table with name
or handle "a".
| PARAMETER | DESCRIPTION |
|---|---|
handle_or_name
|
Handle (hex-encoded string) or name. Interpreted as a handle if it is a string of 64 uppercase hexadecimal characters, otherwise as a name.
TYPE:
|
dummy_handle
|
Handle (hex-encoded string) to be used in design. Used when either recording a script or when no script is active.
TYPE:
|
prod_handle
|
Handle (hex-encoded string) to be used in production. Used when executing an approved script.
TYPE:
|
schema
|
Represents the structure of the table to be added. Needed if get_table is called from a Transaction, or if it is desired to check that the table corresponds to the given schema, by default None. The schema can be specified as:
TYPE:
|
check
|
Enables server-side validation of the object type and schema (if given). In scripts, if
TYPE:
|
map_dummy_handles
|
Whenever a script is being recorded (see This behavior can be overridden in two levels: for the entire script or
for a single call to
TYPE:
|
query_args
|
See Query Arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
The table with handle |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
The schema check of the retrieved table failed |
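The rule for telling a handle from a name, as stated for handle_or_name, can be expressed as a regular expression. This is a local sketch; classify is a hypothetical helper and not the check crandas performs internally.

```python
import re

# 64 uppercase hexadecimal characters, per the handle_or_name description
_HANDLE_RE = re.compile(r"[0-9A-F]{64}")


def classify(handle_or_name):
    """Return "handle" or "name" according to the documented rule."""
    return "handle" if _HANDLE_RE.fullmatch(handle_or_name) else "name"


kinds = [classify("AB" * 32), classify("ab" * 32), classify("my_table")]
```

Only the first string counts as a handle in this model; the lowercase hexadecimal string and the ordinary name are both treated as names.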
merge(left, right, how='inner', on=None, left_on=None, right_on=None, validate=None, suffixes=('_x', '_y'), session=None, **query_args)
Merge tables using a database-style join. Implements pandas.merge.
The following types of merge are supported:
- inner join: returns only the rows where the join columns match
- outer join: returns rows from both tables, matched where possible, starting with rows of left table in their original order
- left join: return rows of left table in original order, matched with a row of the right table where possible
- right join: return rows of right table, matched with a row of the left table where possible
Columns to join on are given either by a common on argument, or separate
left_on and right_on arguments for the left and right tables.
Depending on the arguments passed, the join columns can contain duplicates:
- if on/left_on/right_on are column name(s), a one-to-one merge is performed, and an error is given if the merge columns contain duplicates;
- if a .groupby() result is given as left_on argument, a many-to-one merge is performed where the left table can have duplicates. In this case, the servers learn the number of unique matching keys;
- if a .groupby() result is given as right_on argument, a one-to-many merge is performed where the right table can have duplicates. In this case, the servers learn the number of unique matching keys;
- many-to-many merges (where both tables contain duplicate values for the merge columns) are only supported with some leakage of information about the underlying data; see crandas.caution.merge_m2m().
In contrast to pandas, keys with null values do not match with each other and are not included in the result.
| PARAMETER | DESCRIPTION |
|---|---|
left
|
Left table to be joined
TYPE:
|
right
|
Right table to be joined
TYPE:
|
how
|
Type of join
TYPE:
|
on
|
Column(s) to join on; must be common to both tables
TYPE:
|
left_on
|
Column(s) of the left table to join on
TYPE:
|
right_on
|
Column(s) of the right table to join on, by default None
TYPE:
|
validate
|
Can be
TYPE:
|
query_args
|
engine query arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
Result of the merging operation |
| RAISES | DESCRIPTION |
|---|---|
MergeError
|
Values of the join columns are not unique |
ValueError
|
Incorrect combination of arguments |
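Two of the rules above, that a one-to-one merge rejects duplicate keys and that null keys never match, can be sketched for the inner join in plain Python. This is a local model only: the rows are dictionaries, merge_one_to_one is a hypothetical helper, and the MergeError class defined here is a stand-in rather than the crandas exception.

```python
class MergeError(Exception):
    """Local stand-in for the error raised on duplicate join keys."""


def merge_one_to_one(left, right, on):
    """Local model of a one-to-one inner join on a single column."""

    def index_by_key(rows):
        by_key = {}
        for row in rows:
            key = row[on]
            if key is None:
                continue  # null keys do not match anything, including each other
            if key in by_key:
                raise MergeError(f"duplicate join key: {key!r}")
            by_key[key] = row
        return by_key

    index_by_key(left)  # called only to check that the left keys are unique
    right_by_key = index_by_key(right)
    result = []
    for row in left:
        match = right_by_key.get(row[on]) if row[on] is not None else None
        if match is not None:
            result.append({**row, **match})
    return result


left = [{"id": 1, "a": "x"}, {"id": None, "a": "y"}, {"id": 3, "a": "z"}]
right = [{"id": 1, "b": 10}, {"id": None, "b": 20}, {"id": 2, "b": 30}]
joined = merge_one_to_one(left, right, on="id")
```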
pandas_dataframe_schema(df, ctype=None, auto_bounds=None, schema=None)
Determine schema for pandas DataFrame
Tries to encode the given data, and returns schema of crandas DataFrame that
would result from calling upload_pandas_dataframe() with
the passed ctype and schema. See Ctype Schemas.
| PARAMETER | DESCRIPTION |
|---|---|
df
|
DataFrame from which to generate the schema
TYPE:
|
ctype
|
DEFAULT:
|
auto_bounds
|
DEFAULT:
|
schema
|
DEFAULT:
|
| RETURNS | DESCRIPTION |
|---|---|
dict
|
schema; see Ctype Schemas |
read_csv(file_name, **kwargs)
Upload the given CSV file to the engine
Internally calls pd.read_csv(file_name, ...) to read the CSV file and calls
upload_pandas_dataframe() on the resulting DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
file_name
|
name of the file
TYPE:
|
**kwargs
|
DEFAULT:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
uploaded table |
read_csv_schema(file_name, ctype=None, auto_bounds=None, schema=None, **kwargs)
Determine schema for CSV file
Tries to load the CSV and encode it for use by crandas. Returns schema of crandas DataFrame that would result from calling crandas.crandas.read_csv. See Ctype Schemas.
| PARAMETER | DESCRIPTION |
|---|---|
file_name
|
argument to
TYPE:
|
ctype
|
TYPE:
|
auto_bounds
|
TYPE:
|
schema
|
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
dict
|
schema; see Ctype Schemas |
read_parquet(file_name, **kwargs)
Upload the given Apache Parquet file to the engine
If file_name is a pyarrow.Table, calls file_name.to_pandas(...)
and then applies upload_pandas_dataframe() to the result. Otherwise,
call ParquetFile(file_name, ...), and then apply
crandas.crandas.upload_streaming_file to the result. Keyword arguments **kwargs
can be provided for any of these called functions.
| PARAMETER | DESCRIPTION |
|---|---|
file_name
|
name of the file or path to the file to be read
TYPE:
|
**kwargs
|
DEFAULT:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
uploaded table |
remove_objects(objects, **query_args)
Remove objects from server
If the list of objects contains a non-existent object, an error is raised, but all other objects in the list are still removed.
| PARAMETER | DESCRIPTION |
|---|---|
objects
|
Objects to be removed
|
query_args
|
See Query Arguments
TYPE:
|
| RAISES | DESCRIPTION |
|---|---|
EngineError
|
Some of the given objects did not exist |
series_max(col1, col2)
series_min(col1, col2)
upload_pandas_dataframe(df, ctype=None, *, auto_bounds=None, schema=None, **query_args)
Uploads an existing pandas DataFrame into the engine
| PARAMETER | DESCRIPTION |
|---|---|
df
|
DataFrame to upload
TYPE:
|
ctype
|
ctypes for columns; see Ctype Schemas
TYPE:
|
auto_bounds
|
See Query Arguments
TYPE:
|
schema
|
schema for uploaded data; see Ctype Schemas
DEFAULT:
|
**query_args
|
see Query Arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
the uploaded DataFrame |
upload_streaming_file(file_reader, file_type, ctype=None, session=None, _keep=True, *, auto_bounds=None, schema=None, **query_args)
Uploads a file into the engine. Instead of loading the entire file, a file reader is used to load the table columnwise when uploading. Each column is deleted after it has been uploaded, so considerably less memory is used. Currently, this only works for Parquet files.
| PARAMETER | DESCRIPTION |
|---|---|
file_reader
|
file to upload
TYPE:
|
file_type
|
a string describing the type of file of file_reader
TYPE:
|
ctype
|
explicitly given types for columns
DEFAULT:
|
auto_bounds
|
See Query Arguments
TYPE:
|
schema
|
schema for uploaded data; see Ctype Schemas
DEFAULT:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
the uploaded DataFrame |
vec_from_columns(series)
Combine columns into a single vector column
Combines the multiple numeric columns into a single vector column containing the input columns. For example:
tab = cd.DataFrame({"a": [1, 2], "b": [3, 4]})
tab["vec"] = cd.vec_from_columns([tab["a"], tab["b"]])
tab["vec"].open() == [[1, 3], [2, 4]]
tab = cd.DataFrame({"a": [1.1, 2.1], "b": [3.2, 4.2]})
tab["vec"] = cd.vec_from_columns([tab["a"], tab["b"]])
tab["vec"].open() == [[1.1, 3.2], [2.1, 4.2]]
tab = cd.DataFrame({"a": [1.1, 2.1], "b": [3, 4]})
tab["vec"] = cd.vec_from_columns([tab["a"], tab["b"]])
tab["vec"].open() == [[1.1, 3.0], [2.1, 4.0]]
| PARAMETER | DESCRIPTION |
|---|---|
series
|
List of non-vector numeric series.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
CSeriesFun
|
Series containing the vector column. |
when(*predicates, **constraints)
Polars-style when-then-otherwise expression
The expression cd.when(<condition>).then(<value-if-condition>).otherwise(<value-otherwise>)
returns a CSeries that can be used for assigning, filtering, etc., providing a
convenient way to select values.
For example: cdf = cdf.assign(val = cd.when(cdf["x"]>0).then(cdf["x"]).otherwise(-cdf["x"]))
selects cdf["x"] if cdf["x"]>0 and -cdf["x"] otherwise, i.e., computes the absolute
value.
If multiple predicates are given, they all need to be satisfied. Constraints represent equality
checks, e.g., cd.when(x=0) is equivalent to cd.when(x==0).
Multiple .when().then() statements can be chained, optionally followed by an
.otherwise(<value if all conditions are false>) statement. For example, consider
cd.when(<condition1>).then(<value1>).when(<condition2>).then(<value2>).otherwise(<value3>).
In this case, if <condition1> holds, then <value1> is selected. If <condition1>
does not hold but <condition2> does, then <value2> is selected. If neither condition
holds, <value3> is selected.
If no .otherwise(...) is given, None is assumed.
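The evaluation order of chained .when().then() branches can be modelled row by row in plain Python. This is a local sketch of the selection logic only: when_then_otherwise is a hypothetical helper that takes the branches as a list, whereas crandas builds the expression with method chaining and evaluates it on the engine.

```python
def when_then_otherwise(branches, otherwise=None):
    """Local model: the first branch whose condition holds supplies the value."""

    def evaluate(row):
        for condition, value in branches:
            if condition(row):
                return value(row)
        # No condition held: use the otherwise value, or None if it was omitted
        return otherwise(row) if otherwise is not None else None

    return evaluate


# Absolute value: x if x > 0, otherwise -x
absolute = when_then_otherwise(
    [(lambda r: r["x"] > 0, lambda r: r["x"])],
    otherwise=lambda r: -r["x"],
)
# Two chained branches without an otherwise clause
sign = when_then_otherwise(
    [(lambda r: r["x"] > 0, lambda r: 1), (lambda r: r["x"] < 0, lambda r: -1)]
)
abs_values = [absolute({"x": v}) for v in (-2, 0, 3)]
sign_values = [sign({"x": v}) for v in (-2, 0, 3)]
```

For the row with x equal to 0, neither branch of sign applies, so the model returns None, mirroring the rule for a missing .otherwise(...).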