crandas.crandas¶
Main crandas functionality: dataframes (CDataFrame), series (CSeries), and analysis operations (e.g., merge())
- class crandas.crandas.CDataFrame(columns, nrows=None, session=None, **kwargs)¶
Bases:
StateObject
Dataframe stored in the VDL. CDataFrame provides access to tables stored in the VDL using an API modeled on pandas `DataFrame`s. A CDataFrame may be obtained in one of the following ways:
By uploading data into the VDL using read_csv() or upload_pandas_dataframe()
By accessing an earlier uploaded table using get_table()
- __getitem__(key)¶
Implements df[key]
If key is a CSeries or a function, call CDataFrame.filter.
If key is a list, call CDataFrame.project
If key is a str, return a CSeries representing the column with the given name
If key is a slice, call CDataFrame.slice
- Raises:
TypeError – the key must be one of the accepted types
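The dispatch rules above can be modelled in plain Python. This is only a sketch, not the crandas implementation: `describe_getitem` and `FakeCSeries` are hypothetical names introduced here, and the function merely reports which operation `df[key]` would be routed to.

```python
class FakeCSeries:
    """Hypothetical stand-in for crandas.CSeries, used only in this sketch."""


def describe_getitem(key):
    """Report which CDataFrame operation df[key] is routed to."""
    if isinstance(key, FakeCSeries) or callable(key):
        return "filter"
    if isinstance(key, list):
        return "project"
    if isinstance(key, str):
        return "column"
    if isinstance(key, slice):
        return "slice"
    # Any other key type is rejected, as documented under "Raises".
    raise TypeError(f"unsupported key type: {type(key).__name__}")


print(describe_getitem(lambda x: x.col1 == 1))  # filter
print(describe_getitem(["a", "b"]))             # project
print(describe_getitem("a"))                    # column
print(describe_getitem(slice(0, 10)))           # slice
```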
- add_prefix(prefix)¶
Implements pandas.DataFrame.add_prefix
- Parameters:
prefix (str) – prefix to be added
- Returns:
Copy of CDataFrame where prefix is added to all column names
- Return type:
- add_suffix(suffix)¶
Implements pandas.DataFrame.add_suffix
- Parameters:
suffix (str) – suffix to be added
- Returns:
Copy of CDataFrame where suffix is added to all column names
- Return type:
- append(other, ignore_index=True)¶
Implements pandas.DataFrame.append by calling crandas.concat accordingly. To be deprecated.
- Parameters:
other (DataFrame or CDataFrame) – The data to append.
ignore_index (bool, optional) – If True, the resulting axis will be labeled 0, 1, …, n - 1. Currently only True is allowed, by default True
- Returns:
The concatenated table
- Return type:
- assign(query_args=None, **assignments)¶
Implements pandas.DataFrame.assign. Assigns new columns to a CDataFrame, and outputs a new CDataFrame with the new columns.
Assigned values need to be CSeries (or callable providing a CSeries), assignment of clear Series/scalar/array not supported.
To pass query arguments such as the target table name, specify them as values of the query_args dict, e.g.:
cdf.assign(newcol=1, query_args={"name": "tablename"})
To create a column whose name could be a VDL query argument, add a query_args argument to disambiguate, e.g.:
cdf.assign(name="Column value", query_args={})
- astype(ctype=None, *, schema=None, validate=False, **query_args)¶
Converts CDataFrame to specified ctype or schema.
If a ctype is given, convert the respective columns to the respective types, keeping other columns intact. If a schema is given, ensure that the resulting table conforms to the schema in terms of columns, orders, and types (dropping columns from the table if they do not occur in the schema).
- Parameters:
ctype (dict or str, optional) – ctypes for columns; see Using ctypes and schemas, by default None
schema (dict, optional) – schema for uploaded data; see Using ctypes and schemas, by default None
validate (bool, optional) – whether to validate the conversion, by default False
query_args (See Query Arguments)
- Returns:
CDataFrame with the set ctypes and columns
- Return type:
- Raises:
ValueError – Provide ctype or schema, but not both
ValueError – Need to specify ctype or schema
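The difference between converting by ctype and by schema can be illustrated with a small plain-Python model. This is a sketch under stated assumptions, not crandas code: `model_astype` is a hypothetical helper, a table is reduced to a dict from column name to type name, type names are treated as opaque strings, and only the dict form of `ctype` is modelled.

```python
def model_astype(columns, ctype=None, schema=None):
    """Model which columns and types result from astype (hypothetical helper)."""
    if ctype is not None and schema is not None:
        raise ValueError("Provide ctype or schema, but not both")
    if ctype is None and schema is None:
        raise ValueError("Need to specify ctype or schema")
    if ctype is not None:
        # Listed columns are converted; all other columns are kept as they are.
        return {name: ctype.get(name, tp) for name, tp in columns.items()}
    # The result follows the schema's columns, order and types; table columns
    # that the schema does not mention are dropped.
    return {name: tp for name, tp in schema.items() if name in columns}


columns = {"a": "int", "b": "int", "c": "varchar"}
print(model_astype(columns, ctype={"b": "int16?"}))
# {'a': 'int', 'b': 'int16?', 'c': 'varchar'}
print(list(model_astype(columns, schema={"c": "varchar", "a": "int"})))
# ['c', 'a']
```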
- property ctype¶
Ctype for the table; see Using ctypes and schemas
- describe()¶
Generate descriptive statistics
- dropna(axis=0, **query_args)¶
Remove null values from a CDataFrame. The axis determines whether rows (0) or columns (1) are removed. Only rows is implemented. Returns a CDataFrame with no nullable columns or values.
- Parameters:
axis (int/string, optional) – determines whether rows (0) or columns (1) are deleted, by default 0. Only row method is allowed
query_args – See Query Arguments
- Returns:
The original CDataFrame with all rows or columns (depending on axis) with a null value removed
- Return type:
- Raises:
NotImplementedError – Dropping columns is not implemented for structural reasons
ValueError – Error whenever a wrong axis is entered
- fillna(value, **query_args)¶
Fill NULL values
Replace NULL values of nullable columns as specified by the value argument. The resulting columns are not nullable anymore.
If value is a valid argument to CSeries.fillna (e.g., an integer, a column, or a function), then this argument is applied to fill in NULLs in all nullable columns of the table (and so value needs to be of the correct type for all nullable columns).
If value is a dict, then the respective dictionary values are provided (as above) to fill in NULLs for the respective column.
If value is a CDataFrame, then the respective columns of value are used to fill in NULLs for the corresponding columns of self. This is done only for columns that occur in both CDataFrames.
- Parameters:
value (value/dict/CDataFrame) – Values to fill in for NULLs (see above)
query_args – See Query Arguments
- Returns:
Copy of table with NULLs replaced as indicated
- Return type:
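The difference between a single fill value and a dict of fill values can be sketched in plain Python. The model below is an illustration only: `model_fillna` is a hypothetical helper, a table is a dict from column name to a list of cells, `None` stands in for NULL, and only constant replacements are modelled (not columns, functions, or a CDataFrame).

```python
def model_fillna(table, value):
    """Replace None cells by a constant, for all columns or per column."""

    def fill(column, replacement):
        return [replacement if cell is None else cell for cell in column]

    if isinstance(value, dict):
        # Only the columns named in the dict are filled; others are left alone.
        return {
            name: fill(column, value[name]) if name in value else list(column)
            for name, column in table.items()
        }
    # A single value is applied to every column.
    return {name: fill(column, value) for name, column in table.items()}


table = {"a": [1, None, 3], "b": [None, 5, 6]}
print(model_fillna(table, 0))          # {'a': [1, 0, 3], 'b': [0, 5, 6]}
print(model_fillna(table, {"a": -1}))  # {'a': [1, -1, 3], 'b': [None, 5, 6]}
```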
- filter(key, threshold=None, **query_args)¶
Filter table
Returns table with all rows of the original table satisfying the criterion represented by key.
key can be a CSeries representing a table column or a computation on table row(s). In this case, the CSeries values need to be 1 (indicating that the corresponding row will be selected) or 0 (indicating that the row will not be selected).
If the CSeries used for indexing has a threshold (see CSeries.with_threshold), the filtered result is only returned if it has the minimum number of rows as indicated by the threshold.
Alternatively, key can be a function to be applied to the table columns. The function is called with one argument representing the table, of which the fields correspond to the columns. E.g., the key lambda x: x.col1==1 represents the function that checks whether the value of the column with name col1 equals one.
See function_to_json for more information.
- Parameters:
key (CSeries or callable) – Filter criterion
threshold (int, optional, default: None) – If given, sets a minimum number of rows that the resulting table needs to have; otherwise, the server returns an error. Equivalent to calling filter with key.with_threshold(threshold)
query_args – See Query Arguments
- Returns:
The filtered table
- Return type:
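The interaction between the 0/1 selector and the threshold can be modelled locally. The sketch below is plain Python, not crandas: `model_filter` is a hypothetical helper, rows are dicts, and a local `RuntimeError` stands in for the error that the server returns when too few rows remain.

```python
def model_filter(rows, selector, threshold=None):
    """Keep the rows whose selector entry is 1, enforcing an optional minimum."""
    if len(rows) != len(selector):
        raise ValueError("selector must have one entry per row")
    if any(flag not in (0, 1) for flag in selector):
        raise ValueError("selector values must be 0 or 1")
    selected = [row for row, flag in zip(rows, selector) if flag == 1]
    if threshold is not None and len(selected) < threshold:
        # Stand-in for the server-side error described above.
        raise RuntimeError("filtered result has fewer rows than the threshold")
    return selected


rows = [{"col1": 1}, {"col1": 2}, {"col1": 1}]
selector = [int(row["col1"] == 1) for row in rows]  # models lambda x: x.col1 == 1
print(model_filter(rows, selector))               # [{'col1': 1}, {'col1': 1}]
print(model_filter(rows, selector, threshold=2))  # [{'col1': 1}, {'col1': 1}]
```

With `threshold=3` the same call raises, because only two rows satisfy the criterion.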
- groupby(cols, *, bypass_same_value_check=False, **query_args)¶
Computes a grouping of the table by the values of (a) given column(s). Returns a grouping object that can be used in aggregation (see CSeriesGroupBy) or as an argument to crandas.merge().
Currently, a maximum of around 100 000 unique values is supported. Above that, the groupby will fail and give an error message. Note that this is the number of unique values. The number of rows can be significantly higher as long as there are fewer than 100 000 different values in the groupby column(s).
- Parameters:
cols (str iterable or str or groupby.CDataFrameGroupBy) – If a string iterable is given, use these as the names of the columns to group by. If a string is given, use this as the name of the single column to group by. If a groupby.CDataFrameGroupBy object (i.e., a result of an earlier groupby operation) is given, re-use that grouping as a grouping of the current table. (The current table needs to have a column with the given name and the same values as the table for which the grouping was originally made.)
bypass_same_value_check (bool, default: False) – Unless set, if the given cols is a groupby.CDataFrameGroupBy instance made from another table, it is checked whether the columns are the same for the current and the other table. Bypassing this check may avoid problems where the check cannot be performed for large values.
query_args – See Query Arguments
- Returns:
Grouping object
- Return type:
- max(axis=0)¶
Computes the maximum of each (numeric) column
- Parameters:
axis (int, optional) – Which axis of the dataframe, only 0 is implemented, by default 0
- min(axis=0)¶
Computes the minimum of each (numeric) column
- Parameters:
axis (int, optional) – Which axis of the dataframe, only 0 is implemented, by default 0
- project(cols, **query_args)¶
Project table
Returns table with same rows but a selection of columns
- Parameters:
cols (list of str) – Columns to select. Can be empty. Columns can occur multiple times.
query_args – See Query Arguments
- Returns:
The projected table
- Return type:
- rename(columns, **query_args)¶
Implements pandas.DataFrame.rename. Only renaming of columns via the columns argument is supported.
- Parameters:
columns (dict) – dictionary of columns to be renamed of the form {"oldname": "newname"}
query_args – See Query Arguments
- Returns:
CDataFrame with updated column names
- Return type:
- sample(*, n=None, frac=None, random_state=None, **query_args)¶
Samples rows from the dataframe.
The number of rows can be specified either as an integer n or a fraction frac. The case frac==1 corresponds to returning a shuffling of the table and is equivalent to CDataFrame.shuffle.
If a random_state is given, the sampling is performed in a deterministic way and according to a public selection (i.e., known to the servers and predictable to the client); otherwise, the sampling is non-deterministic and private (not known to the client and servers). See also CDataFrame.shuffle.
- Parameters:
n (integer, default: None) – Number of rows to sample
frac (floating-point, default: None) – Proportion of rows (between 0.0 and 1.0, inclusive) to sample
random_state (long integer, default: None) – Seed for deterministic sampling (otherwise is non-deterministic)
query_args – See Query Arguments
- Returns:
ret – Copy of the table with rows sampled
- Return type:
- property schema¶
Schema for the table; see Using ctypes and schemas
- shuffle(*, random_state=None, **query_args)¶
Return table with rows shuffled. If a random_state is given, the shuffle is deterministic and performed according to a public permutation (i.e., known to the servers and predictable to the client); otherwise, the shuffle is non-deterministic and private (not known to the client and servers).
- Parameters:
random_state (long integer, default: None) – Seed for deterministic shuffle (otherwise is non-deterministic)
query_args – See Query Arguments
- Returns:
ret – Copy of the table with rows shuffled
- Return type:
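The effect of a seed on reproducibility can be illustrated with the standard library. This sketch only models the deterministic versus non-deterministic distinction; it says nothing about the privacy properties described above, and `model_shuffle` is a hypothetical helper rather than a crandas function.

```python
import random


def model_shuffle(rows, random_state=None):
    """Return a shuffled copy; the same seed always gives the same order."""
    rng = random.Random(random_state)  # None seeds from system entropy
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


rows = list(range(10))
first = model_shuffle(rows, random_state=7)
second = model_shuffle(rows, random_state=7)
print(first == second)        # True: a fixed seed is reproducible
print(sorted(first) == rows)  # True: the same rows, in another order
```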
- slice(key, **query_args)¶
Slice table
Returns table with same columns but a selection of rows
- Parameters:
key (slice) – Python slice object representing rows to select
query_args – See Query Arguments
- Returns:
The sliced table
- Return type:
- sort_values(by, **query_args)¶
Sorts the dataframe according to the values in the column by. Currently, sorting on strings is not supported.
- Parameters:
by (str) – The column to sort on
query_args – See Query Arguments
- Returns:
ret – Copy of the sorted table
- Return type:
- validate(*validations, **query_args)¶
Applies input validation to the table.
Input validation leads to a table that has the validations as constraints on the respective columns (e.g., checking that a column contains values in [0,2] leads to a column with values constrained to that domain). These constraints can be inspected by accessing tab.columns.cols[i].constraints.
Validations are instances of the crandas.Validation class and can be set by calling validation functions such as crandas.CSeriesColRef.in_range() and crandas.CSeriesColRef.sum_in_range().
- Parameters:
*validations (list of crandas.Validation objects) – Validations to apply to the table
query_args – See Query Arguments
- Returns:
If all validations have succeeded: copy of the table having the validations as constraints
- Return type:
- class crandas.crandas.CIndex(cols, **kwargs)¶
Bases:
object
Index (set of columns) of a CDataFrame
For a regular CDataFrame, this represents the columns (name and type) of the CDataFrame.
For a deferred CDataFrame (in a transaction, or resulting from a dry run), this represents the columns (name and type) that the result of an operation is expected to have based on its inputs. For such an expected column, the name is set, but the type and size (“elements per value”) may be undefined.
- __eq__(other)¶
Checks equality with input
- __getitem__(ix)¶
Returns name of column ix
- __len__()¶
Returns number of columns
- __repr__()¶
Returns printable representation
- property ctype¶
Ctype corresponding to the CIndex; see Using ctypes and schemas
- get_loc(name)¶
Get integer location for requested label
- Parameters:
name (str) – column name label
- Returns:
index of column with name name
- Return type:
int
- Raises:
KeyError – value not found
- matches_template(expected)¶
Checks whether the number and names of columns fit a template
- property schema¶
Schema corresponding to the CIndex; see Using ctypes and schemas
- to_dict()¶
Returns column names in dictionary form
- class crandas.crandas.CSeries(**kwargs)¶
Bases:
Summable
One dimensional array which represents either the column of a CDataFrame or the result of applying a rowwise function to one or more columns of a CDataFrame
- class DT(outer_instance)¶
Bases:
object
Used to retrieve date units in the pandas way
- as_table(*, column_name='', **query_args)¶
Outputs a CDataFrame that has the CSeries as its only column
- Parameters:
column_name (str, optional) – name for the column in the resulting CDataFrame
query_args – See Query Arguments
- Returns:
CDataFrame having the expected CSeries as its only column
- Return type:
- as_value(**query_args)¶
Interpret single-row CSeries as value
Interpret single-row CSeries as a constant. For example, without using as_value, a["col2"]+b["col2"] performs row-wise addition. With as_value, a single-row CSeries is interpreted as a single value instead of as a column, e.g., a["col2"]+b["col2"].as_value() adds the value in the single row of column col2 of table b to each row of a["col2"].
This function can be used to work with values that remain secret to the servers that perform the computation, e.g.:
data = cd.DataFrame({"a": [1, 2, 3]})
# The value "1" is a part of the function definition and so becomes known
# to the servers
data[data["a"] == 1]
# The value "1" is derived from a single-row column of a private table and
# so remains hidden to the servers
filtervalue = cd.DataFrame({"filtervalue": [1]})["filtervalue"].as_value()
data[data["a"] == filtervalue]
- Parameters:
query_args – See Query Arguments
- Returns:
ReturnValue representing the value of the (only row of the) CSeries
- Return type:
- astype(ctype, validate=False)¶
Converts output to a specific type
- Parameters:
ctype (Ctypes type specification) – Type to convert to. See Data types
validate (bool, default False) – If set, validate that the resulting column is of the correct type, e.g., is an 8-bit integer when tp=uint8.
- Returns:
CSeries converted to given type
- Return type:
- Raises:
ServerError – Conversion failed or not supported
- bytes_to_hex()¶
Converts a bytes column to a lowercase hex string (like “3ec9”)
- capitalize()¶
Returns string with the first character converted to uppercase and all other ones to lowercase
- contains(other)¶
Substring search
Searches for substring in the column
- Parameters:
other (crandas.CSeries) – Substring to search for
- Returns:
Result of search: 1 if substring is found, 0 otherwise
- Return type:
- day()¶
Returns the day of the month
- Returns:
day of the month
- Return type:
int
- day_of_year()¶
Returns the day of the year
- Returns:
Number representing the day of the year
- Return type:
int
- dayofyear()¶
Returns the day of the year
- Returns:
Number representing the day of the year
- Return type:
int
- encode()¶
Encodes an ASCII varchar column as a bytes column
- fillna(nullval)¶
Replaces NULL values in the column by nullval
- Parameters:
nullval (rowwise function) – Value to replace NULLs by
- fullmatch(pattern, *args)¶
Regular expression matching
Matches column to a regular expression.
- Parameters:
pattern (re.Re) – Regular expression to match
args (list of crandas.CSeries) – Additional columns for the match (can be referred to by (?1) to (?9) in the regular expression, e.g., r".*(?1).*")
- Returns:
Column containing the result of the matching: 1 if there is a match and 0 otherwise.
- Return type:
- get(*, name='', **query_args)¶
Deprecated. Use CSeries.as_table() instead.
- hex_to_bytes()¶
Converts a lowercase hex string (like “3ec9”) to a bytes column
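For a single value, the conversions performed by hex_to_bytes() and bytes_to_hex() correspond to the following standard-library calls. This only illustrates the encoding on one local value; it does not involve crandas columns.

```python
# One value of a hex string column, as in the "3ec9" example above.
hex_value = "3ec9"

# hex string -> bytes (what hex_to_bytes() does for each value)
raw = bytes.fromhex(hex_value)
print(len(raw))   # 2: every two hex digits encode one byte

# bytes -> lowercase hex string (what bytes_to_hex() does for each value)
print(raw.hex())  # 3ec9
```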
- if_else(ifval, elseval)¶
Selects values row by row, as in an if-else statement. self is the guard and has to be a column of bits: the value from ifval is selected for rows of self that have the value one, and the value from elseval is selected for rows of self that have the value zero.
- Parameters:
ifval (int) – Value if true
elseval (int) – Value otherwise
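The row-wise selection can be modelled on plain lists. In this sketch `model_if_else` is a hypothetical helper; accepting either a constant or a list for `ifval` and `elseval` is an assumption made for illustration.

```python
def model_if_else(guard, ifval, elseval):
    """Pick ifval where the guard bit is 1 and elseval where it is 0."""
    if any(bit not in (0, 1) for bit in guard):
        raise ValueError("guard must be a column of bits")

    def as_column(value):
        # Broadcast a constant to one entry per row; lists are used as given.
        return list(value) if isinstance(value, list) else [value] * len(guard)

    if_col, else_col = as_column(ifval), as_column(elseval)
    return [i if bit == 1 else e for bit, i, e in zip(guard, if_col, else_col)]


print(model_if_else([1, 0, 1], 10, -1))         # [10, -1, 10]
print(model_if_else([1, 0, 1], 10, [7, 8, 9]))  # [10, 8, 10]
```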
- inner(other)¶
Inner product of two vectors
- isna()¶
Returns whether respective values are NULL, boolean inverse of notna
- isnull()¶
Returns whether respective values are NULL, boolean inverse of notna
- len()¶
Returns the character length of each element of the CSeries (only works for CSeries of type string or bytes)
- Returns:
CSeries of character lengths (for string) or number of bytes (for bytes)
- Return type:
- lower(indices=None)¶
Returns string values in lowercase
- Parameters:
indices (int list, optional) – Represents the letters to be modified, if None then the whole string is modified
- Returns:
CSeries with lowercase strings
- Return type:
- Raises:
ValueError – Invalid index for string length
- month()¶
Returns the month in number format
- Returns:
month
- Return type:
int
- notna()¶
Returns whether respective values are not NULL, boolean inverse of isna
- notnull()¶
Alias for notna
- open()¶
Returns column in opened form
- strip()¶
Returns stripped string values
- substitute(sub_dict, output_size=None)¶
Performs string substitution in a string column. Inputs are provided in a dictionary of the form {"a": ["á", "à", "ä"]}, where the characters in the list ["á", "à", "ä"] will be substituted by the key "a". Substitution of substrings of more than one character is not currently supported.
- Parameters:
sub_dict (str Dictionary) – Dictionary where each key is the string (may be more than one character) to be added and each value is a list of characters to be substituted by the key
output_size (int, optional) – new max string length of the column, necessary when substituting a character for multiple ones, by default None
- Returns:
CSeries with modified strings
- Return type:
- Raises:
TypeError – Values of sub_dict in substitute need to be a list of characters
TypeError – Every character should match at most one substitution.
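The shape of `sub_dict` and the two error conditions can be modelled with `str.translate`. The sketch is plain Python and not the crandas implementation; `build_table` and `model_substitute` are hypothetical helpers, and the accented characters are written as escapes only to keep the source unambiguous.

```python
def build_table(sub_dict):
    """Turn {replacement: [characters]} into a str.translate table."""
    table = {}
    for replacement, characters in sub_dict.items():
        if not isinstance(characters, list) or any(
            not isinstance(ch, str) or len(ch) != 1 for ch in characters
        ):
            raise TypeError("Values of sub_dict need to be a list of characters")
        for ch in characters:
            if ord(ch) in table:
                raise TypeError("Every character should match at most one substitution")
            table[ord(ch)] = replacement
    return table


def model_substitute(values, sub_dict):
    """Apply the substitution to every string in a list."""
    table = build_table(sub_dict)
    return [value.translate(table) for value in values]


ACCENTED_A = ["\u00e1", "\u00e0", "\u00e4"]  # á, à, ä
print(model_substitute(["\u00e1n\u00e0", "b\u00e4d"], {"a": ACCENTED_A}))  # ['ana', 'bad']
# A key of more than one character makes the strings longer, which is the
# situation in which the documentation asks for output_size.
print(model_substitute(["stra\u00dfe"], {"ss": ["\u00df"]}))  # ['strasse']
```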
- upper(indices=None)¶
Returns string values in uppercase
- Parameters:
indices (int list, optional) – Represents the letters to be modified, if None then the whole string is modified
- Returns:
CSeries with uppercase strings
- Return type:
- Raises:
ValueError – Invalid index for string length
- vsum()¶
Sum the elements of a vector
- weekday()¶
Returns the day of the week, where Monday is 0
- Returns:
Number representing the day of the week
- Return type:
int
- with_threshold(threshold)¶
Adds a threshold to the CSeries. When the column is used as a filtering column or in an aggregation operation, this threshold indicates the minimum number of items that need to be in the filtering result or have the aggregation taken over.
- Parameters:
threshold (int) – minimum number of elements for operation to be allowed
- year()¶
Returns the year
- Returns:
year in 4 digits
- Return type:
int
- class crandas.crandas.CSeriesColRef(table, name, **kwargs)¶
Bases:
CSeries
Column of CDataFrame
Subclass of CSeries. Represents a column of a CDataFrame df as accessed via df["colname"] or lambda x: x.colname.
- as_table(*, column_name='', **query_args)¶
Outputs a CDataFrame that has the CSeries as its only column
- Parameters:
column_name (str, optional) – name for the column in the resulting CDataFrame
query_args – See Query Arguments
- Returns:
CDataFrame having the expected CSeries as its only column
- Return type:
- count(*, as_table=False, threshold=None, **query_args)¶
Computes the number of not-NULL elements of the series
See sum() for a description of the arguments.
- property ctype¶
Ctype for the column; see Using ctypes and schemas
- in_range(minval, maxval)¶
Validation that column values lie in specified range
Applies to integer/integer vector columns only
- Parameters:
minval (int) – minimum (inclusive)
maxval (int) – maximum (inclusive)
- Returns:
Validator for use in CDataFrame.validate
- Return type:
- max(*, as_table=False, threshold=None, **query_args)¶
Computes the maximum of the series
See sum() for a description of the arguments.
- mean(*, as_table=False, threshold=None, **query_args)¶
Computes the mean of the elements of the series.
See sum() for a description of the arguments.
- min(*, as_table=False, threshold=None, **query_args)¶
Computes the minimum of the series
See sum() for a description of the arguments.
- property schema¶
Schema for the column; see Using ctypes and schemas
- sum(*, as_table=False, threshold=None, **query_args)¶
Computes the sum of the elements of the series
- Parameters:
as_table (boolean, default: False) – if True, result is returned as DataFrame instead of value
threshold (int, default None) – if given, only return value as long as the number of not-NULL elements is above the minimum threshold of elements for the operation
query_args – See Query Arguments
- Returns:
Result of applicable type, depending on as_table and mode
- Return type:
int/Deferred/DataFrame/CDataFrame
- sum_in_range(minval, maxval)¶
Validation that sum of column values lies in specified range
Applies to integer/integer vector columns only
- Parameters:
minval (int) – minimum (inclusive)
maxval (int) – maximum (inclusive)
- Returns:
Validator for use in CDataFrame.validate
- Return type:
- class crandas.crandas.CSeriesFun(op, vals, args={}, **kwargs)¶
Bases:
CSeries
Subclass of CSeries that represents the result of applying a function
- class crandas.crandas.Col(name, type='?', elperv=-1, nullable=False, constraints=None, modulus=None, _ctype=None, _schema=None, **kwargs)¶
Bases:
object
Represents the type of a column.
The type and elperv fields can be equal to “?” and -1, respectively, to indicate that these are not known (e.g., for columns in an expected specification to vdl_query).
- __eq__(other)¶
Checks structural equality between columns
- __repr__()¶
Returns printable representation
- property ctype¶
Ctype for the column; see Using ctypes and schemas
- renamed(name)¶
Return copy of the Col with a different name
- property schema¶
Schema for the column; see Using ctypes and schemas
- exception crandas.crandas.CtypeSpuriousColumnsWarning(*, columns, **kwargs)¶
Bases:
UserWarning
Warning that ctype contains spurious columns
For example, below, the ctype has a spurious column b:
cd.DataFrame({"a": [1]}, ctype={"a": "int", "b": "varchar"})
- crandas.crandas.DataFrame(data=None, *args, ctype=None, schema=None, auto_bounds=None, **kwargs)¶
Creates a crandas dataframe.
This function calls the pandas DataFrame constructor pd.DataFrame(data, *args, ...), and uploads the resulting table using upload_pandas_dataframe().
Further arguments apart from data are passed to pd.DataFrame or upload_pandas_dataframe as appropriate. In particular, name can be used to specify a name for the table, and auto_bounds can be used to disable warnings about automatically derived column bounds; see Query Arguments. Further, the ctype and schema arguments can be used to define columns and their types; see upload_pandas_dataframe().
When specifying data as a dict of Series arguments, it is possible to use crandas.Series() instead of pd.Series. This allows the crandas ctype to be specified directly with the data.
- Parameters:
data (any, default: None) – Contents of the dataframe, passed on to pd.DataFrame. If data is None and schema is given, an empty dataframe according to the given schema is created. Otherwise, an empty dataframe with no columns is created.
ctype (dict or str, default {}) – ctypes for columns; see Using ctypes and schemas
schema (dict, default None) – schema for uploaded data; see Using ctypes and schemas
auto_bounds (bool, optional) – See Query Arguments
- Returns:
uploaded table
- Return type:
- exception crandas.crandas.DataSpuriousColumnsWarning(*, columns, **kwargs)¶
Bases:
UserWarning
Warning that data contains spurious columns
For example, below, the data has a spurious column b:
cd.DataFrame({"a": [1], "b": [2]}, schema={"a": "int"})
- class crandas.crandas.ExpectDropResult(*, expected_len, **kwargs)¶
Bases:
ResponseHandler
Response handler for drop response with given expected length
- class crandas.crandas.ReturnValue(type, elperv, is_series, num_rows=None, modulus=-1, *, handle, **kwargs)¶
Bases:
StateObject, CSeries
Represent a value or series of values computed by the VDL
Various VDL commands, e.g., CSeries.sum(), return values or series of values, as opposed to returning a DataFrame. This class is the analogue of CDataFrame that is used to represent such remote values.
A ReturnValue can be used as a CSeries, making it possible e.g. to filter on a value computed by the VDL without having to open it. For example, the following filters all maximum elements without revealing the maximum: tab[tab["col"]==tab["col"].max(mode="regular")].
To obtain the value/series in the clear, call ReturnValue.open(). This returns a single value, unless .is_series is set, in which case it returns a Pandas series, which needs to have .num_rows rows if set.
- get(**query_args)¶
Deprecated. Use CSeries.as_table() instead.
- open(**query_args)¶
Open value
- Parameters:
query_args – See Query Arguments
- Returns:
Value represented by remote object, see main class documentation
- Return type:
int/…/pd.Series
- crandas.crandas.Series(*args, ctype=None, **kwargs)¶
Similar to pandas.Series, but additionally allows the user to specify a ctype
For example:
cd.DataFrame({"a": cd.Series([1,2,3], ctype="int16?")})
- class crandas.crandas.Validation(table, col, json_desc)¶
Bases:
object
Represents a validation that can be applied to a column.
Returned by functions like crandas.CSeriesColRef.in_range(), etc. Used as an argument to crandas.CDataFrame.validate().
- json_desc can contain a combination of the following keys and values:
- bounds ([int string, int string])
the lower and upper bounds of the data, represented as strings for arbitrary precision
- sum_bounds ([int string, int string])
the lower and upper bounds of the sum of an array entry, represented as strings for arbitrary precision
- precision (int)
fixed-point precision
- is_array (bool)
boolean representing whether the column is an array
If the column is of type int, it can have the following keys: bounds, sum_bounds, is_array
If the column is of type fixed point, it can have the following keys: bounds, is_array, precision
- crandas.crandas.choose_threshold(obj_threshold, arg_threshold)¶
Choose threshold where obj_threshold is given using the with_threshold() function (and hence, set in the CDataFrame or CSeries itself), and arg_threshold is specified as an argument to the aggregation function or filter() function.
Returns the threshold or None.
- crandas.crandas.concat(tables_, *, ignore_index=True, axis=0, join='outer', session=None, **query_args)¶
Table concatenation
Performs horizontal/vertical concatenation of tables, modelled on pandas pd.concat. Currently, only inner joins are supported for vertical concatenation. The first table defines the set of columns that the resulting table has. If join="inner", only columns common to all tables are included. Otherwise, the remaining tables need to have the same set of columns as the first table (up to ordering), or an error is returned.
- Parameters:
tables (list of CDataFrames) – One or more DataFrames to be concatenated
ignore_index (bool, optional) – does nothing, but is used in crandas.append, by default True
axis (int, optional) – Concatenation axis, 0=vertical, 1=horizontal, by default 0
join (str, optional) – type of join (currently only inner join is supported for vertical join), by default "outer"
query_args – See Query Arguments
- Returns:
mode-dependent return table representing vertical/horizontal join
- Return type:
- Raises:
RuntimeError – Received wrong inputs
NotImplementedError – Limited vertical concatenation is allowed, there must be a matching column on both tables to be concatenated
ValueError – Limited vertical concatenation is allowed, number of columns should be the same in all tables
RuntimeError – Horizontal join would create table with duplicate column names
- crandas.crandas.cut(series, bins, *, labels, right=True, add_inf=False)¶
Bin values into discrete intervals (aka quantization)
Bins values into discrete intervals, a la pandas.cut. Quantizes series into bins [bins[0],bins[1]), [bins[1],bins[2]), etc., and returns the corresponding bin labels (so labels[0] for bin [bins[0],bins[1]), labels[1] for bin [bins[1],bins[2]), etc.). The bins include the left edge and exclude the right edge.
The first bin should have -np.inf as left edge and the last bin should have np.inf as its right edge. If the argument add_inf is set to true, these edges are automatically added and do not need to be given as arguments.
The bins and labels can be given in the plain (e.g., cd.cut(cdf["col"], [-np.inf, 0, 10, np.inf], labels=[1, 2, 3])), or as columns providing respective bins and labels for the respective input rows (e.g., cd.cut(cdf["col"], cdf["bins"], labels=cdf["labels"])). In the latter case, the argument add_inf=True should be given.
- Parameters:
series (CSeries) – series to apply quantization to
bins (int list, CSeries) – list of integers or int_vec column defining the bin edges
labels (int list) – list of integers or int_vec column defining the bin labels (one more element than bins if add_inf is True or one less otherwise)
right (bool) – specifies whether bins include their right edges
add_inf (bool) – when set to False, bins should include -np.inf and np.inf; when set to True they are automatically added. Can only be set to True when bins is a CSeries
- Return type:
CSeriesFun representing the result of the quantization
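The binning rule described above (left edge included, right edge excluded) can be reproduced on plain numbers with the standard `bisect` module. This is a sketch of that rule only: `model_cut` is a hypothetical helper, bins and labels are plain lists, and the `right` parameter is not modelled.

```python
import bisect
import math


def model_cut(values, bins, labels, add_inf=False):
    """Map each value to the label of the half-open bin [left, right) it falls in."""
    edges = list(bins)
    if add_inf:
        edges = [-math.inf] + edges + [math.inf]
    if edges[0] != -math.inf or edges[-1] != math.inf:
        raise ValueError("first and last edge must be -inf and inf")
    if len(labels) != len(edges) - 1:
        raise ValueError("need exactly one label per bin")
    # bisect_right counts the edges <= value, so a value equal to an edge
    # lands in the bin that starts at that edge.
    return [labels[bisect.bisect_right(edges, value) - 1] for value in values]


edges = [-math.inf, 0, 10, math.inf]
print(model_cut([-5, 0, 9, 10, 42], edges, [1, 2, 3]))            # [1, 2, 2, 3, 3]
print(model_cut([-5, 0, 10], [0, 10], [1, 2, 3], add_inf=True))   # [1, 2, 3]
```

The second call shows the label count rule: with `add_inf=True` there is one more label than there are given edges.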
- crandas.crandas.dataframe_to_command(df, ctype, *, auto_bounds, schema)¶
Turns DataFrame into a VDL “new” command. Wrapper for table_to_command.
- crandas.crandas.demo_table(number_of_rows=1, number_of_columns=1, **query_args)¶
Create demo table.
Creates a demo table with the given number of rows and columns. The columns are respectively named “col1”, “col2”, … and have sequential integer values 1, 2, …
A nonce is included in the command so that every time this command is called, it receives a fresh table handle.
- Parameters:
number_of_rows (int, optional) – Number of rows of resulting table, by default 1
number_of_columns (int, optional) – Number of columns of resulting table, by default 1
query_args – See Query Arguments
- Returns:
A demo table with a fresh name
- Return type:
- crandas.crandas.get_table(handle_or_name, /, *, schema=None, check=True, map_dummy_handles=None, **query_args)¶
Access a previously uploaded table by its handle or name.
The previously uploaded table is specified using the handle_or_name argument.
When get_table is called with a schema argument and/or from a recorded script, the retrieved table is checked against the schema that was specified or that was used when recording. If this schema check fails, a ValueError is raised. This check can be disabled by passing check=False.
Note that a name argument, if given, is interpreted as being part of the standard query arguments, and is thus interpreted as a target name for the result of the get query. Accordingly, get_table("a", name="b") can be used to assign the (additional) symbolic name "b" to the table with name or handle "a".
- Parameters:
handle_or_name (str) – Handle (hex-encoded string) or name. Gets interpreted as a handle if it is a string of 64 uppercase hexadecimal characters, otherwise as a name.
schema (optional, default None) – Represents the structure of the table to be added. Needed if get_table is called from a Transaction, or if it is desired to check that the table corresponds to the given schema, by default None. The schema can be specified as:
schema dict (e.g., cdf.schema; see Using ctypes and schemas)
list of column names (e.g., ["col1", "col2"])
pandas DataFrame
any valid argument to pandas.read_csv
CIndex (e.g., cdf.columns)
check (bool (default:
True
)) –Enables server-side validation of the object type and schema (if given).
In scripts, if
check==True
, then the schema of the table is checked against the table that the script was recorded with. To disable this check, usecheck=False
during the script recording.map_dummy_handles (bool, optional) –
Whenever a script is being recorded (see
crandas.script
), the default behavior is to interpret all calls toget_table(handle)
asdummy_for:<handle>
table names. This allows the user to use the same handle in both script recording and execution, even though the script recording takes place in a different environment where the real table handle does not exist.This behavior can be overridden in two levels: for the entire script or for a single call to
get_table
. For the entire script, mapping dummy handles can be disabled by supplyingmap_dummy_handles
asFalse
in the call tocrandas.script.record()
. For the call toget_table
, by specifying this argument as either True or False, the mapping behavior is forced to be either enabled or disabled, regardless of the current script mode.query_args – See Query Arguments
- Returns:
The table with handle or name handle_or_name
- Return type:
- Raises:
ValueError – Schema validation failed, or schema not specified (when performed in a transaction)
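The handle-versus-name rule given for handle_or_name can be illustrated with a small standalone sketch. This is not crandas code: the helper name looks_like_handle is hypothetical, and the sketch assumes that a handle is exactly 64 characters drawn from 0-9 and A-F.

```python
import re

# Assumption: a handle is exactly 64 characters from 0-9 and uppercase A-F.
_HANDLE_RE = re.compile(r"[0-9A-F]{64}")


def looks_like_handle(handle_or_name: str) -> bool:
    """Return True if the string would be read as a handle, False if as a name.

    Hypothetical helper for illustration; crandas may implement the check
    differently.
    """
    return _HANDLE_RE.fullmatch(handle_or_name) is not None


if __name__ == "__main__":
    for candidate in ("AB" * 32, "ab" * 32, "my_table"):
        kind = "handle" if looks_like_handle(candidate) else "name"
        print(f"{candidate[:12]}... is treated as a {kind}")
```

Under this reading, a table whose name happens to consist of 64 uppercase hexadecimal characters would be treated as a handle, so such names are best avoided.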
- crandas.crandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, validate=None, suffixes=('_x', '_y'), **query_args)¶
Merge tables using a database-style join. Implements pandas.merge.
- The following types of merge are supported:
inner join: returns only the rows where the join columns match
outer join: returns rows from both tables, matched where possible, starting with rows of left table in their original order
left join: returns rows of left table in original order, matched with a row of the right table where possible
right join: returns rows of right table, matched with a row of the left table where possible
Columns to join on are given either by a common on argument, or separate left_on and right_on arguments for the left and right tables. By default, a one-to-one merge is performed, and an error is given if the values of the columns to merge on are not unique. It is also possible to perform a many-to-one merge or a one-to-many merge by specifying the result of a CDataFrame.groupby() on the join columns as the left_on or right_on argument, respectively. Many-to-many joins are not supported.
In the case of a many-to-one or one-to-many merge, the servers learn the number of unique matching keys.
- Parameters:
left (CDataFrame) – Left table to be joined
right (CDataFrame) – Right table to be joined
how ("inner" (default), "outer", "left", or "right", optional) – Type of join
on (str or list of str, optional) – Column(s) to join on; must be common to both tables
left_on (str, list of str, or CDataFrameGroupBy, optional) – Column(s) of the left table to join on
right_on (str, list of str, or CDataFrameGroupBy, optional) – Column(s) of the right table to join on, by default None
validate (str, optional) – Can be "one_to_one", "1:1", "one_to_many", "1:m", "many_to_one" or "m:1". If given, it is compared against the type of merge derived from the left_on and right_on arguments as discussed above, and an exception is raised if it is incorrect
query_args – VDL query arguments
- Returns:
Result of the merging operation
- Return type:
CDataFrame
- Raises:
MergeError – Values of the join columns are not unique
ValueError – Incorrect combination of arguments
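The four join types and the uniqueness requirement of a one-to-one merge can be illustrated on plain Python data. The sketch below is not crandas code and runs without a VDL connection. The function name merge_one_to_one is hypothetical, rows are modelled as dicts, a single join column is assumed, the two tables are assumed to share no column names other than the join column, and ValueError stands in for the documented MergeError. The row order for inner and right joins follows the left and right table respectively, which is an assumption of this sketch.

```python
def merge_one_to_one(left, right, on, how="inner"):
    """Join two lists of row dicts on one key column with unique values."""
    if how not in ("inner", "outer", "left", "right"):
        raise ValueError(f"unsupported join type: {how!r}")

    def index_by_key(rows, side):
        index = {}
        for row in rows:
            if row[on] in index:
                # Stand-in for the documented MergeError on non-unique keys.
                raise ValueError(f"duplicate key {row[on]!r} in {side} table")
            index[row[on]] = row
        return index

    left_index = index_by_key(left, "left")
    right_index = index_by_key(right, "right")
    # Non-key columns are taken from the first row of each table.
    left_cols = [c for c in (left[0] if left else {}) if c != on]
    right_cols = [c for c in (right[0] if right else {}) if c != on]

    def combine(key, lrow, rrow):
        out = {on: key}
        for col in left_cols:
            out[col] = lrow[col] if lrow is not None else None
        for col in right_cols:
            out[col] = rrow[col] if rrow is not None else None
        return out

    if how == "right":
        keys = [row[on] for row in right]
    else:
        keys = [row[on] for row in left]
        if how == "outer":
            # Outer join: left rows first, then unmatched right rows.
            keys += [row[on] for row in right if row[on] not in left_index]

    result = []
    for key in keys:
        lrow, rrow = left_index.get(key), right_index.get(key)
        if how == "inner" and rrow is None:
            continue  # inner join keeps only keys present in both tables
        result.append(combine(key, lrow, rrow))
    return result


if __name__ == "__main__":
    people = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
    scores = [{"id": 2, "score": 20}, {"id": 3, "score": 30}]
    for how in ("inner", "outer", "left", "right"):
        print(how, merge_one_to_one(people, scores, on="id", how=how))
```

Unmatched cells are filled with None here; how crandas represents missing values in the result is not covered by this sketch.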
- crandas.crandas.pandas_dataframe_schema(df, ctype=None, auto_bounds=None, schema=None)¶
Determine schema for pandas DataFrame
Tries to encode the given data, and returns the schema of the crandas DataFrame that would result from calling crandas.crandas.upload_pandas_dataframe() with the passed ctype and schema. See Using ctypes and schemas.
- Parameters:
df (pd.DataFrame) – DataFrame
ctype – see crandas.crandas.upload_pandas_dataframe()
auto_bounds – see crandas.crandas.upload_pandas_dataframe()
schema – see crandas.crandas.upload_pandas_dataframe()
- Returns:
schema; see Using ctypes and schemas
- Return type:
dict
- crandas.crandas.parquet_to_command(pq_table, ctype, *, auto_bounds, schema)¶
Turns Parquet object into a VDL “new” command. Wrapper for table_to_command.
- crandas.crandas.read_csv(file_name, **kwargs)¶
Upload the given CSV file to the VDL
Internally calls pd.read_csv(file_name, ...) to read the CSV file and calls upload_pandas_dataframe(...) on the resulting DataFrame.
- Parameters:
file_name (str | PathLike) – name of the file
**kwargs (any) –
any keyword arguments to pd.read_csv
any arguments to upload_pandas_dataframe(); in particular, this includes query arguments (see Query Arguments) such as dummy_for for use during query design, and name to assign a name to the table
- Returns:
uploaded table
- Return type:
CDataFrame
- crandas.crandas.read_csv_schema(file_name, ctype=None, auto_bounds=None, schema=None, **kwargs)¶
Determine schema for CSV file
Tries to load the CSV and encode it for use by crandas. Returns the schema of the crandas DataFrame that would result from calling crandas.crandas.read_csv(). See Using ctypes and schemas.
- Parameters:
file_name – argument to pd.read_csv
ctype – see crandas.crandas.upload_pandas_dataframe()
auto_bounds – see crandas.crandas.upload_pandas_dataframe()
schema – see crandas.crandas.upload_pandas_dataframe()
**kwargs (any) – any arguments to be forwarded to pd.read_csv
- Returns:
schema; see Using ctypes and schemas
- Return type:
dict
- crandas.crandas.read_parquet(file_name, **query_args)¶
Upload the given Apache Parquet file to the VDL
- Parameters:
file_name (str | Path-like object | arrowTable) – name of the file or path to the file to be read
**kwargs (any) –
any keyword arguments to pd.read_csv
any arguments to upload_pandas_dataframe(); in particular, this includes query arguments (see Query Arguments) such as dummy_for for use during query design, and name to assign a name to the table
- Returns:
uploaded table
- Return type:
CDataFrame
- crandas.crandas.remove_objects(objects, **query_args)¶
Remove objects from server
If the list of objects contains a non-existent object, an error is raised. Even in that case, all existing objects from the list are removed.
- Parameters:
objects (list of StateObject (CDataFrame, ...)) – Objects to be removed
query_args – See Query Arguments
- Raises:
ServerError – Some of the given objects did not exist
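The "remove everything, then report" behaviour described for remove_objects can be sketched on an ordinary dictionary. This is an illustration of the semantics only, not crandas code: the function name remove_all and the dict-based store are hypothetical, and KeyError stands in for the documented ServerError.

```python
def remove_all(store, names):
    """Remove every listed name from store; raise afterwards if any was missing."""
    missing = []
    for name in names:
        if name in store:
            del store[name]
        else:
            # Keep going so that the remaining objects are still removed.
            missing.append(name)
    if missing:
        # Stand-in for the ServerError named in the documentation.
        raise KeyError(f"objects did not exist: {missing}")


if __name__ == "__main__":
    tables = {"a": 1, "b": 2, "c": 3}
    try:
        remove_all(tables, ["a", "nope", "c"])
    except KeyError as err:
        print("error reported:", err)
    print("remaining:", tables)
```

The point of the pattern is that an error does not mean the call was a no-op: callers should not retry the removal of objects that did exist.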
- crandas.crandas.series_max(col1, col2)¶
Compute the maximum of two CSeries
- crandas.crandas.series_min(col1, col2)¶
Compute the minimum of two CSeries
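The descriptions of series_max and series_min do not say how the two columns are combined. The sketch below assumes the comparison is made row by row, so that the result has one value per row. It uses plain Python lists rather than CSeries, and the helper names elementwise_max and elementwise_min are hypothetical.

```python
def elementwise_max(col1, col2):
    """Row-by-row maximum of two equally long columns (assumed semantics)."""
    if len(col1) != len(col2):
        raise ValueError("columns must have the same length")
    return [max(a, b) for a, b in zip(col1, col2)]


def elementwise_min(col1, col2):
    """Row-by-row minimum of two equally long columns (assumed semantics)."""
    if len(col1) != len(col2):
        raise ValueError("columns must have the same length")
    return [min(a, b) for a, b in zip(col1, col2)]


if __name__ == "__main__":
    print(elementwise_max([1, 5, 3], [4, 2, 3]))
    print(elementwise_min([1, 5, 3], [4, 2, 3]))
```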
- crandas.crandas.upload_pandas_dataframe(df, ctype=None, session=None, _keep=True, *, auto_bounds=None, schema=None, **query_args)¶
Uploads an existing pandas DataFrame into the VDL
- Parameters:
df (pandas.DataFrame) – DataFrame to upload
ctype (dict, default None) – ctypes for columns; see Using ctypes and schemas
session (cd.base.Session) – see Query Arguments
_keep (bool/None) – if False, table is not saved
auto_bounds (bool, optional) – See Query Arguments
schema (dict, default None) – schema for uploaded data; see Using ctypes and schemas
**query_args (query arguments) – see Query Arguments
session is an argument for an underlying function. For more information, see Query Arguments.
- Returns:
the uploaded DataFrame
- Return type:
CDataFrame
- crandas.crandas.upload_streaming_file(file_reader, file_type, ctype=None, session=None, _keep=True, *, auto_bounds=None, schema=None, **query_args)¶
Uploads a file into the VDL. Instead of loading the entire file, a file reader is used to load the table column by column when uploading. Columns are then deleted, so considerably less memory is used. Currently only works for Parquet files.
- Parameters:
file_reader (pq.ParquetFile) – file to upload
file_type (str) – a string describing the type of file of file_reader
ctype (dict, default: None) – explicitly given types for columns
name (str, optional) – name for the table; passed on to upload_pandas_dataframe() if given, by default None
auto_bounds (bool, optional) – See Query Arguments
schema (dict, default None) – schema for uploaded data; see Using ctypes and schemas
- Returns:
the uploaded DataFrame
- Return type:
CDataFrame