Changelog

v1.9.1

  • Fix a bug in the memory allocation strategy of the Docker image

v1.9.0

Major new features include:

  • Connect to the VDL through the connection file

  • Group by on strings and multiple columns

  • Ordinal logistic regression and k-nearest neighbors

  • Added stepless mode to scripts

  • Added sorting by column to CDataFrame with sort_values

  • Extended functionality for regular expression matching

New features

  • Connect to the VDL through the connection file

    It is now possible to connect to the VDL with a connection file. The file contains the URL and certificate for one of the servers, and the public keys of the 3 servers, which are used to encrypt uploads. By default, this file is default.vdlconn in the configuration folder ~/.config/crandas. The name default.vdlconn can be overridden through the default_connection_file variable. In standard deployments, the connection file shares its name with the environment.

    from crandas.base import session
    
    # connect your session to the VDL
    session.connect("<your-environment>")
    
  • Group by on strings and multiple columns

    It is now possible to perform group by operations and left joins on columns of any (non-nullable) type. It is also possible to group by multiple columns: instead of passing the name of a single column as an argument to CDataFrame.groupby(), a list of column names can be used.

    tab = cd.DataFrame({"a": [1, 2, 2, 2, 3, 3], "b": ["x", "x", "x", "y", "y", "y"]}, auto_bounds=True)
    
    # Perform group by on multiple columns, which can be of any non-nullable type
    grouping = tab.groupby(["a", "b"])
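
    To make the semantics of a multi-column group by concrete, the sketch below reproduces the grouping in plain Python on the same data, without crandas. It only illustrates the idea that each distinct combination of values forms one group; it says nothing about how crandas computes the grouping internally.

```python
from collections import defaultdict

# Same data as the example above, as plain Python lists (no crandas involved).
a = [1, 2, 2, 2, 3, 3]
b = ["x", "x", "x", "y", "y", "y"]

# Grouping on several columns means the group key is the tuple of their values.
groups = defaultdict(list)
for row_index, key in enumerate(zip(a, b)):
    groups[key].append(row_index)

# Number of rows per (a, b) combination.
group_sizes = {key: len(rows) for key, rows in groups.items()}
print(group_sizes)  # {(1, 'x'): 1, (2, 'x'): 2, (2, 'y'): 1, (3, 'y'): 2}
```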
    
  • Ordinal logistic regression and k-nearest neighbors

    The crlearn suite of machine learning algorithms has been extended with ordinal logistic regression as well as k-nearest neighbors. Ordinal logistic regression has a response variable that can take one of several ordered categories. To use it, simply add multi_class="ordinal" when constructing a LogisticRegression model.

    from crandas.crlearn.logistic_regression import LogisticRegression
    
    model = LogisticRegression(solver='lbfgs', multi_class="ordinal", n_classes=5)
    

    We also introduce a way to score this kind of regression, the confusion matrix. The confusion matrix visualizes the relation between the predicted classes and the actual classes.

    from crandas.crlearn.metrics import confusion_matrix
    
    matrix = confusion_matrix(y, y_pred_classes, n_classes=5)
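
    As an illustration of what a confusion matrix contains, here is a minimal plain-Python sketch. The helper confusion_counts and the label lists are made up for this example, and the orientation (rows are actual classes, columns are predicted classes) is an assumption; the layout returned by crandas may differ.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Count how often each actual class (row) was predicted as each class (column)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Illustrative labels only; class 1 is once mistaken for class 2.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 1]

counts = confusion_counts(y_true, y_pred, n_classes=3)
for row in counts:
    print(row)
```

    Entries on the diagonal count correct predictions; every off-diagonal entry counts one kind of mistake.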
    

    We have also introduced the k-nearest neighbors algorithm, which predicts a value based on the closest data points in the training data. Its usage follows a structure similar to the other regression functionality in crandas.

    import crandas as cd
    from crandas.crlearn.neighbors import KNeighborsRegressor
    X_train = cd.DataFrame({"input": [0, 1, 2, 3]}, auto_bounds=True)
    y_train = cd.DataFrame({"output": [0, 0, 1, 1]}, auto_bounds=True)
    X_test = cd.DataFrame({"input": [1]}, auto_bounds=True)
    neigh = KNeighborsRegressor(n_neighbors=3)
    neigh.fit(X_train, y_train)
    neigh.predict_value(X_test)
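
    The idea behind the prediction can be sketched in plain Python on the same numbers. The helper knn_regress is hypothetical, and taking the unweighted mean of the neighbors' outputs is an assumption about the regressor; the crandas implementation may weight or break ties differently.

```python
def knn_regress(train_inputs, train_outputs, query, n_neighbors):
    """Average the outputs of the n_neighbors training points closest to query."""
    by_distance = sorted(
        zip(train_inputs, train_outputs),
        key=lambda pair: abs(pair[0] - query),
    )
    nearest = by_distance[:n_neighbors]
    return sum(output for _, output in nearest) / n_neighbors


# The three points closest to 1 are 0, 1 and 2, with outputs 0, 0 and 1.
prediction = knn_regress([0, 1, 2, 3], [0, 0, 1, 1], query=1, n_neighbors=3)
print(prediction)
```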
    

    More information can be found in the relevant section.

  • Added stepless mode to scripts

    We support stepless mode in scripts, which can be manually enabled to remove script_step numbers from certain queries. This can be useful together with the Any placeholder, for queries that may be executed a variable number of times.

  • Added sorting by column to CDataFrame with sort_values

    We can now sort a dataframe according to a column.

    cdf = cd.DataFrame({"a": [3, 1, 4, 5, 2], "b": [1, 2, 3, 4, 5]}, auto_bounds=True)
    cdf = cdf.sort_values("a")
    

    Currently, sorting on strings is not supported.
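
    The effect of sorting a table by one column can be sketched in plain Python on the same data. Ascending order is assumed here, as in the pandas default; the sketch is only meant to show that whole rows move together.

```python
a = [3, 1, 4, 5, 2]
b = [1, 2, 3, 4, 5]

# Sort whole rows by the value in column "a", so "b" is reordered along with it.
rows = sorted(zip(a, b), key=lambda row: row[0])
sorted_a = [row[0] for row in rows]
sorted_b = [row[1] for row in rows]

print(sorted_a)  # [1, 2, 3, 4, 5]
print(sorted_b)  # [2, 5, 1, 3, 4]
```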

  • Extended functionality for regular expression matching

    Regular expression matching has been made more efficient through state-of-the-art improvements. Additionally, we have added support for the following operators in regular expressions:

    • {n}: match exactly n times

    • {min,}: match at least min times

    • {,max}: match at most max times

    • {min,max}: match at least min and at most max times
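
    The repetition operators listed above follow the usual regular expression conventions. As a rough illustration, the sketch below uses Python's standard re module, which happens to support the same four forms; it does not use crandas, and the helper matches is hypothetical.

```python
import re


def matches(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# {n}: exactly n repetitions
exactly_three = [matches(r"a{3}", s) for s in ["aa", "aaa", "aaaa"]]
# {min,}: at least min repetitions
at_least_two = [matches(r"a{2,}", s) for s in ["a", "aa", "aaaa"]]
# {,max}: at most max repetitions (including zero)
at_most_two = [matches(r"a{,2}", s) for s in ["", "aa", "aaa"]]
# {min,max}: between min and max repetitions
two_to_three = [matches(r"a{2,3}", s) for s in ["a", "aa", "aaa", "aaaa"]]

print(exactly_three, at_least_two, at_most_two, two_to_three)
```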

v1.8.0

Major new features include:

  • Support for bigger (96 bit) integers

  • Progress bars for running queries and the possibility of cancelling running queries

  • Memory usage improvements (client & server)

  • Null value (missing values) support for all column types

  • Searching strings using regular expressions

  • Added a date column type

New features

  • Support for columns with bigger (96 bit) integers

    Just like in the previous version, integers have the ctype int. When specifying the ctype, minimum and maximum bounds for the values can be supplied using the min and max parameters, e.g. int[min=0, max=1000]. Bounds (strictly) between -2^95 and 2^95 are now supported.

    For example, to upload a column "col": [1, 2, 3, 4] as an int use the following ctype_spec:

    table = cd.DataFrame({"col":[1, 2, 3, 4]},  ctype={"col": "int[min=1,max=4]"})
    

    as before.

    To force usage of a particular modulus the integer ctype accepts the keyword argument modulus. For example, to force usage of large integers one can run:

    from crandas.moduli import moduli
    table = cd.DataFrame({"col":[1, 2, 3, 4]},  ctype={"col": f"int[min=1,max=4,modulus={moduli[128]}]"})
    

    Notes:

    • crandas will automatically switch to int[modulus={moduli[128]}] if the (derived) bounds do not fit in an int32.

    • crandas will throw an error if the bounds do not fit in an int96.

    We refer to 32-bit integer columns as F64 and to 96-bit integer columns as F128, because they are internally represented as 64-bit and 128-bit numbers, respectively, to account for a necessary security margin.
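
    The selection rule described in the notes can be sketched as follows. The function choose_representation is hypothetical and not part of crandas, and treating "fits in an int32" as the usual signed range from -2^31 to 2^31 - 1 is an assumption; the large-integer limit follows the bounds stated above (strictly between -2^95 and 2^95).

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1  # assumed meaning of "fits in an int32"
LARGE_LIMIT = 2**95  # bounds must lie strictly between -2^95 and 2^95


def choose_representation(min_bound, max_bound):
    """Return "F64" or "F128" for the given column bounds, or raise if neither fits."""
    if min_bound > max_bound:
        raise ValueError("min bound exceeds max bound")
    if INT32_MIN <= min_bound and max_bound <= INT32_MAX:
        return "F64"
    if -LARGE_LIMIT < min_bound and max_bound < LARGE_LIMIT:
        return "F128"
    raise ValueError("bounds do not fit in a 96-bit integer")


print(choose_representation(1, 4))       # F64
print(choose_representation(0, 2**40))   # F128
```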

    Supported features for large integers:

    • Basic binary arithmetic (+, -, *, ==, <, >, <=, >=) between any two integer columns

    • Groupby and filter on large integers

    • Unary functions on large integer columns, such as mean(), var(), sum(), ...

    • if_else where the 3 arguments guard, ifval, elseval may be any integer column

    • Conversion from 32-bit integer columns to large integer columns via astype and vice versa

    • Vertical concatenation of integer columns based on different moduli

    • Performing a join on columns based on different moduli

    Current limitations:

    • We do not yet support string conversion to large integers

    • json_to_val currently only allows integers up to int32

    • IntegerList is currently only defined over F64

  • Searching strings and regular expressions

    To search a string column for a particular substring, use the CSeries.contains function:

    table = cd.DataFrame({"col": ["this", "is", "a", "text", "column"]})
    only_is_rows = table["col"].contains("is")
    table[only_is_rows].open()
    

    Regular expressions are also supported, using the new CSeries.fullmatch function:

    import crandas as cd
    import crandas.re
    table = cd.DataFrame({"col": ["this", "is", "a", "text", "column"]})
    starts_with_t = table["col"].fullmatch(cd.re.Re("t.*"))
    table[starts_with_t].open()
    

    Regular expressions support the following operations:

    • |: union

    • *: Kleene star (zero or more)

    • +: one or more

    • ?: zero or one

    • .: any character (note that this also matches non-printable characters)

    • (, ): regexp grouping

    • [...]: set of characters (including character ranges, e.g., [A-Za-z])

    • \d: digits (equivalent to [0-9])

    • \s: whitespace (equivalent to [ \t\n\r\f\v])

    • \w: alphanumeric and underscore (equivalent to [a-zA-Z0-9_])

    • (?1), (?2), …: substring (given as additional argument to CSeries.fullmatch())

    Regular expressions are represented by the class crandas.re.Re. It uses pyformlang’s functionality under the hood.
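
    The difference between substring search and full matching can be illustrated with Python's standard re module on the same strings. This is an analogy rather than crandas code: the operators shown behave the same way in Python's re, but the (?1) substring placeholders are specific to crandas and are not covered here.

```python
import re

values = ["this", "is", "a", "text", "column"]

# Substring search: true wherever "is" occurs anywhere in the string.
contains_is = ["is" in value for value in values]

# Full match: the pattern must describe the entire string, so "t.*" means "starts with t".
starts_with_t = [re.fullmatch(r"t.*", value) is not None for value in values]

# Union, grouping and a character range combined in one pattern.
short_or_t_word = [re.fullmatch(r"(is|a)|t[a-z]+", value) is not None for value in values]

print(contains_is)      # [True, True, False, False, False]
print(starts_with_t)    # [True, False, False, True, False]
print(short_or_t_word)  # [True, True, True, True, False]
```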

  • Efficient text operations for ASCII strings

    The varchar ctype now has an ASCII mode for increased efficiency with strings that contain only ASCII characters (no “special” characters; all codepoints <= 127). Before this change, we only supported general Unicode strings. Certain operations (in particular comparison, searching, and regular expression matching) are more efficient for ASCII strings.

    By default, crandas autodetects whether or not the more efficient ASCII mode can be used. This information (whether or not ASCII mode is used) becomes part of the public metadata of the column, and crandas will give a ColumnBoundDerivedWarning to indicate that the column metadata is derived from the data in the column, unless auto_bounds is set to True.
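
    The ASCII criterion (all codepoints <= 127) can be checked with a few lines of plain Python. The helper detect_varchar_mode is hypothetical and only mirrors the rule stated above; it is not how crandas is known to implement its autodetection.

```python
def detect_varchar_mode(values):
    """Return "ascii" if every codepoint in every string is <= 127, else "unicode"."""
    if all(ord(ch) <= 127 for value in values for ch in value):
        return "ascii"
    return "unicode"


print(detect_varchar_mode(["string"]))               # ascii
print(detect_varchar_mode(["stri\U0001F600ng"]))     # unicode
```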

    Instead of auto-detection, it is also possible to explicitly specify the ctype varchar[ascii] or varchar[unicode], e.g.:

    import crandas as cd

    # ASCII autodetected: efficient operations available; warning given
    cdf = cd.DataFrame({"a": ["string"]})

    # Unicode autodetected: efficient operations not available; warning given
    cdf = cd.DataFrame({"a": ["stri\U0001F600ng"]})

    # ASCII annotated; efficient operations available; no warning given
    cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[ascii]"})

    # Unicode annotated; efficient operations not available; no warning given
    cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[unicode]"})
    
  • Running computations can now be cancelled

    Locally aborting a computation (e.g. Ctrl+C) will now cause it to be cancelled on the server as well.

  • All column types now support missing values

    All ctypes now support a nullable flag, indicating that values may be missing. It may also be specified using a question mark, e.g. varchar?.

  • Progress reporting for long-running queries

    Queries that take at least 5 seconds now result in a progress bar being displayed that estimates the progress of the computation.

    To enable this in Jupyter notebooks, crandas should be installed with the notebook dependency flag (see below).

  • Various memory improvements for both server and client

  • Large data uploads and downloads are now automatically chunked

    Uploads are processed in batches of size crandas.ctypes.ENCODING_CHUNK_SIZE.
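
    Chunking in general can be sketched as below. The helper iter_chunks and the chunk size of 4 are illustrative only; the actual value of ENCODING_CHUNK_SIZE and the way crandas splits uploads are not shown here.

```python
def iter_chunks(rows, chunk_size):
    """Yield consecutive slices of rows, each holding at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


# Ten rows in batches of four: the last batch is shorter.
batches = list(iter_chunks(list(range(10)), chunk_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```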

  • Added a date column type

    • Dates can now be encoded using the date ctype.

    • Dates are limited to the range 1901/01/01 to 2099/12/31 for leap year reasons

    • Ability to subtract two dates to get number of days and add days to a date

    • All comparison operators apply for date

    • Added functions to extract year, month, day and weekday

    • Able to group over dates, merge and filter
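
    The date operations listed above mirror what Python's standard datetime module offers, so a plain-Python sketch may help to show the expected arithmetic. It does not use the crandas date ctype. The final check is one plausible reading of the "leap year reasons" behind the supported range (an assumption, not confirmed here): between 1901 and 2099 every year divisible by 4 is a leap year, which is not true for 1900 or 2100.

```python
import calendar
from datetime import date, timedelta

start = date(2024, 2, 28)
end = date(2024, 3, 1)

# Subtracting two dates gives a number of days (2024 is a leap year).
days_between = (end - start).days

# Adding days to a date gives a new date.
ten_days_later = start + timedelta(days=10)

# Extracting year, month, day and weekday (Python counts Monday as 0).
parts = (start.year, start.month, start.day, start.weekday())

# Within 1901..2099 the simple "divisible by 4" leap year rule always holds.
simple_rule_holds = all(
    calendar.isleap(year) == (year % 4 == 0) for year in range(1901, 2100)
)

print(days_between, ten_days_later, parts, simple_rule_holds)
```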

  • Support with_threshold for aggregation

    This adds support for e.g. table["column"].with_threshold(10).sum(). Before this change, with_threshold() was only supported for filtering operations, e.g. table[filter.with_threshold(5)], and not for aggregation operations (min, max, sum, etc.).