1.14.0

Crandas

New features

  • Major performance improvements to fixed point division. Its performance is improved by around a factor 10.

  • Add support for random forests (see crandas.crlearn.ensemble.RandomForestClassifier).

  • Add conversion from non-nullable column types to nullable column types.

  • Add save mode to StateObject.save() to only save an object if it does or does not exist.

  • Add progress reporting for unsafe many-to-many join.

  • Add support for groupby on nullable columns when using dropna = False.

  • Add function cd.get() to retrieve an object from the engine regardless of its type (to be used in place of cd.get_table, etc.).

  • Add col.any() and col.all() functions for boolean columns.

  • Add compatibility with NumPy versions 1 and 2

  • Add support for conversion of fixedpoint column types to fixedpoint column types with lower precision.

  • Add support for conversion of fixedpoint column types to integer column types.

  • Add support for non-string column names and constructing a crandas dataframe from a numpy array.

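  • Add a more powerful interface for machine learning models, including linear regression models, via the CModel class. This interface allows models to be uploaded and downloaded, their parameters to be retrieved and set, and their public attributes to be retrieved.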

Changes

  • Allow the use of **query_args in obj.remove().

  • Report size of data in CDataFrame in its Jupyter/command-line representation.

  • Improve dry-run mode such that table uploads and downloads can be simulated without needing a connection to the engine.

  • BREAKING Inner product function col1.inner(col2) requires that both col1 and col2 have equal elements_per_value.

  • BREAKING The inner product function on vectors col1.inner(col2) (e.g., ctypes int_vec or fp_vec) now requires that both col1 and col2 have vectors of equal length, and otherwise raises an error. The old behavior was to have an inner product of the first n entries of both vectors, where n is the smallest of the two vector lengths.
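The difference between the old and the new behavior can be illustrated on plain Python lists. This is only a sketch of the semantics described above, not crandas code; the function names inner_truncating and inner_strict are hypothetical.

```python
def inner_truncating(u, v):
    """Old behavior: inner product over the first n entries, n = min(len(u), len(v))."""
    return sum(a * b for a, b in zip(u, v))


def inner_strict(u, v):
    """New behavior: vectors of unequal length are rejected with an error."""
    if len(u) != len(v):
        raise ValueError(f"vector lengths differ: {len(u)} != {len(v)}")
    return sum(a * b for a, b in zip(u, v))


print(inner_truncating([1, 2, 3], [4, 5]))  # 1*4 + 2*5 = 14; the 3 is silently ignored
print(inner_strict([1, 2, 3], [4, 5, 6]))   # 1*4 + 2*5 + 3*6 = 32
```

Code that relied on the silent truncation now has to shorten the longer vector explicitly before taking the inner product.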

  • Get rid of ClientError in crandas, instead raising a relevant Python error or the new ConnectionFileError or ModelError.

  • BREAKING Change ServerError to EngineError, adding new error codes and error structure.

  • When the computation of the variance in tab.describe() results in an overflow, return NaN instead of raising an exception.

  • Column information from table.columns is now better styled and reports data sizes per column.

  • Remove crandas.get2() that was only used for testing.

  • Remove support for slices with negative step sizes, which could give incorrect results.

Fixes

  • Fix parsing of fixed-point bounds of the form fp[min=0,max=1e+20] (with a + in the exponent) which were incorrectly rejected.
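To show the kind of input this fix concerns, the sketch below parses a bounds specification with a regular expression that accepts an explicit sign in the exponent. The pattern and the helper parse_fp_bounds are hypothetical and written for this illustration only; they are not the parser used by crandas.

```python
import re

# Hypothetical pattern: a float with an optional sign and an optional exponent,
# where the exponent itself may carry an explicit "+" or "-".
FLOAT = r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?"
FP_BOUNDS = re.compile(rf"fp\[min=({FLOAT}),max=({FLOAT})\]")


def parse_fp_bounds(spec):
    """Return (min, max) as floats, or raise ValueError if spec is malformed."""
    match = FP_BOUNDS.fullmatch(spec)
    if match is None:
        raise ValueError(f"cannot parse fixed-point bounds: {spec!r}")
    return float(match.group(1)), float(match.group(2))


print(parse_fp_bounds("fp[min=0,max=1e+20]"))  # (0.0, 1e+20)
```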

  • Fix performance issues of col.min() and col.max().

  • Fix bug where the repr/Jupyter representation of tables with unknown structure (e.g., from running get_table in a transaction or in dry-run mode) would throw an error.

  • Fix bug where, in dry-run mode, the length of the randomly generated handles would be incorrect.

  • Fix bug where, if a placeholder was used in a script execution but no placeholder was used in the script recording, mismatches would not be detected.

  • Fix equality check between a ReturnValue and a constant like col.sum(mode="regular") == 0.

  • Fix problem where performing a concat on a very large number of tables could result in I/O errors.

  • Fix incorrect slices on bytes columns in some cases.

  • Fix problem preventing uploading large nullable integers.

  • Fix bug where print_json query argument did not work outside of script recording.

  • Fix progress reporting for logistic regression.

  • Fix bug in function_to_json to compute correct ctype bounds.

1.13.0

Crandas

  • Improve progress reporting for multi-column groupby and sort_values.

  • Add integer to string conversion using int_col.astype("varchar").

  • cd.read_parquet now accepts keyword arguments that get passed through to pyarrow.Table.to_pandas/ParquetFile

  • Add functions sum_squares(), mean(), var(), count(), min(), max() to work on the result of series functions. For example, the following now works:

    table = cd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, auto_bounds=True)
    y = (table["a"] * table["b"]).mean()
    

    Previously, these functions only worked directly on columns of a DataFrame. Note that all of these functions take a mode keyword argument, which defaults to "open" (i.e. open the resulting value). To keep the result secret-shared, specify mode="regular".

    BREAKING: the sum() function was already implemented, but its mode argument is now keyword-only instead of positional. That is, (table["a"] * table["b"]).sum("regular") now needs to be specified as (table["a"] * table["b"]).sum(mode="regular").
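The effect of a keyword-only argument can be reproduced in plain Python. The class SeriesSketch below is a hypothetical stand-in and not part of crandas; it only shows why the positional call now fails.

```python
class SeriesSketch:
    """Hypothetical stand-in for the result of a series function."""

    def __init__(self, values):
        self.values = list(values)

    def sum(self, *, mode="open"):
        # The bare "*" makes every following parameter keyword-only.
        return (mode, sum(self.values))


s = SeriesSketch([1 * 4, 2 * 5, 3 * 6])
print(s.sum(mode="regular"))  # ('regular', 32)

try:
    s.sum("regular")  # positional use is rejected by Python itself
except TypeError as exc:
    print("TypeError:", exc)
```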

  • Fix a problem where, in some Python configurations (e.g., from the Spyder Console), the time counter of the progress bar could stall

  • If crandas loses its connection to the server while running a query, it will retry instead of immediately raising an error (see wait_for_reconnect configuration option)

  • Make information shown in the progress bar cleaner

  • BREAKING: Allow a script to continue running after a NoMatchingAuthorizationError. Previously, in case of a mismatch during the use of a recorded script, it was not possible to fix the mismatch and continue using the script. This was because crandas moved on to try the next step of the script, even if a step failed. Now, crandas will keep trying the authorization for the failed step. It is still possible to manually skip a script step by using cd.script.current_script().skip().

  • Crandas can now connect to the same VDL server instance either in authorized or in design mode.

    • The mode is typically specified by the connection file and can be retrieved using cd.base.session.mode.

    • A warning is now given when connecting in design mode. This warning can be disabled using the warn_design setting.

    • BREAKING: crandas.tool.package_connection_file script now needs --design, --authorized, or --no-mode to indicate which mode to use

  • Add unsafe many-to-many join cd.unsafe.merge_m2m. This join function is unsafe because it leaks statistical information to the servers; see documentation for details. Use with care.

  • Removed support for bitlength query argument (deprecated since version 1.4.0)

  • When using crandas in Jupyter, outputting a CDataFrame now is better styled using HTML

  • Change the recommended extension of a script recording from .json to .recording.

  • Fix predict_proba to work on small numbers of rows (smaller than the number of features)

  • Fix predict and predict_proba functionality for ordinal logistic regression. Previously these functions could return incorrect results for this regression type.

  • Major improvements to fixed point multiplication. Its performance is improved by around a factor 10 with network usage similar to that of integer multiplication.

  • Fix bug where, if no connection file is found, crandas would give an error message saying that “there are multiple connection files present”

  • Provide more detailed error messages for missing authorization, including a high-level hint for some frequently occurring mismatches

  • BREAKING: The crandas.experimental.stats module has been marked as stable and was moved to crandas.stats. In addition:

    • The stats.rankdata() function was added

    • The stats.tiecorrect() function was added

    • The output of stats.chi2_contingency() was extended to also provide the field expected_freq, which contains the expected frequencies of the observations. The structure remains backwards-compatible otherwise. In addition, accuracy was improved.

    • The Kruskal-Wallis test stats.kruskal() was generalized to properly handle duplicates. The option allow_duplicates has been removed since it is no longer applicable. The function performs stricter input validation to ensure the test is well-defined.

    • The stats.chisquare() and stats.chi2_contingency() functions now perform stricter input validation to ensure the test is well-defined and can be evaluated accurately.

    • The stats.contingency.expected_freq() function performs stricter input validation to ensure it can be computed accurately. In addition, accuracy was improved.
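For reference, the expected frequencies of a contingency table under independence follow the standard formula row total × column total / grand total. The sketch below computes them on plain Python lists; it illustrates the definition only and says nothing about how crandas evaluates it on secret-shared data.

```python
def expected_freq(observed):
    """Expected cell counts under independence: row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    if grand_total == 0:
        raise ValueError("contingency table must contain at least one observation")
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


observed = [[10, 20], [30, 40]]
print(expected_freq(observed))  # [[12.0, 18.0], [28.0, 42.0]]
```

The explicit check on the grand total mirrors, in a very small way, the kind of input validation mentioned above: the test is not well-defined for an empty table.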

1.12.3

1.12.2

Crandas

  • Fix a bug where reducing the bounds (e.g., using astype) would break the server when it tried to switch to a smaller internal representation.

  • Allow fixedpoint ctype specification to define min and max as floats. For example:

    table = cd.DataFrame({"vals": [0.75, 1.25, 2.25]}, ctype={"vals": "fp[min=0.5,max=2.5]"})
    
  • BREAKING: Statistics functionality is moved to the cd.experimental namespace because of potential inaccuracies for large inputs

  • Add cd.experimental.stats.contingency.crosstab for computing contingency tables

  • Add crandas.CSeries.as_series to obtain a ReturnValue for the series

  • Add crandas.experimental.stats.chi2_contingency for performing a chi-square test on a contingency table

  • Add crandas.experimental.stats.contingency.expected_freq for computing expected frequencies

  • Fix crandas.experimental.stats.ttest_ind() to work with unequal variances for larger data

  • Fix crandas.experimental.stats.chisquare to provide better error message when input contains zeros

  • Added the crandas.compat, crandas.crypto and crandas.experimental.stats packages in pyproject.toml via automatic package discovery.

1.12.1

Crandas

  • Fix a bug in fixed-point multiplication; in certain cases this bug caused incorrect fixed-point multiplication results.

  • Fix an issue related to the printing of the min and max value of a fixed-point ctype

  • Fix bug where, in some cases, a schema mismatch between design and production would cause an exception rather than a user-friendly error message

1.12.0

Crandas

  • Major performance improvements for comparisons and related operations like sort_values and groupby. The performance of these operations is improved by around a factor 10.

  • Fix bugs in substitute in the following edge-cases:

    • Fix crash when the output_size parameter is set too high.

    • Fix potential invalid substitutions when Unicode substitutions are applied to ASCII strings.

  • Add support for fixed-point arrays, including array functions inner and vsum.

  • Add support for cd.cut on fixed-point values, bins and labels. For example:

    table = cd.DataFrame({"vals":[1.1, 2.1, 2.2, 3.3]}, ctype={"vals":"fp[precision=30]"})
    table = table.assign(cut=cd.cut(table["vals"], [1.5,2.5], labels=[0.1,0.2,0.3], add_inf=True))
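The binning in this example can be mimicked in plain Python. The sketch assumes that add_inf=True extends the bin edges with minus and plus infinity and that intervals are closed on the right, as in the default of pandas.cut; both are assumptions about the semantics, and cut_with_inf is a hypothetical helper.

```python
from bisect import bisect_left


def cut_with_inf(values, bins, labels):
    """Map each value to the label of its bin, with bins extended to (-inf, +inf).

    Assumes right-closed intervals, as in the default of pandas.cut.
    """
    if len(labels) != len(bins) + 1:
        raise ValueError("need exactly one more label than bin edges")
    return [labels[bisect_left(bins, v)] for v in values]


print(cut_with_inf([1.1, 2.1, 2.2, 3.3], [1.5, 2.5], [0.1, 0.2, 0.3]))
# [0.1, 0.2, 0.2, 0.3]
```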
    
  • Fix bug which prevented uploading bytes columns with more than 100 000 values.

  • Add support for Base64 encoding and decoding: bytes_col.b64encode() and string_col.b64decode():

    table = cd.DataFrame({"base64": ["TWFu", "bGlnaHQgd29yaw=="]}, auto_bounds=True)
    table = table.assign(bytes=table["base64"].b64decode()) # table["bytes"] set to [b"Man", b"light work"]
    table["bytes"].b64encode() # equal to table["base64"]
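The values in the comments above can be checked locally with Python's standard base64 module, which works on plain (non-secret) data:

```python
import base64

encoded = ["TWFu", "bGlnaHQgd29yaw=="]
decoded = [base64.b64decode(s) for s in encoded]
print(decoded)  # [b'Man', b'light work']

# Encoding the bytes again reproduces the original strings.
round_trip = [base64.b64encode(b).decode("ascii") for b in decoded]
print(round_trip == encoded)  # True
```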
    
  • Add support for performing a groupby on more than 100 000 unique values.

  • Add support for computing the square root cdf["col"].sqrt()

  • Add module crandas.stats for performing a number of statistical tests

  • override_version_check can now be supplied to the Session constructor, so cd.connect("connection-file", override_version_check=False) now works. Additionally, the CRANDAS_OVERRIDE_VERSION_CHECK environment variable was moved to the cd.config framework.

  • Support mode and other query arguments for CSeries.sum(), e.g., so that the following now works without opening values:

    table = cd.DataFrame({"values": [1, 2, 3]})
    x = (table["values"] ** 2).sum(mode="regular")
    # x is not opened
    
  • Improve support for using multiple sessions at the same time with the session query argument and the use of a session as a context handler

    BREAKING: A check is introduced to ensure that objects from different sessions are not mixed. (Previously, if multiple sessions to the same endpoint were used, it was possible to mix objects.)

1.11.0

Crandas

  • Preserve len(df) of pandas DataFrames without columns

  • Support for the concatenation of strings. For example:

    table = cd.DataFrame({"first_name": ["John", "Jan"], "last_name": ["Doe", "Jansen"]}, auto_bounds=True)
    full_names = table["first_name"] + " " + table["last_name"]
    full_names.open()
    
  • Add support for upper(), similar to the existing lower(). In addition, it is now also possible to only change the case of specific indices:

    table = cd.DataFrame({"name": ["john", "Jansen"]}, auto_bounds=True)
    table["name"].upper([0]).open() # Returns ["John", "Jansen"]
    table["name"].upper([1, 3, 5]).open() # Returns ["jOhN", "JAnSeN"]
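The index-based behavior shown above corresponds to the following plain-Python sketch. The helper upper_at is hypothetical; ignoring indices beyond the end of a string is inferred from the "john" example, which has no character at index 5.

```python
def upper_at(text, indices=None):
    """Uppercase the whole string, or only the characters at the given indices.

    Indices beyond the end of the string are ignored.
    """
    if indices is None:
        return text.upper()
    wanted = set(indices)
    return "".join(c.upper() if i in wanted else c for i, c in enumerate(text))


names = ["john", "Jansen"]
print([upper_at(n, [0]) for n in names])        # ['John', 'Jansen']
print([upper_at(n, [1, 3, 5]) for n in names])  # ['jOhN', 'JAnSeN']
```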
    
  • crandas.tool.check_connection now cleans up tables after running a dummy computation

  • Perform parquet uploads through read_parquet

  • Improved implementation of cd.merge:

    • Add support for right and one-to-many joins. Now, any combination of left/right/inner/outer one-to-one/one-to-many/many-to-one joins is supported.

    • Improved support for nullable columns. When performing a non-inner join, now nullable columns are output (previously, a non-nullable column with the value zero was returned in some cases)

    • Deal with key columns with different column names consistently with pandas, e.g., when joining with left_on="a" and right_on="b", return two columns with the left and right key values, respectively

    • Add support for suffixes

    • Improved progress reporting for one-to-many joins

    • Allow trivial grouping-based join where the right table only has key columns

    BREAKING: The signature of the cd.merge function has not changed, but, because of the above changes, the resulting table may have a different set of columns and/or columns of different types than in previous versions. Moreover, the underlying VDL API command has been renamed and changed internally so that existing authorizations do not apply to the new merge. The old merge is available via crandas as cd.compat.merge_v1.

  • Fix bug in bytes column when using non-ASCII characters. Opening such values could give incorrect results.

  • Add support for (fixed-point) division (/). For example:

    table = cd.DataFrame({"num": [1.2, 3.4, 5.6], "denom": [2.1, 4.3, 5.4]}, auto_bounds=True)
    table = table.assign(div = lambda x: x.num / x.denom)
    table = table.assign(rec_num = lambda x: 1 / x.num) # computing the reciprocal of num
    table.open()
    
  • Add support for strip() on string columns, which will remove the leading and trailing spaces.

  • Add support for floor_division (//).

  • Add support for RIPEMD-160:

    import crandas.crypto.hash as hash
    
    tab = cd.DataFrame({"a": [b"Test 1", b"Test 2"]}, auto_bounds=True)
    h = hash.RIPEMD_160
    h.digest(tab["a"]).open()
    
    # HMAC is also supported
    tab_key = cd.DataFrame({"key": [bytes.fromhex("0123456789abcdef0123456789abcdef01234567")]}, auto_bounds=True)
    hmac = hash.HMAC_RIPEMD_160(tab_key["key"].as_value())
    hmac.digest(tab["a"]).open()
    
  • Add support for AES encryption:

    import crandas.crypto.cipher as cipher
    
    table = cd.DataFrame({"a": [bytes.fromhex("00112233445566778899aabbccddeeff")] * 2}, ctype={"a": "bytes[16]"})
    tab_key = cd.DataFrame({"key": [bytes.fromhex("000102030405060708090a0b0c0d0e0f")]}, ctype={"key": "bytes[16]"})
    aes_128 = cipher.AES_128(tab_key["key"].as_value())
    aes_128.encrypt(table["a"]).open()
    
  • Add support for len and slicing on bytes columns: bytes_col.len() and bytes_col[16:32]

  • Add support for conversion to and from lowercase hex strings: string_col.hex_to_bytes() and bytes_col.bytes_to_hex()

  • Add support for encoding ASCII strings as bytes: ascii_col.encode()

  • Add bitwise operations on bytes columns: AND (&), OR (|), XOR (^), NEGATE (~)

  • Add support for string substitution:

    table = cd.DataFrame({"a": ["PÆR", "á"]}, auto_bounds=True)
    table["a"].substitute({"a": ["á", "à", "ä"], "AE": ["Æ"]}, output_size=4).open()
    
  • Add support for filtering characters:

    table = cd.DataFrame({"a": ["Test string", "More"]}, auto_bounds=True)
    table["a"].filter_chars(["a", "e", "i", "o", "u"]).open()
    
  • Add support for reading stored analyst, approver, and server keys in PEM format

  • Fix bug where uploading a series with only NULL values would give an error

  • Fix bug where repr(cdf), str(cdf) would not deal correctly with zero-row dataframes

  • We move auto_bounds from a Session property to be a configuration variable (using crandas.config.settings.auto_bounds). Having this variable set to True suppresses data-derived column bound warnings by default. Each session object now has a deprecated auto_bounds property that gets/sets the configuration variable.

    BREAKING: This breaks the possibility of a user having two concurrent sessions with different auto_bounds values set.

  • Allow to provide pd.read_csv arguments (e.g., delimiter) as arguments to cd.read_csv

  • Warn user when calling crandas in a conditional context (e.g., an if statement) during script recording. See documentation of the crandas.check_recording module for details.

  • Warn users to specify a validate argument when using merge during script recording. See documentation of the crandas.check_recording module for details.

  • Allow to specify ctype as argument to cd.Series

  • Expose ctype and schema of a column as properties of the classes Col (cdf.columns.cols[ix]) and CSeriesColRef (cdf["col"]); and of a CDataFrame

  • Add function CDataFrame.astype() that converts the type of individual columns (via ctype parameter) or the full CDataFrame (via schema parameter)

  • Add schema parameters to upload_pandas_dataframe, read_csv, DataFrame, read_parquet functions. For ctype parameter, warn if the corresponding column does not exist

  • Add functions pandas_dataframe_schema and read_csv_schema that return the schema corresponding to a DataFrame or CSV file

  • A server-side schema check for get_table is introduced. When get_table is used in a script, the schema of the resulting table is stored in the recorded script. When the script is used, a server-side check for adherence to the schema is performed.

    BREAKING: using get_table in a script where the tables do not match between recording and using the script, now produces an error; see documentation of get_table for details

  • Add tilde expansion to cd.base.Session.connect()

  • Improved error messages for: using get_table on a non-dummy handle in script recording; invalid arguments to cut, e.g. non-integer bins or labels; sending unauthorized queries where authorization is needed; invalid how argument to merge; use of None-like values in functions (e.g., x.if_else(y, None)); use of unknown ctypes (e.g., ctype={"a": "str"}); uploading fixed-point columns where integers may be intended (e.g., uploading pd.Series([1, 2, None]))

  • Fix bug where the use of a value placeholder (e.g., cdf.assign(b=lambda x: x.a + cd.placeholders.Any(1))) would in many cases not work

1.10.2

This is a bugfix release.

Crandas

  • We update the pyformlang dependency to fix bugs in character ranges

1.10.1

This is a bugfix release.

Crandas

  • We give better errors when receiving unexpected responses from the server.

  • We fix a performance regression of the groupby operation, when it is performed on a single F64 column. It is now again as fast as in version 1.9.

1.10.0

The major new feature is expanded support for fixed-point columns.

Crandas

  • Expanded support for fixed-point columns:

    • Fixed point columns now support larger range and precision (96 bits).

    • Fixed point columns now support various statistical functions (min(),max(),sum(),sum_squares(), mean(), var()).

    • Support for arithmetic operations between two fixed point columns, and between fixed-point and integer columns is added. (NB: we do not yet support division; this will be added in a later release.)

    • Support for concatenation of integer and fixed point columns (resulting in a fixed-point column) is added.

    • Support for join and filtering on fixed point columns is added.

    • Parsing of floats in column operations, such as filters or assign, is supported.

  • The new dropna function removes rows with any missing values from a CDataFrame.

  • The new save can be used to save an object such as a CDataFrame. If persistence is enabled on the server, this means that the object is kept across server restarts. The save command may also be used to attach a name to a computed table, e.g. table.save(name="my_table").

  • The connection file and Session now both have an optional api_token property. This is sent to the server and may be used for authentication purposes.

  • The functions obj.remove() and cd.remove_objects() have been changed to provide more information in case non-existent object(s) are removed.

    BREAKING: when removing multiple objects using cd.remove_objects(lst), the new behavior is to try to remove all objects even if errors are encountered. The old behavior was to abort on the first error. See the documentation for details.
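The change in control flow can be sketched as follows. This is not the crandas implementation: remove_all and the callable remove_one are hypothetical, and how crandas actually reports the collected errors is described in its documentation.

```python
def remove_all(handles, remove_one):
    """Try to remove every handle; collect failures instead of aborting on the first.

    remove_one is a hypothetical callable that raises KeyError for unknown handles.
    """
    removed, errors = [], {}
    for handle in handles:
        try:
            remove_one(handle)
            removed.append(handle)
        except KeyError as exc:
            errors[handle] = exc
    return removed, errors


store = {"A1": "table", "C3": "table"}
removed, errors = remove_all(["A1", "B2", "C3"], store.pop)
print(removed)       # ['A1', 'C3']
print(list(errors))  # ['B2']
```

Under the old abort-on-first-error behavior, the object C3 in this example would have been left in place.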

1.9.2

1.9.1

Crandas

No changes.

1.9.0

Crandas

  • The Session object now has two settings modes, depending on whether an engine connection file is used (recommended method), or whether the endpoint, certificate, and server public keys are specified manually (legacy method). These are reflected in the settings_mode attribute of the Session object.

    When endpoint is set by the user, the Session is set to legacy mode; otherwise, the connection file method is assumed. When the user does not configure anything, the default is to load the default.vdlconn file, residing in the configuration folder (default: ~/.config/crandas, overridable by the CRANDAS_HOME environment variable). The name default.vdlconn can be overridden through the default_connection_file variable. If that file is not present, crandas scans the configuration folder for files with the extension .vdlconn. If there is a single file, it is used. If there are multiple, an error is raised.

    analyst_key is now a read-write property that returns the nacl SigningKey, and can be set to either a SigningKey, a filename, a path, or None. When set to None, the default key will be loaded. Both the default key file and the default relative path depend on the settings mode. In connection file mode, the default key file is analyst.sk; a path (a Path, or a string that includes a slash “/”) is resolved relative to the current working directory, whereas a filename (a string that does not include a slash) is assumed to reside in the configuration folder. In legacy mode, the default key file is clientsign.sk and paths are relative to base_path (to maintain backwards compatibility).
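The lookup order for the connection file described above can be summarized in a short sketch. The function find_connection_file is hypothetical and not the crandas implementation; the exception types, and the handling of a folder without any .vdlconn file, are assumptions made for the illustration.

```python
import tempfile
from pathlib import Path


def find_connection_file(config_dir, default_name="default.vdlconn"):
    """Pick the connection file following the lookup order described above."""
    config_dir = Path(config_dir)
    default = config_dir / default_name
    if default.is_file():
        return default
    candidates = sorted(config_dir.glob("*.vdlconn"))
    if len(candidates) == 1:
        return candidates[0]
    if not candidates:
        # Assumption: the release note does not say what happens in this case.
        raise FileNotFoundError(f"no .vdlconn file found in {config_dir}")
    raise RuntimeError(f"multiple connection files found in {config_dir}")


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "engine.vdlconn").write_text("")
    print(find_connection_file(tmp).name)  # engine.vdlconn
```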

  • Besides the Session object, which is used to configure the connection to the engine, we introduce Dynaconf for user configuration for settings that are not directly related to the connection. The new method provides an easy way for the user to set variables, either using code, using environment variables, or using a settings file (default: settings.toml in the same configuration folder referred to above).

  • We make displaying progress bars configurable using the show_progress_bar and show_progress_bar_after (for the delay in seconds) variables.

  • To make the configuration folder and display the folder in the user’s file browser, the user can now call python -m crandas config.

  • We support the Any placeholder for get_table

  • We support stepless mode in scripts, that can be manually enabled to remove script_step numbers from certain queries. This can be useful together with the Any placeholder, to have queries that can be executed a variable number of times.

  • Add a map_dummy_handles override in call to get_table

  • In CDataFrame.assign, we now support the use of column names that correspond to engine query arguments (e.g. “name”, “bitlength”).

    BREAKING: existing scripts that use these engine query arguments will now give an error message explaining how these arguments should be specified. Existing authorizations are not affected.

  • Add support for the following operators in regular expressions:

    • {n}: match exactly n times

    • {min,}: match at least min times

    • {,max}: match at most max times

    • {min,max}: match at least min and at most max times
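Python's built-in re module uses the same notation for these repetition operators, so their meaning can be tried out locally. The sketch below uses re rather than crandas.re.Re and therefore only illustrates the notation, not the secure matching itself.

```python
import re

patterns = {
    "a{2}": ["aa"],            # exactly 2 times
    "a{2,}": ["aa", "aaa"],    # at least 2 times
    "a{,2}": ["", "a", "aa"],  # at most 2 times
    "a{1,2}": ["a", "aa"],     # at least 1 and at most 2 times
}
candidates = ["", "a", "aa", "aaa"]

for pattern, expected in patterns.items():
    matched = [s for s in candidates if re.fullmatch(pattern, s)]
    print(pattern, matched == expected)  # prints True for every pattern
```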

  • Support was added to disable HTTP Keep-Alive in connections to the engine server. This can help solve connection stability issues. Keep-Alive can be disabled in the connection file by setting keepalive = false. The setting can be overridden by the user by using the keepalive parameter of crandas.connect.

  • Add sort_values function to a CDataFrame, which sorts the dataframe according to a column. Example:

    cdf = cd.DataFrame({"a": [3, 1, 4, 5, 2], "b": [1, 2, 3, 4, 5]}, auto_bounds=True)
    cdf = cdf.sort_values("a")
    

    Currently, sorting on strings is not supported.

  • Add support for groupby on multiple columns and on all non-nullable column types.

    For example, this is now possible:

    cdf = cd.DataFrame({"a": ["foo", "bar", "foo", "bar"], "b": [1, 1, 1, 2]}, auto_bounds=True)
    tab = cdf.groupby(["a", "b"]).as_table()
    sorted(zip(tab["a"].open(), tab["b"].open()))
    

    The parameter name of the groupby is renamed from col to cols to reflect these changes. Currently, a maximum of around 100 000 unique values is supported. Above that, the groupby will fail and give an error message. Note that this is the number of unique values; the number of rows can be significantly higher as long as there are fewer than 100 000 different values in the groupby column(s). Furthermore, a consequence of the new implementation is that the output order is no longer stable but random.

  • Add k-nearest neighbors functionality. This allows the target value of a new data point to be predicted based on the existing data using its k nearest neighbors. Example:

    import crandas as cd
    from crandas.crlearn.neighbors import KNeighborsRegressor
    X_train = cd.DataFrame({"input": [0, 1, 2, 3]}, auto_bounds=True)
    y_train = cd.DataFrame({"output": [0, 0, 1, 1]}, auto_bounds=True)
    X_test = cd.DataFrame({"input": [1]}, auto_bounds=True)
    neigh = KNeighborsRegressor(n_neighbors=3)
    neigh.fit(X_train, y_train)
    neigh.predict_value(X_test)
    

    For more information, see crandas.crlearn.neighbors.KNeighborsRegressor.
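For intuition, the same prediction can be computed on plain Python lists. The sketch assumes a single feature, absolute distance, and an unweighted mean over the k nearest targets; knn_regress is a hypothetical helper, and the defaults of the crandas regressor may differ.

```python
def knn_regress(x_train, y_train, x_new, k):
    """Predict the target for x_new as the mean target of its k nearest neighbours."""
    if not 1 <= k <= len(x_train):
        raise ValueError("k must be between 1 and the number of training points")
    by_distance = sorted(zip(x_train, y_train), key=lambda pair: abs(pair[0] - x_new))
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k


prediction = knn_regress([0, 1, 2, 3], [0, 0, 1, 1], x_new=1, k=3)
print(prediction)  # neighbours 1, 0 and 2 have targets 0, 0 and 1, so the mean is 1/3
```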

  • Add a new aggregator crandas.groupby.any that takes any value from the set of values and is faster than crandas.groupby.max/crandas.groupby.min

  • In the HTTP connection to the engine server, use retries for certain HTTP requests to improve robustness

  • Add created property to dataframes and other objects indicating the date and time when they were uploaded or computed

  • Handle cancellation of a query by raising a QueryInterruptedError. This replaces the previous behaviour of returning None and printing “Computation cancelled”. In ipython, the “Computation cancelled” message is still shown.

  • In the progress bar for long-running computations, show “no estimate available yet” as long as progress is at 0% (instead of a more cryptic notation).

  • Add functionality to list uploads to the engine. For more information, see: crandas.stateobject.list_uploads and crandas.stateobject.get_upload_handles.

1.8.1

Crandas fixes

  • crandas.get_table() now ensures connect() is called first

  • Fix upload and decoding of positive 64-bit numbers. In crandas, trying to upload or download numbers in the range R = [2^{63}, 2^{64} - 1] would previously fail. We fix this issue by mimicking pandas behavior: a number in the range R is returned as an np.uint64. In addition, np.uint64, np.uint32, and np.uint16 are now recognized as integers when uploading.

1.8.0

Major new features include:

  • Support for bigger (96 bit) integers

  • Progress bars for running queries and the possibility of cancelling running queries

  • Memory usage improvements (client & server)

  • Null value (missing values) support for all column types

  • Searching strings using regular expressions

  • Added a date column type

New features

  • Support for columns with bigger (96 bit) integers

    Just like in the previous version, integers have the ctype int. When specifying the ctype, minimum and maximum bounds for the values can be supplied using the min and max parameters, e.g. int[min=0, max=1000]. Bounds (strictly) between -2^95 and 2^95 are now supported.

    For example, to upload a column "col": [1, 2, 3, 4] as an int use the following ctype spec:

    table = cd.DataFrame({"col":[1, 2, 3, 4]},  ctype={"col": "int[min=1,max=4]"})
    

    as before.

    To force usage of a particular modulus the integer ctype accepts the keyword argument modulus, which can be set to either of the moduli that are hardcoded in crandas.moduli. For example, to force usage of large integers one can run:

    from crandas.moduli import moduli
    table = cd.DataFrame({"col":[1, 2, 3, 4]},  ctype={"col": f"int[min=1,max=4,modulus={moduli[128]}]"})
    

    Notes:

    • crandas will automatically switch to int[modulus={moduli[128]}] if the (derived) bounds do not fit in an int32.

    • crandas will throw an error if the bounds do not fit in an int96.

    We refer to 32-bit integer columns as F64, and 96-bit integer columns as F128, because they are internally represented as 64-bit and 128-bit numbers, respectively, to account for a necessary security margin.
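The selection rule from the notes above can be written down as a small decision function. This is a sketch of the stated rule only: choose_field is a hypothetical name, and the exact 32-bit boundary used here (the usual signed int32 range) is an assumption.

```python
def choose_field(lo, hi):
    """Return "F64" or "F128" for integer bounds [lo, hi], or raise if they are too wide."""
    if lo > hi:
        raise ValueError("min must not exceed max")
    if -2**31 <= lo and hi < 2**31:
        return "F64"   # bounds fit in a 32-bit integer
    if -2**95 < lo and hi < 2**95:
        return "F128"  # bounds fit in a 96-bit integer
    raise OverflowError("bounds do not fit in a 96-bit integer")


print(choose_field(1, 4))      # F64
print(choose_field(0, 2**40))  # F128
```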

    Supported features for large integers:

    • Basic binary arithmetic (+, -, *, ==, <, >, <=, >=) between any two integer columns

    • Groupby and filter on large integers

    • Unary functions on large integer columns, such as mean(), var(), sum(), ...

    • if_else where the 3 arguments guard, ifval, elseval may be any integer column

    • Conversion from 32-bit integer columns to large integer columns via astype and vice versa

    • Vertical concatenation of integer columns based on different moduli

    • Performing a join on columns based on different moduli

    Current limitations:

    • We do not yet support string conversion to large integers

    • json_to_val currently only allows integers up to int32

    • IntegerList is currently only defined over F64

    Changes:

    • base.py: deprecated session.modulus

    • crandas.py: class Col and ReturnValue present also the modulus

    • ctypes.py:

      • added support to encode/decode integers of 128 bits

      • made ctype class decoding modulus dependent

    • input.py: mask and unmask are now dependent on the modulus

    • placeholders.py: class Masker now also contains a modulus

    • NEW FILE moduli.py: containing the default moduli for F64 as well as F128.

  • Searching strings and regular expressions

    To search a string column for a particular substring, use the CSeries.contains function:

    table = cd.DataFrame({"col": ["this", "is", "a", "text", "column"]})
    only_is_rows = table["col"].contains("is")
    table[only_is_rows].open()
    

    Regular expressions are also supported, using the new CSeries.fullmatch function:

    import crandas as cd
    import crandas.re
    table = cd.DataFrame({"col": ["this", "is", "a", "text", "column"]})
    starts_with_t = table["col"].fullmatch(cd.re.Re("t.*"))
    table[starts_with_t].open()
    

    Regular expressions support the following operations:

    • |: union

    • *: Kleene star (zero or more)

    • +: one or more

    • ?: zero or one

    • .: any character (note that this also matches non-printable characters)

    • (, ): regexp grouping

    • [...]: set of characters (including character ranges, e.g., [A-Za-z])

    • \d: digits (equivalent to [0-9])

    • \s: whitespace (equivalent to [ \t\n\r\f\v])

    • \w: alphanumeric and underscore (equivalent to [a-zA-Z0-9_])

    • (?1), (?2), …: substring (given as additional argument to CSeries.fullmatch())

    Regular expressions are represented by the class crandas.re.Re. It uses pyformlang’s functionality under the hood.
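
The difference between substring search and full-match semantics can be checked locally on plaintext data. The sketch below uses Python's standard re module and plain lists as a stand-in; it does not call crandas, and it assumes that CSeries.fullmatch follows the usual convention that the pattern must cover the entire string (which is consistent with the "t.*" example above).

```python
import re

values = ["this", "is", "a", "text", "column"]

# Substring search: true wherever "is" occurs anywhere in the string.
contains_is = [("is" in v) for v in values]

# Full match: the pattern must cover the whole string, so "t.*"
# selects exactly the strings that start with "t".
starts_with_t = [re.fullmatch(r"t.*", v) is not None for v in values]

# Without the trailing ".*", only the one-character string "t" would match.
only_t = [re.fullmatch(r"t", v) is not None for v in values]

print([v for v, keep in zip(values, contains_is) if keep])    # ['this', 'is']
print([v for v, keep in zip(values, starts_with_t) if keep])  # ['this', 'text']
```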

  • Efficient text operations for ASCII strings

    The varchar ctype now has an ASCII mode for increased efficiency with strings that contain only ASCII characters (no “special” characters; all codepoints <= 127). Before this change, only general Unicode strings were supported. Certain operations (in particular comparison, searching, and regular expression matching) are more efficient for ASCII strings.

    By default, crandas autodetects whether or not the more efficient ASCII mode can be used. This information (whether or not ASCII mode is used) becomes part of the public metadata of the column, and crandas will give a ColumnBoundDerivedWarning to indicate that the column metadata is derived from the data in the column, unless auto_bounds is set to True.

    Instead of auto-detection, it is also possible to explicitly specify the ctype varchar[ascii] or varchar[unicode], e.g.:

    import crandas as cd
    
    # ASCII autodetected: efficient operations available; warning given
    cdf = cd.DataFrame({"a": ["string"]})
    
    # Unicode autodetected: efficient operations not available; warning given
    cdf = cd.DataFrame({"a": ["stri\U0001F600ng"]})
    
    # ASCII annotated; efficient operations available; no warning given
    cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[ascii]"})
    
    # Unicode annotated; efficient operations not available; no warning given
    cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[unicode]"})
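
The condition behind the autodetection, as described above, is that every codepoint is at most 127. A minimal plaintext sketch of that check follows; the function name detect_varchar_mode is hypothetical, and the actual crandas detection logic may differ.

```python
def detect_varchar_mode(values):
    """Return "varchar[ascii]" if all codepoints are <= 127, else "varchar[unicode]"."""
    if all(ord(ch) <= 127 for s in values for ch in s):
        return "varchar[ascii]"
    return "varchar[unicode]"

print(detect_varchar_mode(["string"]))            # varchar[ascii]
print(detect_varchar_mode(["stri\U0001F600ng"]))  # varchar[unicode]
```

On Python 3.7 and later, the per-string test is equivalent to str.isascii().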
    
  • Running computations can now be cancelled

    Locally aborting a computation (e.g. Ctrl+C) will now cause it to be cancelled on the server as well.

    • Rename crandas.query to crandas.command to be consistent with the server-side implementation and to differentiate it from the new crandas.queries module

    • Add module crandas.queries, providing a client-side implementation of the task-oriented VDL query API, and use it for all queries performed via vdl_query. Queries use a block-then-poll strategy: first, a blocking query with a timeout of 5 seconds is performed; if the result is not ready by then, status update polls are done at a 1 second interval
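
The block-then-poll strategy can be sketched generically as follows. This is not the crandas.queries code: it waits on a local concurrent.futures.Future instead of a server-side task, and the function name block_then_poll is an assumption made for illustration. The defaults mirror the 5 second and 1 second values mentioned above.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def block_then_poll(future: Future, block_timeout: float = 5.0,
                    poll_interval: float = 1.0):
    """Wait for `future`: one blocking wait, then periodic status polls.

    Returns (result, number_of_polls).
    """
    try:
        # Phase 1: a single blocking wait with a timeout.
        return future.result(timeout=block_timeout), 0
    except FutureTimeout:
        pass
    # Phase 2: poll the status at a fixed interval until done.
    polls = 0
    while not future.done():
        time.sleep(poll_interval)
        polls += 1
    return future.result(), polls

def slow_task():
    time.sleep(0.3)
    return "done"

with ThreadPoolExecutor(max_workers=1) as pool:
    # Short timings so the demo finishes quickly.
    result, polls = block_then_poll(pool.submit(slow_task),
                                    block_timeout=0.05, poll_interval=0.05)
    # A task that finishes within the blocking phase needs no polls.
    fast_result, fast_polls = block_then_poll(pool.submit(lambda: 42))
```

Short computations return from the first blocking call without any extra round trips, while long ones fall back to cheap periodic status checks.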

  • All column types now support missing values

    All ctypes now support a nullable flag, indicating that values may be missing. It may also be specified using a question mark, e.g. varchar?.

  • Progress reporting for long-running queries

    Queries that take at least 5 seconds now result in a progress bar being displayed that estimates the progress of the computation.

    To enable this in Jupyter notebooks, crandas must be installed with the notebook dependency flag; see below.

  • Various memory improvements for both server and client

  • Large data uploads and downloads are now automatically chunked

    Uploads are processed in batches of size crandas.ctypes.ENCODING_CHUNK_SIZE.
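
The general idea of batching can be sketched in plain Python as below. The helper iter_chunks and the chunk size of 4 are illustrative assumptions; the real value of crandas.ctypes.ENCODING_CHUNK_SIZE and the way crandas splits its data are not shown here.

```python
from itertools import islice

def iter_chunks(rows, chunk_size):
    """Yield consecutive lists of at most `chunk_size` items from `rows`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

CHUNK_SIZE = 4  # illustrative value; not the real ENCODING_CHUNK_SIZE
chunks = list(iter_chunks(range(10), CHUNK_SIZE))
print([len(c) for c in chunks])  # [4, 4, 2]
```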

  • Added a date column type

    Dates can now be encoded using the date ctype.

    • Dates are limited to the range 1901/01/01 to 2099/12/31 for leap year reasons

    • Ability to subtract two dates to get number of days and add days to a date

    • All comparison operators apply for date

    • Created functions for year, month, day and weekday

    • Able to group over dates, merge and filter

    • New ctype DateCtype converts strings (through pd.to_datetime) and Python dates (datetime.date, datetime64 and pd.Timestamp) into crandas dates

    • Helper subclass of CSeries _DT allows for pandas-style calling of date retrieval functions (col.dt.year) and standard calls (col.year).
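
One plausible reading of the "leap year reasons" above (an assumption, not something these notes spell out) is that within 1901 to 2099 every year divisible by 4 is a leap year, so the century exceptions of the Gregorian calendar never apply. The sketch below checks this with Python's standard library and shows plaintext equivalents of the listed date operations; it does not use crandas, and the weekday numbering in crandas may differ from Python's.

```python
import calendar
from datetime import date, timedelta

# Within 1901..2099 the simple "divisible by 4" rule matches the full
# Gregorian rule; 1900 and 2100 are the nearest exceptions.
simple_rule_ok = all(
    (year % 4 == 0) == calendar.isleap(year) for year in range(1901, 2100)
)
print(simple_rule_ok)                                # True
print(calendar.isleap(1900), calendar.isleap(2100))  # False False

d1, d2 = date(2024, 3, 1), date(2024, 2, 1)
print((d1 - d2).days)           # 29 (2024 is a leap year)
print(d2 + timedelta(days=29))  # 2024-03-01
print(d1.year, d1.month, d1.day, d1.weekday())  # 2024 3 1 4 (Monday is 0)
```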

Crandas

  • New dependencies: tqdm and pyformlang

  • New dependency flag: notebook, for features related to Jupyter notebooks. Use pip install crandas[notebook] to install these.

  • Dependency urllib3 is updated to ensure that ‘assert_hostname = False’ works as expected

  • Documentation updates

  • Recording or loading a new script while another script is already active no longer gives an error; a warning message is printed instead.

  • Support with_threshold for aggregation

    This adds support for e.g. table["column"].with_threshold(10).sum(). Before this change, with_threshold() was only supported for filtering operations, e.g. table[filter.with_threshold(5)], and not for aggregation operations (min, max, sum, etc.).

    Note that the alternative that worked before, table["column"].sum(threshold=5), is still supported for both aggregation and filtering operations.

    Minor change: supplying both with_threshold() and a threshold argument now raises a ValueError instead of a TypeError when these are different.

  • Implement setter for base_path

    The crandas Session object now supports setting base_path to a string, a Path, or None. Retrieving the property always returns a Path.
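
A property with this behavior could look roughly like the sketch below. The class SessionSketch is a hypothetical stand-in and not the crandas Session class; in particular, falling back to the current directory when base_path is None is an assumption made here so that the getter can always return a Path.

```python
from pathlib import Path

class SessionSketch:
    """Hypothetical stand-in for a session object with a base_path property."""

    def __init__(self):
        self._base_path = None

    @property
    def base_path(self) -> Path:
        # Assumption: an unset base_path falls back to the current directory.
        return Path(".") if self._base_path is None else self._base_path

    @base_path.setter
    def base_path(self, value):
        if value is None:
            self._base_path = None
        elif isinstance(value, (str, Path)):
            self._base_path = Path(value)
        else:
            raise TypeError("base_path must be a str, a Path, or None")

session = SessionSketch()
session.base_path = "keys/prod"  # a string is accepted ...
as_path = session.base_path      # ... and comes back as a Path
session.base_path = None
fallback = session.base_path     # still a Path
```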

  • Fix problem where calling size() on a groupby object would fail for int32 columns

  • Improved message for auto-determined bounds

    • Collect all auto_bounds warnings from a data upload into a single warning message

    • Allow to set auto_bounds globally in crandas.base.session