.. meta::
:description: Stay updated with Roseman Labs crandas changelog. Explore the latest features, fixes, and improvements for secure data analysis worldwide
:keywords: crandas, Roseman Labs, changelog
Changelog
=========
.. _engine-and-crandas-release-1.18.0:
1.18.0
------
Crandas
~~~~~~~
Changes:
- **BREAKING**: Remove deprecated ``cd.compat.merge_v1()``.
- **BREAKING APPROVAL**: In ``cd.stateobject.list_uploads()``, show
metadata for all objects by default by setting ``also_on_disk=True``
by default. Also, add a progress bar.
- **BREAKING**: Remove Crandas support for Python 3.9. Python 3.9 is
no longer maintained. The lowest version of Python supported by
Crandas is now Python 3.10.
- **BREAKING**: Update the minimum versions of the dependencies
``cryptography`` (46.0.4), ``pynacl`` (1.6.2) and ``urllib3`` (2.6.3)
because of vulnerabilities fixed in those packages.
- Calling ``cd.connect()`` now resets any active scripts before
connecting.
- **Changed behavior:** Outputs of the script while recording on dummy
data are now included by default when creating a ``.recording`` file.
The default of ``config.settings.include_outputs`` is changed to
``True``.
- Let opening an aggregated value return a Python int/float instead of
its numpy variant.
- If an error occurs in a script due to an authorization error (e.g.,
placeholder inconsistency, wrong signature), allow the step that
caused the error to be re-performed.
- Deprecate ``Session.load_authorizations``,
``Session.clear_authorizations``, and the ``print_json`` query
argument; remove support for the deprecated ``authorization_file``
query argument
New features and enhancements:
- Add support for ‘any’ aggregator on nullable columns.
- Introduce ``cd.groupby.mean`` aggregator that computes the means of a
grouping.
- Add support for the shortcut ``grouping["col"].sum()`` for
``grouping["col"].agg(cd.groupby.sum)``. Similar shortcuts are added
for the other aggregators ``max``, ``min``, ``mean`` and ``any``.
- Add support for ``cd.merge()`` on nullable key columns. Keys that are
null are not included in the result.
- Add module ``crandas.random`` with functions ``random()`` (for
generating random fixed-points in ``[0, 1)``), ``randint(a, b)`` (for
generating random integers in ``[a, b]``) and ``randbytes(n)`` (for
generating random bytestrings of ``n`` bytes).
- Add support for the syntax
``cd.get_table(dummy_handle=DUMMY_HANDLE, prod_handle=PROD_HANDLE)``
and similar for ``cd.get()``. The ``DUMMY_HANDLE`` is used when
either recording a script or when no script is active. The
``PROD_HANDLE`` is used when executing an approved script. The same
function call can be used in all cases, either when recording or
executing a script or when no script is active. The old syntax
``cd.get_table(DUMMY_HANDLE).save(dummy_for=PROD_HANDLE)`` is now
deprecated.
- Add ``cd.vec_from_columns([col1, ..., coln])`` which combines the
numeric columns ``col1``, …, ``coln`` into a single vector column.
- Add support for revocation of approvals
- Add ``drop_duplicates()`` functionality to ``crandas.DataFrame`` and
``crandas.CSeries``. Arguments and functionality are similar to the
pandas equivalent function, except that the keyword arguments
``inplace`` and ``ignore_index`` are not supported.
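The function names and ranges listed for ``crandas.random`` above match those of Python's standard ``random`` module. Assuming the range conventions carry over (an assumption, not something this changelog states beyond the intervals given), the following plain-Python sketch illustrates the half-open interval of ``random()``, the closed interval of ``randint(a, b)``, and the output length of ``randbytes(n)``. It uses only the standard library and does not call crandas.

```python
import random

# Standard-library analogue only; this does not use crandas.random.
rng = random.Random(2024)  # seeded only to make the sketch reproducible

x = rng.random()       # float in the half-open interval [0, 1)
k = rng.randint(1, 6)  # integer in the closed interval [1, 6]; both ends can occur
b = rng.randbytes(4)   # bytes object of exactly 4 bytes

print(x, k, b.hex())
```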
Fixes:
- Don’t update access time on ``cd.get_table()`` (as used in
e.g. ``cd.stateobject.list_uploads()``) so inspecting an object’s
metadata doesn’t prevent it from being purged.
- When the saved table ``tab1.save(name="name")`` is overwritten by
``tab2.save(name="name")``, the old table ``tab1`` will be moved to
the cache and not continue to be permanently saved.
.. _engine-and-crandas-release-1.17.3:
1.17.3
------
This is an Engine-only release that improves log messages.
.. _engine-and-crandas-release-1.17.2:
1.17.2
------
.. _crandas-1:
Crandas
~~~~~~~
New features and enhancements:
- Add support for Python 3.14.
.. _engine-and-crandas-release-1.17.1:
1.17.1
------
.. _crandas-2:
Crandas
~~~~~~~
Changes:
- Provide progress updates to the user while crandas is connecting to
the server.
.. _engine-and-crandas-release-1.17.0:
1.17.0
------
.. _crandas-3:
Crandas
~~~~~~~
Changes:
- **BREAKING**: Enforce that scripts are executed in linear order in
authorized mode. This functionality can be disabled (e.g. for
backward compatibility) by passing ``linear=False`` to
``cd.script.record()``.
- **BREAKING**: Enforce that approved scripts in authorized mode are
not allowed to continue after an error. This check does not apply to
stepless queries, when linear mode is disabled, or when
``cd.script.allow_errors`` is used.
- **BREAKING APPROVAL**: Fix bug where ``train_test_split`` could not
be used in script recordings if no ``random_state`` was given or if
``train_size``/``test_size`` were provided as floats. This fix breaks
all existing approvals of scripts that use ``train_test_split``. To
use such existing approvals, use
``cd.compat.model_selection.train_test_split`` instead. Note that the
specific rows that are selected as training/test data are also
different from the previous version of the function.
- Add ``cd.script.end()`` which saves a script (if it is being
recorded) or closes it (if it is being executed). The old
``cd.script.save()`` and ``cd.script.close()`` functions are now
deprecated.
- Add ``path`` parameter to ``cd.script.record()`` to specify the
location of the ``.recording`` file at the start of the recording. If
the script filename is specified here, it should not be specified
again at the end of the recording.
New features and enhancements:
- Add support for ``col.any()`` and ``col.all()`` on nullable columns.
- Add option to slice a dataset based on a fraction, e.g.,
``cd.demo_table(10, 10).slice(slice(0.1), allow_fractions=True)``.
Fixes:
- Fix bug where ``train_test_split`` would not work correctly if
``train_size`` and ``test_size`` were both provided and did not add
up to the full dataset.
- Fix bug where the ``shuffle`` option of ``train_test_split`` did not
work.
- Fix bug where R2-score in linear regression was computed incorrectly
due to wrong scaling.
- Fix bug where consistency of placeholders between multiple
invocations of ``astype`` and ``concat`` would not be enforced.
.. _engine-and-crandas-release-1.16.1:
1.16.1
------
.. _crandas-4:
Crandas
~~~~~~~
Changes:
- Fix bug where, under high load conditions, the engine nodes may get
out of sync and stop accepting new incoming queries.
- Throw an error when attempting to aggregate over a column that is
used in the group-by.
.. _engine-and-crandas-release-1.16.0:
1.16.0
------
This version of crandas will include the user’s Python source code when
recording a script, so that the user will no longer need to separately
upload their analysis source code to the platform. Another notable
feature is added support for standard errors in linear regression.
.. _crandas-5:
Crandas
~~~~~~~
Changes:
- **Changed behavior:** The Python source code is now included by
default when creating a ``.recording`` file. The default of
``config.settings.include_python_script`` is changed to ``True``.
- Combine functions ``DataFrame()``, ``LinearRegression()``, … and
classes ``CDataFrame``, ``CLinearRegression``, … into one. Instead of
``DataFrame(...)`` returning an object of type ``CDataFrame``, it now
returns an object of type ``DataFrame``, etcetera. The original names
``CDataFrame``, ``CLinearRegression``, … are kept as aliases for
``DataFrame``, ``LinearRegression``, …
- Deprecate the ``name`` and ``dummy_for`` query arguments. To assign a
name or dummy data mapping to an object ``obj``, please use
``obj.save(name=...)`` or ``obj.save(dummy_for=...)`` instead of
directly supplying the name or dummy_for argument to the function
that constructs ``obj``. For example, when uploading a table, use
``cd.upload_pandas_dataframe(df).save(name=...)`` instead of
``cd.upload_pandas_dataframe(df, name=...)``. This is to circumvent
the problem that, when using the deprecated form, the name is
overwritten before the upload succeeds.
- **BREAKING** Fix a bug where ``CSeries.isin`` would exceed the
recursion limit when called with collections containing about 1000
items or more. This fix breaks existing authorizations.
New features and enhancements:
- Add support for ``standard_error`` in linear regression for each
estimated coefficient. Opening a linear regression model now also
provides the standard error of each estimated coefficient:
.. code:: python
from crandas.crlearn.linear_model import LinearRegression
model = LinearRegression()
model = model.fit(cd.demo_table(10,1), cd.demo_table(10))
model.open()["standard_error_"]
- Add support for ``tab[name] = value`` syntax. For example,
``tab["name"] = tab["name"].lower()`` is now possible.
- Add type hints to return types, making autocomplete work in most
editors.
- Add the ``ascending`` argument to ``CDataFrame.sort_values()``, to
allow sorting on a column in descending order:
.. code:: python
table = cd.DataFrame({"a": [3, 1, 4, 5, 2], "b": [1, 2, 3, 4, 5]}, auto_bounds=True)
# Sorts by column `a` in descending order
sorted_table = table.sort_values("a", ascending=False)
- Add support for ``col.min()`` and ``col.max()`` on nullable columns.
- Add support for ``col.std()``.
- Add support for boolean columns (previously, these were stored as
integer columns).
- Add support for ``cdf.set_axis`` to change column names.
- Let ``.save()`` function of ``DataFrame``, ``LogisticRegression``,
etc. return the object itself, allowing syntax like
``cd.DataFrame(...).save("name").open()``
- When fitting a linear regression model ``model``, crandas now detects
whether an all-zero feature column has been provided. Such a column
makes the fit singular, which can cause significant numerical
instability and unreliable results. The outcome of this check is
available as the boolean ``model.singular_``.
- Validate string to int conversion. When the input contains invalid
values, an error is raised.
- Provide clearer error messages when using the result of a cancelled
upload/computation; and when applying an operation to an object in a
faulty state.
Fixes:
- Fix engine crash in exceptional circumstances (e.g. triggered by
error handling in complex transaction queries).
- Fix bug that prevented ``tab.astype()`` and ``tab.fillna()`` from
working when the table contained a column with the name ``"name"``.
- Fix bug where parsing/out of memory errors would sometimes be
reported by the engine as an “Internal server error”.
- Fix bug where some validated conversions using ``.astype()`` could
lead to an “Internal server error”.
.. _engine-release-1.15.3:
1.15.3
------
.. _crandas-release-1.15.2:
1.15.2
------
This is a crandas-only bugfix release.
.. _crandas-6:
Crandas
~~~~~~~
- Fix bug where, when connected to the engine in authorized mode,
script recordings would be performed in design mode. Performing a script
recording when connected in authorized mode is no longer possible
(unless performing a dry run).
.. _engine-and-crandas-release-1.15.1:
1.15.1
------
This is a crandas-only bugfix release.
.. _crandas-7:
Crandas
~~~~~~~
- Fix bug ``ModuleNotFoundError: No module named 'IPython'`` by making
the import optional.
.. _engine-and-crandas-release-1.15.0:
1.15.0
------
This release contains performance improvements, various usability
improvements, and some improvements to the script recording. For the
script recording, crandas now automatically records the Python commands
that are used in the analyses, as well as the outputs on dummy data.
See the full changelog below.
.. _crandas-8:
Crandas
~~~~~~~
- Add support for ``subset`` parameter in ``dropna``.
- Add support for placeholders in (linear regression, random forest)
model parameters.
- Add support for ``series.isin``.
- Add support for ``.where``, ``.mask`` as an alternative to
``.if_else``.
- Add support for encoding Unicode strings as bytes using UTF-8.
- Add support for ``ddof`` parameter in ``col.var()``.
- Add support for Polars-style ``.when().then()`` conditionals.
- Add Pearson correlation (matrix) for CSeries and CDataFrames.
- Add ``MinMaxScaler`` that replaces ``min_max_normalize`` and that
can, after fitting, be applied multiple times.
- Improved support for NULL values in ``if_else``, including using
None/pd.NA as if/else clause.
- Extend the range of supported values for ``col.sqrt()``.
- When making a recording, crandas can now automatically include the
Python script used for the analysis in the ``.recording`` file. This
is supported when using Jupyter Notebook or when directly executing a
Python file. A future version of the platform will include this
Python script in the approval flow. By default, all code, including
comments, between ``script.record()`` and ``script.save()`` is
included. To enable this functionality, use
``script.record(include_python_script=True)`` or
``config.settings.include_python_script = True``. To include the full
Python script, additionally use
``script.record(include_full_script=True)`` or
``config.settings.include_full_script = True``.
- When making a recording, crandas can now automatically include the
outputs revealed by the recorded script on the design data in the
``.recording`` file. Similar outputs are revealed when the approved
script is executed on production data. A future version of the
platform will include these outputs in the approval flow. To enable
this functionality, use ``script.record(include_outputs=True)`` or
``config.settings.include_outputs = True``.
- Allow applying ``if_else`` to DataFrames, e.g.,
``cdf["a"].if_else(cdf, cdf2)``.
- Major performance improvements to linear regression. Its performance
is improved by around a factor 25.
- Only show relevant digits for fixed-point bounds.
- Improve performance of ``sort_values`` and ``groupby`` by around a
factor 2.
- **BREAKING** When a placeholder is used at multiple places in a
script, it is now ensured that the same value is used in every place.
- Provide new API for logistic regression models:
- Logistic regressions now adhere to the ``CModel`` API, allowing
among other things to upload/download fitted models, change
parameters, re-fit, and use placeholders.
- **BREAKING** The output column names of ``.predict()``,
``.predict_proba()`` have changed for consistency with other
types.
- Some functions and arguments have been moved or deprecated; see
the deprecation warnings that are given.
- The original API remains accessible via
``crandas.compat.logistic_regression``.
- Make progress reporting work for all type
(binomial/ordinal/multinomial) + optimizer (LBFGS/…) combinations.
- Let a function on a value result in a value instead of a series.
- Fix bug in ``tab.shuffle()`` that dropped the nullability information
when an explicit ``random_state`` was given.
- Fix bug where in some cases the fixed-point value exactly equal to
the provided maximum bound could not be uploaded.
- Fix bug in ``astype`` when converting fixedpoint columns with
non-default precision to a fixedpoint column type without specifying
the precision. Now specifying a precision is mandatory in these
cases.
- The fractional column ctype is removed. When computing a ``mean`` or
``var`` on a column, the result now has ctype ``fp`` when
``mode="regular"``. This means that secure computations can now also
be performed on the secret return value. For example, the following
is supported:
.. code:: python
df = cd.DataFrame({"x": [1,2,3,4,5]})
y = df["x"].var(mode='regular').sqrt()
y.open()
- Fix bug where crandas could not detect the type of a column with
NULLs and non-unique index values.
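The ``ddof`` parameter of ``col.var()`` mentioned in this release is not described further here. Assuming it has the usual NumPy/pandas meaning of "delta degrees of freedom" (an assumption), the variance is the sum of squared deviations divided by ``n - ddof``. The plain-Python sketch below shows the effect on the column ``[1, 2, 3, 4, 5]`` used in the example above; the helper ``variance`` is hypothetical and not part of crandas.

```python
def variance(values, ddof=0):
    """Variance with a divisor of n - ddof (hypothetical helper, plain Python)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return squared_deviations / (n - ddof)


xs = [1, 2, 3, 4, 5]
population_variance = variance(xs, ddof=0)  # divides by n = 5
sample_variance = variance(xs, ddof=1)      # divides by n - 1 = 4

print(population_variance, sample_variance)  # 2.0 2.5
```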
.. _engine-and-crandas-release-1.14.3:
1.14.3
------
This release adds a mechanism for the Engine to recover from some
faulty on-disk persistence states that could arise from large table
uploads in versions <= 1.14.1.
.. _engine-and-crandas-release-1.14.2:
1.14.2
------
.. _crandas-9:
Crandas
~~~~~~~
- Add progress reporting for uploads and downloads.
- Fix an issue with uploading a table. Now, instead of using a single
POST request, use several requests lasting up to
``settings.http_write_timeout`` seconds (default: 30). This mitigates
the problem that proxy servers (including the one used by Roseman
Labs) commonly set a timeout of 60 seconds on such requests causing
the connection to be killed.
- Fix performance issue when uploading a large date column.
.. _engine-and-crandas-release-1.14.1:
1.14.1
------
.. _engine-and-crandas-release-1.14.0:
1.14.0
------
.. _crandas-10:
Crandas
~~~~~~~
New features
^^^^^^^^^^^^
- Major performance improvements to fixed point division. Its
performance is improved by around a factor 10.
- Add support for random forests (see
``crandas.crlearn.ensemble.RandomForestClassifier``).
- Add conversion from non-nullable column types to nullable column
types.
- Add save mode to ``StateObject.save()`` to only save an object if it
does or does not exist.
- Add progress reporting for unsafe many-to-many join.
- Add support for groupby on nullable columns when using
``dropna = False``.
- Add function ``cd.get()`` to retrieve an object from the engine
regardless of its type (to be used in place of ``cd.get_table``,
etc.).
- Add ``col.any()`` and ``col.all()`` functions for boolean columns.
- Add compatibility with NumPy versions 1 and 2
- Add support for conversion of fixedpoint column types to fixedpoint
column types with lower precision.
- Add support for conversion of fixedpoint column types to integer
column types.
- Add support for non-string column names and constructing a crandas
dataframe from a numpy array.
- Add more powerful interface for machine learning models, including
linear regression models, via the ``CModel`` class.
Changes
^^^^^^^
- Allow the use of ``**query_args`` in ``obj.remove()``.
- Report size of data in CDataFrame in its Jupyter/command-line
representation.
- Improve dry-run mode such that table uploads and downloads can be
simulated without needing a connection to the engine.
- **BREAKING** Inner product function ``col1.inner(col2)`` requires
that both ``col1`` and ``col2`` have equal elements_per_value.
- **BREAKING** The inner product function on vectors
``col1.inner(col2)`` (e.g., ctypes ``int_vec`` or ``fp_vec``) now
requires that both ``col1`` and ``col2`` have vectors of equal
length, and otherwise raises an error. The old behavior was to have
an inner product of the first *n* entries of both vectors, where *n*
is the smallest of the two vector lengths.
- Remove ``ClientError`` from crandas; instead, raise a relevant
Python error or the new ``ConnectionFileError`` or ``ModelError``.
- **BREAKING** Change ``ServerError`` to ``EngineError``, adding new
error codes and error structure.
- When the computation of the variance in ``tab.describe()`` results in
an overflow, return ``NaN`` instead of raising an exception.
- Column information from ``table.columns`` is now better styled and
reports data sizes per column.
- Remove ``crandas.get2()`` that was only used for testing.
- Remove support for slices with negative step sizes, which could give
incorrect results.
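The change to ``col1.inner(col2)`` on vectors of unequal length, described in the list above, can be illustrated in plain Python. This is a minimal sketch of the two behaviors on ordinary lists, not crandas code: ``inner_old`` and ``inner_new`` are hypothetical names, and ``ValueError`` is a stand-in because the changelog does not say which error crandas raises.

```python
def inner_old(u, v):
    """Old behavior: use only the first n entries, n = min(len(u), len(v))."""
    # zip() stops at the shorter sequence, which silently truncates the longer one.
    return sum(a * b for a, b in zip(u, v))


def inner_new(u, v):
    """New behavior: reject vectors of unequal length."""
    if len(u) != len(v):
        # Stand-in error type; the actual crandas error is not specified here.
        raise ValueError(f"vector lengths differ: {len(u)} != {len(v)}")
    return sum(a * b for a, b in zip(u, v))


print(inner_old([1, 2, 3], [4, 5]))     # 14: the trailing 3 is ignored
print(inner_new([1, 2, 3], [4, 5, 6]))  # 32
```

The old behavior could hide a data-preparation mistake, since the truncated result looks like a valid inner product.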
Fixes
^^^^^
- Fix parsing of fixed-point bounds of the form ``fp[min=0,max=1e+20]``
(with a ``+`` in the exponent) which were incorrectly rejected.
- Fix performance issues of ``col.min()`` and ``col.max()``.
- Fix bug where the repr/Jupyter representation of tables with unknown
structure (e.g., from running ``get_table`` in a transaction or in
dry-run mode) would throw an error.
- Fix bug where, in dry-run mode, the length of the randomly generated
handles would be incorrect.
- Fix bug where, if a placeholder was used in a script execution but no
placeholder was used in the script recording, mismatches would not be
detected.
- Fix equality check between a ReturnValue and a constant like
``col.sum(mode="regular") == 0``. The ``CModel`` interface allows
uploading and downloading models, retrieving and setting their
parameters, and retrieving their public attributes.
- Fix problem where performing a concat on a very large number of
tables could result in I/O errors.
- Fix incorrect slices on bytes columns in some cases.
- Fix problem preventing uploading large nullable integers.
- Fix bug where ``print_json`` query argument did not work outside of
script recording.
- Fix progress reporting for logistic regression.
- Fix bug in ``function_to_json`` to compute correct ctype bounds.
.. _engine-and-crandas-release-1.13.0:
1.13.0
------
.. _crandas-11:
Crandas
~~~~~~~
- Improve progress reporting for multi-column groupby and sort_values.
- Add integer to string conversion using ``int_col.astype("varchar")``.
- ``cd.read_parquet`` now accepts keyword arguments that are passed
through to ``pyarrow.Table.to_pandas``/``ParquetFile``
- Add functions ``sum_squares()``, ``mean()``, ``var()``, ``count()``,
``min()``, ``max()`` to work on the result of series functions. For
example, the following now works:
.. code:: python
table = cd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, auto_bounds=True)
y = (table["a"] * table["b"]).mean()
Previously, these functions only worked on direct columns of a DataFrame.
Note all of these functions take a ``mode`` keyword argument, which
defaults to ``"open"`` (i.e. open the resulting value). To keep the
result secret-shared, specify ``mode="regular"``.
**BREAKING**: the ``sum()`` function was already implemented, but its
``mode`` argument is now keyword-only instead of positional. That is,
``(table["a"] * table["b"]).sum("regular")`` now needs to be
specified as ``(table["a"] * table["b"]).sum(mode="regular")``.
- Fix a problem where, in some Python configurations (e.g., from the
Spyder Console), the time counter of the progress bar could stall
- If while running a query, crandas loses connection to the server, it
will retry instead of immediately raising an error (see
``wait_for_reconnect`` configuration option)
- Make information shown in the progress bar cleaner
- **BREAKING**: Allow a script to continue running after a
``NoMatchingAuthorizationError``. Previously, in case of a mismatch
during the use of a recorded script, it was not possible to fix the
mismatch and continue using the script. This was because crandas
moved on to try the next step of the script, even if a step failed.
Now, crandas will keep trying the authorization for the failed step.
It is still possible to manually skip a script step by using
``cd.script.current_script().skip()``.
- Crandas can now connect to the same VDL server instance either in
authorized or in design mode.
- The mode is typically specified by the connection file and can be
retrieved using ``cd.base.session.mode``.
- A warning is now given when connecting in design mode. This
warning can be disabled using the ``warn_design`` setting.
- **BREAKING**: ``crandas.tool.package_connection_file`` script now
needs ``--design``, ``--authorized``, or ``--no-mode`` to indicate
which mode to use
- Add unsafe many-to-many join ``cd.unsafe.merge_m2m``. This join
function is unsafe because it leaks statistical information to the
servers; see documentation for details. Use with care.
- Removed support for ``bitlength`` query argument (deprecated since
version 1.4.0)
- When using crandas in Jupyter, outputting a CDataFrame now is better
styled using HTML
- Change the recommended extension of a script recording from ``.json``
to ``.recording``.
- Fix ``predict_proba`` to work on small numbers of rows (smaller than
the number of features)
- Fix ``predict`` and ``predict_proba`` functionality for ordinal
logistic regression. Previously these functions could return
incorrect results for this regression type.
- Major improvements to fixed point multiplication. Its performance is
improved by around a factor 10 with network usage similar to that of
integer multiplication.
- Fix bug where, if no connection file is found, crandas would give an
error message saying that “there are multiple connection files
present”
- Provide more detailed error messages for missing authorization,
including a high-level hint for some frequently occurring mismatches
- **BREAKING**: The crandas.experimental.stats module has been marked
as stable and was moved to crandas.stats. In addition:
- The ``stats.rankdata()`` function was added
- The ``stats.tiecorrect()`` function was added
- The output of ``stats.chi2_contingency()`` was extended to also
provide the field ``expected_freq``, which contains the expected
frequencies of the observations. The structure remains
backwards-compatible otherwise. In addition, accuracy was
improved.
- The Kruskal-Wallis test ``stats.kruskal()`` was generalized to
properly handle duplicates. The option ``allow_duplicates`` has
been removed since it is no longer applicable. The function
performs stricter input validation to ensure the test is
well-defined.
- The ``stats.chisquare()`` and ``stats.chi2_contingency()``
functions now perform stricter input validation to ensure the test
is well-defined and can be evaluated accurately.
- The ``stats.contingency.expected_freq()`` function performs
stricter input validation to ensure it can be computed accurately.
In addition, accuracy was improved.
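The ``expected_freq`` field added to the output of ``stats.chi2_contingency()`` holds the expected frequencies of a contingency table. As a reminder of what those are, the sketch below computes them in plain Python using the standard definition under independence (row total times column total, divided by the grand total). It is an illustration on cleartext data, not the crandas implementation, and the helper name ``expected_frequencies`` is hypothetical.

```python
def expected_frequencies(observed):
    """Expected cell counts of a 2-D contingency table under independence."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


observed = [[10, 20],
            [30, 40]]
expected = expected_frequencies(observed)

print(expected)  # [[12.0, 18.0], [28.0, 42.0]]
```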
.. _engine-and-crandas-release-1.12.3:
1.12.3
------
.. _engine-and-crandas-release-1.12.2:
1.12.2
------
.. _crandas-12:
Crandas
~~~~~~~
- Fix a bug, where reducing the bounds (e.g., using ``astype``) breaks
the server when it tries to switch to a smaller internal
representation.
- Allow fixedpoint ctype specification to define min and max as floats.
For example:
.. code:: python
table = cd.DataFrame({"vals": [0.75, 1.25, 2.25]}, ctype={"vals": "fp[min=0.5,max=2.5]"})
- **BREAKING**: Statistics functionality is moved to the
``cd.experimental`` namespace because of potential inaccuracies for
large inputs
- Add ``cd.experimental.stats.contingency.crosstab`` for computing
contingency tables
- Add ``crandas.CSeries.as_series`` to obtain a ``ReturnValue`` for the
series
- Add ``crandas.experimental.stats.chi2_contingency`` for performing a
chi-square test on a contingency table
- Add ``crandas.experimental.stats.contingency.expected_freq`` for
computing expected frequencies
- Fix ``crandas.experimental.stats.ttest_ind()`` to work with unequal
variances for larger data
- Fix ``crandas.experimental.stats.chisquare`` to provide better error
message when input contains zeros
- Added the ``crandas.compat``, ``crandas.crypto`` and
``crandas.experimental.stats`` packages in pyproject.toml via
automatic package discovery.
.. _engine-and-crandas-release-1.12.1:
1.12.1
------
.. _crandas-13:
Crandas
~~~~~~~
- Fix a bug in fixed-point multiplication; in certain cases this bug
caused incorrect fixed-point multiplication results.
- Fix an issue related to the printing of the min and max value of a
fixed-point ctype
- Fix bug where, in some cases, a schema mismatch between design and
production would cause an exception rather than a user-friendly error
message
.. _engine-and-crandas-release-1.12.0:
1.12.0
------
.. _crandas-14:
Crandas
~~~~~~~
- Major performance improvements for comparisons and related operations
like ``sort_values`` and ``groupby``. The performance of these
operations is improved by around a factor 10.
- Fix bugs in ``substitute`` in the following edge-cases:
- Fix crash when the ``output_size`` parameter is set too high.
- Fix potential invalid substitutions when Unicode substitutions are
applied to ASCII strings.
- Add support for fixed-point arrays, including array functions
``inner`` and ``vsum``.
- Add support for ``cd.cut`` on fixed-point values, bins and labels.
For example:
.. code:: python
table = cd.DataFrame({"vals":[1.1, 2.1, 2.2, 3.3]}, ctype={"vals":"fp[precision=30]"})
table = table.assign(cut=cd.cut(table["vals"], [1.5,2.5], labels=[0.1,0.2,0.3], add_inf=True))
- Fix bug which prevented uploading bytes columns with more than 100
000 values.
- Add support for Base64 encoding and decoding:
``bytes_col.b64encode()`` and ``string_col.b64decode()``:
.. code:: python
table = cd.DataFrame({"base64": ["TWFu", "bGlnaHQgd29yaw=="]}, auto_bounds=True)
table = table.assign(bytes=table["base64"].b64decode()) # table["bytes"] set to [b"Man", b"light work"]
table["bytes"].b64encode() # equal to table["base64"]
- Add support for performing a groupby on more than 100 000 unique
values.
- Add support for computing the square root ``cdf["col"].sqrt()``
- Add module ``crandas.stats`` for performing a number of statistical
tests
- ``override_version_check`` can now be supplied to the Session
constructor, so
``cd.connect("connection-file", override_version_check=False)`` now
works. Additionally, the ``CRANDAS_OVERRIDE_VERSION_CHECK``
environment variable was moved to the ``cd.config`` framework.
- Support ``mode`` and other query arguments for ``CSeries.sum()``,
e.g. so that the following now works without opening values:
.. code:: python
table = cd.DataFrame({"values": [1, 2, 3]})
x = (table["values"] ** 2).sum(mode="regular")
# x is not opened
- Improve support for using multiple sessions at the same time with the
``session`` query argument and the use of a session as a context
manager.
**BREAKING**: A check is introduced to ensure that objects from
different sessions are not mixed. (Previously, if multiple sessions
to the same endpoint were used, it was possible to mix objects.)
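The Base64 values quoted in the ``b64decode()``/``b64encode()`` example of this release can be checked on cleartext data with Python's standard ``base64`` module. The sketch below does only that; it does not use crandas and says nothing about how the secure versions are computed.

```python
import base64

encoded = ["TWFu", "bGlnaHQgd29yaw=="]

# Decode each Base64 string to raw bytes.
decoded = [base64.b64decode(s) for s in encoded]

# Encoding the bytes again gives back the original strings.
reencoded = [base64.b64encode(b).decode("ascii") for b in decoded]

print(decoded)    # [b'Man', b'light work']
print(reencoded)  # ['TWFu', 'bGlnaHQgd29yaw==']
```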
.. _engine-and-crandas-release-1.11.0:
1.11.0
------
.. _crandas-15:
Crandas
~~~~~~~
- Preserve len(df) of pandas DataFrames without columns
- Support for the concatenation of strings. For example:
.. code:: python
table = cd.DataFrame({"first_name": ["John", "Jan"], "last_name": ["Doe", "Jansen"]}, auto_bounds=True)
full_names = table["first_name"] + " " + table["last_name"]
full_names.open()
- Add support for ``upper()``, similar to the existing ``lower()``. In
addition, it is now also possible to only change the case of specific
indices:
.. code:: python
table = cd.DataFrame({"name": ["john", "Jansen"]}, auto_bounds=True)
table["name"].upper([0]).open() # Returns ["John", "Jansen"]
table["name"].upper([1, 3, 5]).open() # Returns ["jOhN", "JAnSeN"]
- ``crandas.tool.check_connection`` now cleans up tables after running
a dummy computation
- Perform parquet uploads through ``read_parquet``
- Improved implementation of ``cd.merge``:
- Add support for right and one-to-many joins. Now, any combination
of left/right/inner/outer one-to-one/one-to-many/many-to-one joins
is supported.
- Improved support for nullable columns. When performing a non-inner
join, now nullable columns are output (previously, a non-nullable
column with the value zero was returned in some cases)
- Deal with key columns with different column names consistently with
pandas, e.g., when joining with ``left_on="a"`` and
``right_on="b"``, return two columns with the left and right key
values, respectively
- Add support for suffixes
- Improved progress reporting for one-to-many joins
- Allow trivial grouping-based join where the right table only has
key columns
**BREAKING**: The signature of the ``cd.merge`` function has not
changed, but, because of the above changes, the resulting table may
have a different set of columns and/or columns of different types
than in previous versions. Moreover, the underlying VDL API command
has been renamed and changed internally so that existing
authorizations do not apply to the new merge. The old merge is
available via crandas as ``cd.compat.merge_v1``.
- Fix bug in bytes column when using non-ASCII characters. Opening such
values could give incorrect results.
- Add support for (fixed-point) division (``/``). For example:
.. code:: python
table = cd.DataFrame({"num": [1.2, 3.4, 5.6], "denom": [2.1, 4.3, 5.4]}, auto_bounds=True)
table = table.assign(div = lambda x: x.num / x.denom)
table = table.assign(rec_num = lambda x: 1 / x.num) # computing the reciprocal of num
table.open()
- Add support for ``strip()`` on string columns, which will remove the
leading and trailing spaces.
- Add support for floor_division (``//``).
- Add support for RIPEMD-160:
.. code:: python
import crandas.crypto.hash as hash
tab = cd.DataFrame({"a": [b"Test 1", b"Test 2"]}, auto_bounds=True)
h = hash.RIPEMD_160
h.digest(tab["a"]).open()
# HMAC is also supported
tab_key = cd.DataFrame({"key": [bytes.fromhex("0123456789abcdef0123456789abcdef01234567")]}, auto_bounds=True)
hmac = hash.HMAC_RIPEMD_160(tab_key["key"].as_value())
hmac.digest(tab["a"]).open()
- Add support for AES encryption:
.. code:: python
import crandas.crypto.cipher as cipher
table = cd.DataFrame({"a": [bytes.fromhex("00112233445566778899aabbccddeeff")] * 2}, ctype={"a": "bytes[16]"})
tab_key = cd.DataFrame({"key": [bytes.fromhex("000102030405060708090a0b0c0d0e0f")]}, ctype={"key": "bytes[16]"})
aes_128 = cipher.AES_128(tab_key["key"].as_value())
aes_128.encrypt(table["a"]).open()
- Add support for ``len`` and slicing on bytes columns:
``bytes_col.len()`` and ``bytes_col[16:32]``
- Add support for conversion to and from lowercase hex strings:
``string_col.hex_to_bytes()`` and ``bytes_col.bytes_to_hex()``
- Add support for encoding ASCII strings as bytes:
``ascii_col.encode()``
- Add bitwise operations on bytes columns: AND (``&``), OR (``|``), XOR
(``^``), NEGATE (``~``)
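For reference, the per-value meaning of the hex conversion, ASCII encoding, and bitwise operations listed above can be sketched with plain Python ``bytes``. This is a plaintext analogue only, not crandas code: the helper names are hypothetical, and requiring equal-length operands for the binary operations is an assumption of this sketch.

```python
# Plaintext analogue of the per-value bytes operations (not crandas code).

def _check_same_length(a: bytes, b: bytes) -> None:
    # Assumption of this sketch: binary operations need equal-length operands.
    if len(a) != len(b):
        raise ValueError("operands must have the same length")

def bytes_and(a: bytes, b: bytes) -> bytes:
    _check_same_length(a, b)
    return bytes(x & y for x, y in zip(a, b))

def bytes_or(a: bytes, b: bytes) -> bytes:
    _check_same_length(a, b)
    return bytes(x | y for x, y in zip(a, b))

def bytes_xor(a: bytes, b: bytes) -> bytes:
    _check_same_length(a, b)
    return bytes(x ^ y for x, y in zip(a, b))

def bytes_not(a: bytes) -> bytes:
    # Flip every bit of every byte.
    return bytes(x ^ 0xFF for x in a)

raw = bytes.fromhex("00ff10")    # lowercase hex string -> bytes
hex_again = raw.hex()            # bytes -> lowercase hex string
encoded = "abc".encode("ascii")  # ASCII string -> bytes
length = len(raw)                # number of bytes
tail = raw[1:3]                  # slicing selects a byte range

a = bytes.fromhex("f0f0")
b = bytes.fromhex("0ff0")
print(bytes_and(a, b).hex(), bytes_or(a, b).hex(),
      bytes_xor(a, b).hex(), bytes_not(a).hex())
```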
- Add support for string substitution:
.. code:: python
table = cd.DataFrame({"a": ["PÆR", "á"]}, auto_bounds=True)
table["a"].substitute({"a": ["á", "à", "ä"], "AE": ["Æ"]}, output_size=4).open()
- Add support for filtering characters:
.. code:: python
table = cd.DataFrame({"a": ["Test string", "More"]}, auto_bounds=True)
table["a"].filter_chars(["a", "e", "i", "o", "u"]).open()
- Add support for reading stored analyst, approver, and server keys in
PEM format
- Fix bug where uploading a series with only NULL values would give an
error
- Fix bug where ``repr(cdf)``, ``str(cdf)`` would not deal correctly
with zero-row dataframes
- We move ``auto_bounds`` from a ``Session`` property to be a
configuration variable (using
``crandas.config.settings.auto_bounds``). Having this variable set to
True suppresses data-derived column bound warnings by default. Each
session object now has a deprecated ``auto_bounds`` property that
gets/sets the configuration variable.
**BREAKING:** This breaks the possibility of a user having two
concurrent sessions with different ``auto_bounds`` values set.
- Allow passing ``pd.read_csv`` arguments (e.g., ``delimiter``) as
arguments to ``cd.read_csv``
- Warn user when calling crandas in a conditional context (e.g., an
``if`` statement) during script recording. See documentation of the
``crandas.check_recording`` module for details.
- Warn users to specify a ``validate`` argument when using ``merge``
during script recording. See documentation of the
``crandas.check_recording`` module for details.
- Allow specifying ``ctype`` as an argument to ``cd.Series``
- Expose ``ctype`` and ``schema`` of a column as properties of the
classes ``Col`` (``cdf.columns.cols[ix]``) and ``CSeriesColRef``
(``cdf["col"]``); and of a ``CDataFrame``
- Add function ``CDataFrame.astype()`` that converts the type of
individual columns (via the ``ctype`` parameter) or the full CDataFrame
(via the ``schema`` parameter)
- Add a ``schema`` parameter to the ``upload_pandas_dataframe``,
``read_csv``, ``DataFrame``, and ``read_parquet`` functions. For the
``ctype`` parameter, warn if the corresponding column does not exist
- Add functions ``pandas_dataframe_schema`` and ``read_csv_schema``
that return the schema corresponding to a DataFrame or CSV file
- A server-side schema check for ``get_table`` is introduced. When
``get_table`` is used in a script, the schema of the resulting table is
stored in the recorded script. When the script is used, a server-side
check for adherence to the schema is performed.
**BREAKING**: using ``get_table`` in a script where the tables do not
match between recording and using the script now produces an error;
see documentation of ``get_table`` for details
- Add tilde expansion to ``cd.base.Session.connect()``
- Improved error messages for: using ``get_table`` on a non-dummy
handle in script recording; invalid arguments to ``cut``,
e.g. non-integer bins or labels; sending unauthorized queries where
authorization is needed; invalid ``how`` argument to ``merge``; use
of ``None``-like values in functions (e.g., ``x.if_else(y, None)``);
use of unknown ctypes (e.g., ``ctype={"a": "str"}``); uploading
fixed-point columns where integers may be intended (e.g., uploading
``pd.Series([1, 2, None])``)
- Fix bug where the use of a value placeholder (e.g.,
``cdf.assign(b=lambda x: x.a + cd.placeholders.Any(1))``) would in
many cases not work
.. _engine-and-crandas-release-1.10.2:
1.10.2
------
This is a bugfix release.
.. _crandas-16:
Crandas
~~~~~~~
- We update the pyformlang dependency to fix bugs in character ranges
.. _engine-and-crandas-release-1.10.1:
1.10.1
------
This is a bugfix release.
.. _crandas-17:
Crandas
~~~~~~~
- We give better errors when receiving unexpected responses from the
server.
- We fix a performance regression of the groupby operation, when it is
performed on a single F64 column. It is now again as fast as in
version 1.9.
.. _engine-and-crandas-release-1.10.0:
1.10.0
------
The major new feature is expanded support for fixed-point columns.
.. _crandas-18:
Crandas
~~~~~~~
- Expanded support for fixed-point columns:
- Fixed point columns now support larger range and precision (96
bits).
- Fixed point columns now support various statistical functions
(``min()``, ``max()``, ``sum()``, ``sum_squares()``,
``mean()``, ``var()``).
- Support for arithmetic operations between two fixed point columns,
and between fixed-point and integer columns is added. (NB: we do
not yet support division; this will be added in a later release.)
- Support for concatenation of integer and fixed point columns
(resulting in a fixed-point column) is added.
- Support for join and filtering on fixed point columns is added.
- Floats are now parsed when used in column operations such as
filters or ``assign``.
- The new ``dropna`` function removes rows with any missing values from
a CDataFrame.
- The new ``save`` function can be used to save an object such as a CDataFrame.
If persistence is enabled on the server, this means that the object
is kept across server restarts. The ``save`` command may also be used
to attach a *name* to a computed table,
e.g. ``table.save(name="my_table")``.
- The connection file and ``Session`` now both have an optional
``api_token`` property. This is sent to the server and may be used
for authentication purposes.
- The functions ``obj.remove()`` and ``cd.remove_objects()`` have been
changed to provide more information in case non-existent object(s)
are removed.
- Support for division is added.
**BREAKING**: when removing multiple objects using
``cd.remove_objects(lst)``, the new behavior is to try to remove all
objects even if errors are encountered. The old behavior was to abort
on the first error. See the documentation for details.
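The difference between the old and new removal behavior can be sketched in plain Python. This is an illustration only: ``remove_one``, ``remove_all``, and the in-memory ``store`` are hypothetical stand-ins, not the crandas implementation, and the way crandas reports the collected errors is not shown here.

```python
# Hypothetical in-memory stand-in for objects stored on the server.
store = {"h1": "table 1", "h3": "table 3"}  # note: "h2" does not exist

def remove_one(handle: str) -> None:
    # Removing a non-existent object is an error.
    if handle not in store:
        raise KeyError(handle)
    del store[handle]

def remove_all(handles):
    """Try to remove every handle; return the handles that failed.

    The old behavior would correspond to stopping at the first failure.
    """
    failed = []
    for handle in handles:
        try:
            remove_one(handle)
        except KeyError:
            failed.append(handle)  # remember the failure and continue
    return failed

failed = remove_all(["h1", "h2", "h3"])
print(failed, store)
```

In this sketch ``h3`` is still removed even though ``h2`` fails, which is the point of the new behavior.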
.. _engine-and-crandas-release-1.9.2:
1.9.2
-----
.. _engine-and-crandas-release-1.9.1:
1.9.1
-----
.. _crandas-19:
Crandas
~~~~~~~
No changes.
.. _engine-and-crandas-release-1.9.0:
1.9.0
-----
.. _crandas-20:
Crandas
~~~~~~~
- The ``Session`` object now has two settings modes, depending on
whether an engine connection file is used (recommended method), or
whether the endpoint, certificate, and server public keys are
specified manually (legacy method). These are reflected in the
``settings_mode`` attribute of the ``Session`` object.
When ``endpoint`` is set by the user, the ``Session`` is set to
legacy mode; otherwise, the connection file method is assumed. When
the user does not configure anything, the default is to load the
``default.vdlconn`` file, residing in the configuration folder
(default: ``~/.config/crandas``, overridable by the ``CRANDAS_HOME``
environment variable). The name ``default.vdlconn`` can be overridden
through the ``default_connection_file`` variable. If that file is not
present, scan the configuration folder for files with the extension
``.vdlconn``. If there is a single file, use that. If there are
multiple, raise an error.
``analyst_key`` is now a read-write property that returns the nacl
SigningKey, and can be set to either a SigningKey, a filename, a
path, or None. When set to None, the default key will be loaded. Both
the default key file and the default relative path depend on the
settings mode. In connection file mode, the default key file is
``analyst.sk``; a path (a Path, or a string that includes a slash “/”)
is resolved relative to the current working directory, while a
filename (a string that does not include a slash) is assumed to reside
in the configuration folder. In legacy mode, the default key file is
``clientsign.sk`` and the default relative path is the base_path (to
maintain backwards compatibility).
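The lookup order for the connection file described in this item can be sketched as follows. This is a minimal illustration using only the standard library: ``resolve_connection_file`` is a hypothetical helper, not a crandas function, and the handling of an empty folder is an assumption because the text above does not describe that case.

```python
import tempfile
from pathlib import Path

def resolve_connection_file(config_dir: Path,
                            default_name: str = "default.vdlconn") -> Path:
    """Pick a connection file following the order described above."""
    default = config_dir / default_name
    if default.is_file():
        return default                      # 1. the default file wins
    candidates = sorted(config_dir.glob("*.vdlconn"))
    if len(candidates) == 1:
        return candidates[0]                # 2. a single candidate is used
    if not candidates:
        # Assumption: behavior for zero candidates is not specified above.
        raise FileNotFoundError(f"no .vdlconn file in {config_dir}")
    raise RuntimeError("multiple .vdlconn files found")  # 3. ambiguous

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "lab.vdlconn").touch()
    picked_single = resolve_connection_file(folder).name

    (folder / "default.vdlconn").touch()
    picked_default = resolve_connection_file(folder).name

    (folder / "default.vdlconn").unlink()
    (folder / "other.vdlconn").touch()
    try:
        resolve_connection_file(folder)
        ambiguous_raised = False
    except RuntimeError:
        ambiguous_raised = True

print(picked_single, picked_default, ambiguous_raised)
```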
- Besides the ``Session`` object, which is used to configure the
connection to the engine, we introduce
`Dynaconf `__ for user configuration for
settings that are not directly related to the connection. The new
method provides an easy way for the user to set variables, either
using code, using environment variables, or using a settings file
(default: ``settings.toml`` in the same configuration folder referred
to above).
- We make displaying progress bars configurable using the
``show_progress_bar`` and ``show_progress_bar_after`` (for the delay
in seconds) variables.
- To make the configuration folder and display the folder in the user’s
file browser, the user can now call ``python -m crandas config``.
- We support the ``Any`` placeholder for ``get_table``
- We support ``stepless`` mode in scripts, that can be manually enabled
to remove ``script_step`` numbers from certain queries. This can be
useful together with the ``Any`` placeholder, to have queries that
can be executed a variable number of times.
- Add a ``map_dummy_handles`` override in call to ``get_table``
- In ``CDataFrame.assign``, we now support the use of column names that
correspond to engine query arguments (e.g. “name”, “bitlength”).
**BREAKING**: existing scripts that use these engine query arguments
will now give an error message explaining how these arguments should
be specified. Existing authorizations are not affected.
- Add support for the following operators in regular expressions:
- ``{n}``: match exactly n times
- ``{min,}``: match at least min times
- ``{,max}``: match at most max times
- ``{min,max}``: match at least min and at most max times
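These quantifiers use the customary regular-expression notation. As an analogue, the sketch below exercises the same four forms with Python's standard ``re`` module rather than ``crandas.re.Re``, on the assumption that the quantifier semantics coincide.

```python
import re

def full(pattern: str, text: str) -> bool:
    # True if the whole string matches the pattern.
    return re.fullmatch(pattern, text) is not None

# {n}: exactly n times
exactly_two = [full(r"a{2}", s) for s in ("a", "aa", "aaa")]
# {min,}: at least min times
at_least_two = [full(r"a{2,}", s) for s in ("a", "aa", "aaa")]
# {,max}: at most max times (including zero times)
at_most_two = [full(r"a{,2}", s) for s in ("", "aa", "aaa")]
# {min,max}: between min and max times
two_to_three = [full(r"a{2,3}", s) for s in ("a", "aaa", "aaaa")]

print(exactly_two, at_least_two, at_most_two, two_to_three)
```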
- Support was added to disable HTTP Keep-Alive in connections to the
engine server. This can help solve connection stability issues.
Keep-Alive can be disabled in the connection file by setting
``keepalive = false``. The setting can be overridden by the user by
using the ``keepalive`` parameter of ``crandas.connect``.
- Add ``sort_values`` function to a ``CDataFrame``, which sorts the
dataframe according to a column. Example:
.. code:: python
cdf = cd.DataFrame({"a": [3, 1, 4, 5, 2], "b": [1, 2, 3, 4, 5]}, auto_bounds=True)
cdf = cdf.sort_values("a")
Currently, sorting on strings is not supported.
- Add support for groupby on multiple columns and on all non-nullable
column types.
For example, this is now possible:
.. code:: python
cdf = cd.DataFrame({"a": ["foo", "bar", "foo", "bar"], "b": [1, 1, 1, 2]}, auto_bounds=True)
tab = cdf.groupby(["a", "b"]).as_table()
sorted(zip(tab["a"].open(), tab["b"].open()))
The parameter name of the groupby is renamed from ``col`` to ``cols``
to reflect these changes. Currently, a maximum of around 100 000
unique values are supported. Above that, the groupby will fail and
give an error message. Note that this is the number of *unique*
values. The number of rows can be significantly higher as long as
there are fewer than 100 000 different values in the groupby
column(s). Furthermore, a consequence of the new implementation is
that the output is not order-stable anymore but random.
- Add k-nearest neighbors functionality. This allows the target value
of a new data point to be predicted based on the existing data using
its k nearest neighbors. Example:
.. code:: python
import crandas as cd
from crandas.crlearn.neighbors import KNeighborsRegressor
X_train = cd.DataFrame({"input": [0, 1, 2, 3]}, auto_bounds=True)
y_train = cd.DataFrame({"output": [0, 0, 1, 1]}, auto_bounds=True)
X_test = cd.DataFrame({"input": [1]}, auto_bounds=True)
neigh = KNeighborsRegressor(n_neighbors=3)
neigh.fit(X_train, y_train)
neigh.predict_value(X_test)
For more information, see
``crandas.crlearn.neighbors.KNeighborsRegressor``.
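To make the idea concrete, the following plaintext sketch predicts a value as the unweighted mean of the targets of the k nearest training points, using the same toy data as in the example. It is plain Python, not crandas: ``knn_predict`` is a hypothetical helper, and the use of uniform weights and absolute distance is an assumption of this sketch, not a statement about how crandas computes its prediction.

```python
def knn_predict(x_train, y_train, x_new, k):
    """Mean target value of the k training points closest to x_new."""
    # Sort training indices by distance to the new point.
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / k

# Same toy data as in the crandas example: inputs 0..3, outputs 0, 0, 1, 1.
prediction = knn_predict([0, 1, 2, 3], [0, 0, 1, 1], x_new=1, k=3)
print(prediction)  # neighbors are inputs 1, 0 and 2, with outputs 0, 0 and 1
```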
- Add a new aggregator ``crandas.groupby.any`` that takes any value
from the set of values and is faster than
``crandas.groupby.max``/``crandas.groupby.min``
- In the HTTP connection to the engine server, use retries for certain
HTTP requests to improve robustness
- Add ``created`` property to dataframes and other objects indicating
the date and time when they were uploaded or computed
- Handle cancellation of a query by raising a
``QueryInterruptedError``. This replaces the previous behaviour of
returning ``None`` and printing “Computation cancelled”. In ipython,
the “Computation cancelled” message is still shown.
- In the progress bar for long-running computations, show “no estimate
available yet” as long as progress is at 0% (instead of a more
cryptic notation).
- Add functionality to list uploads to the engine. For more
information, see: ``crandas.stateobject.list_uploads`` and
``crandas.stateobject.get_upload_handles``.
.. _vdl-and-crandas-release-1.8.1:
1.8.1
-----
Crandas fixes
~~~~~~~~~~~~~
- ``crandas.get_table()`` now ensures ``connect()`` is called first
- Fix upload and decoding of positive numbers of 64 bits. In Crandas,
trying to upload and download numbers in the range
``R = [2^{63}, 2^{64} - 1]`` would previously fail. We fix this issue
by mimicking pandas behavior: a number in the range ``R`` is
returned as an ``np.uint64``. In addition, for uploading,
``np.uint64``, ``np.uint32``, and ``np.uint16`` are now recognized as
integers.
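The range ``R`` is exactly the set of values that fit in an unsigned 64-bit integer but not in a signed one, which is why such values are returned as ``np.uint64``. The sketch below checks this with the standard library only; ``fits`` is a hypothetical helper.

```python
def fits(value: int, bits: int = 64, signed: bool = True) -> bool:
    """Return True if value can be stored in an integer of the given width."""
    try:
        value.to_bytes(bits // 8, "little", signed=signed)
        return True
    except OverflowError:
        return False

low, high = 2**63, 2**64 - 1  # the two ends of the range R

print(fits(low, signed=True), fits(low, signed=False))
print(fits(high, signed=False), fits(high + 1, signed=False))
```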
.. _vdl-and-crandas-release-1.8.0:
1.8.0
-----
Major new features include:
- Support for bigger (96 bit) integers
- Progress bars for running queries and the possibility of cancelling
running queries
- Memory usage improvements (client & server)
- Null value (missing values) support for all column types
- Searching strings using regular expressions
- Added a date column type
.. _new-features-1:
New features
~~~~~~~~~~~~
- Support for columns with bigger (96 bit) integers
Just like in the previous version, integers have the ctype ``int``.
When specifying the ctype, minimum and maximum bounds for the values
can be supplied using the ``min`` and ``max`` parameters,
e.g. ``int[min=0, max=1000]``. Bounds (strictly) between -2^95 and
2^95 are now supported.
For example, to upload a column ``"col": [1, 2, 3, 4]`` as an ``int``
use the following ``ctype spec``:
.. code:: python
table = cd.DataFrame({"col":[1, 2, 3, 4]}, ctype={"col": "int[min=1,max=4]"})
as before.
To force usage of a particular modulus the integer ctype accepts the
keyword argument ``modulus``, which can be set to either of the
moduli that are hardcoded in ``crandas.moduli``. For example, to
force usage of large integers one can run:
.. code:: python
from crandas.moduli import moduli
table = cd.DataFrame({"col":[1, 2, 3, 4]}, ctype={"col": f"int[min=1,max=4,modulus={moduli[128]}]"})
Notes:
- crandas will automatically switch to
``int[modulus={moduli[128]}]`` if the (derived) bounds do not fit
in an ``int32``.
- crandas will throw an error if the bounds do not fit in an
``int96``.
We refer to 32-bit integer columns as F64, and 96-bit integer columns
as F128, because they are internally represented as 64-bit and 128-bit
numbers, respectively, since we account for a necessary security
margin.
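The selection rule from the notes above can be sketched as a small function. This is an illustration under stated assumptions: ``pick_field`` is hypothetical, "fits in an ``int32``" is taken to mean the usual two's-complement range, and the 96-bit limit is taken from the strict bounds -2^95 and 2^95 mentioned earlier. The actual checks in crandas may differ in detail.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1  # assumed meaning of "fits in an int32"
LARGE_LIMIT = 2**95                       # bounds must lie strictly inside +/- 2^95

def pick_field(min_bound: int, max_bound: int) -> str:
    """Return "F64" or "F128" for the given column bounds, or raise."""
    if min_bound > max_bound:
        raise ValueError("min must not exceed max")
    if INT32_MIN <= min_bound and max_bound <= INT32_MAX:
        return "F64"    # small bounds: 32-bit integers
    if -LARGE_LIMIT < min_bound and max_bound < LARGE_LIMIT:
        return "F128"   # larger bounds: 96-bit integers
    raise ValueError("bounds do not fit in a 96-bit integer")

small = pick_field(1, 4)
large = pick_field(0, 2**40)
try:
    pick_field(0, 2**95)
    too_large_rejected = False
except ValueError:
    too_large_rejected = True

print(small, large, too_large_rejected)
```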
Supported features for large integers:
- Basic binary arithmetic ``(+, -, *, ==, <, >, <=, >=)`` between
any two integer columns
- Groupby and filter on large integers
- Unary functions on large integer columns, such as
``mean(), var(), sum(), ...``
- ``if_else`` where the 3 arguments ``guard``, ``ifval``,
``elseval`` may be any integer column
- Conversion from 32-bit integer columns to large integer columns
via ``astype`` and vice versa
- Vertical concatenation of integer columns based on different
moduli
- Performing a join on columns based on different moduli
Current limitations:
- We do not yet support string conversion to large integers
- ``json_to_val`` currently only allows integers up to int32
- IntegerList is currently only defined over F64
Changes:
- ``base.py``: deprecated ``session.modulus``
- ``crandas.py``: class ``Col`` and ``ReturnValue`` present also the
``modulus``
- ``ctypes.py``:
- added support to encode/decode integers of 128 bits
- made ctype class decoding modulus dependent
- ``input.py``: ``mask`` and ``unmask`` are now dependent on the
modulus
- ``placeholders.py``: class Masker now also contains a modulus
- NEW FILE ``moduli.py``: containing the default moduli for F64 as
well as F128.
- Searching strings and regular expressions
To search a string column for a particular substring, use the
``CSeries.contains`` function:
.. code:: python
table = cd.DataFrame({"col": ["this", "is", "a", "text", "column"]})
only_is_rows = table["col"].contains("is")
table[only_is_rows].open()
Regular expressions are also supported, using the new
``CSeries.fullmatch`` function:
.. code:: python
import crandas.re
table = cd.DataFrame({"col": ["this", "is", "a", "text", "column"]})
starts_with_t = table["col"].fullmatch(cd.re.Re("t.*"))
table[starts_with_t].open()
Regular expressions support the following operations:
- ``|``: union
- ``*``: Kleene star (zero or more)
- ``+``: one or more
- ``?``: zero or one
- ``.``: any character (note that this also matches non-printable
characters)
- ``(``, ``)``: regexp grouping
- ``[...]``: set of characters (including character ranges, e.g.,
``[A-Za-z]``)
- ``\\d``: digits (equivalent to ``[0-9]``)
- ``\\s``: whitespace (equivalent to ``[\\\\ \\t\\n\\r\\f\\v]``)
- ``\\w``: alphanumeric and underscore (equivalent to
``[a-zA-Z0-9_]``)
- ``(?1)``, ``(?2)``, …: substring (given as additional argument to
``CSeries.fullmatch()``)
Regular expressions are represented by the class ``crandas.re.Re``.
It uses pyformlang’s functionality under the hood.
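The difference between substring search and full matching can be illustrated on plaintext values with Python's standard ``re`` module. This is only an analogue: it uses neither ``crandas.re.Re`` nor ``CSeries``, and it assumes that ``contains`` and ``fullmatch`` have the usual substring and whole-string semantics.

```python
import re

values = ["this", "is", "a", "text", "column"]

# Substring search: is "is" present anywhere in the value?
contains_is = ["is" in v for v in values]

# Full match: the whole value must match the pattern "t.*".
starts_with_t = [re.fullmatch(r"t.*", v) is not None for v in values]

# Without the trailing ".*", only the exact string "t" would match.
only_t = [re.fullmatch(r"t", v) is not None for v in values]

print(contains_is, starts_with_t, only_t)
```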
- Efficient text operations for ASCII strings
The ``varchar`` ctype now has an ASCII mode for increased efficiency
with strings that contain only ASCII characters (no “special”
characters; all codepoints <= 127). Before this change, we only
supported general Unicode strings. Certain operations (in particular,
comparison, searching, and regular expression matching), are more
efficient for ASCII strings.
By default, crandas autodetects whether or not the more efficient
ASCII mode can be used. This information (whether or not ASCII mode
is used) becomes part of the public metadata of the column, and
crandas will give a ``ColumnBoundDerivedWarning`` to indicate that
the column metadata is derived from the data in the column, unless
``auto_bounds`` is set to True.
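The autodetection criterion (all codepoints <= 127) can be sketched in plain Python. ``detect_varchar_mode`` is a hypothetical helper for illustration: it mirrors the rule stated above but is not the crandas implementation.

```python
def detect_varchar_mode(values) -> str:
    """Return the ctype that the codepoint rule would select."""
    if all(ord(ch) <= 127 for value in values for ch in value):
        return "varchar[ascii]"
    return "varchar[unicode]"

ascii_mode = detect_varchar_mode(["string"])
unicode_mode = detect_varchar_mode(["stri\U0001F600ng"])
print(ascii_mode, unicode_mode)
```

Because the outcome depends on the data, it reveals one bit of information about the column, which is consistent with the warning described above.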
Instead of auto-detection, it is also possible to explicitly specify
the ctype ``varchar[ascii]`` or ``varchar[unicode]``, e.g.:
.. code:: python
import crandas as cd
# ASCII autodetected: efficient operations available; warning given
cdf = cd.DataFrame({"a": ["string"]})
# Unicode autodetected: efficient operations not available; warning given
cdf = cd.DataFrame({"a": ["stri\U0001F600ng"]})
# ASCII annotated; efficient operations available; no warning given
cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[ascii]"})
# Unicode annotated; efficient operations not available; no warning given
cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[unicode]"})
- Running computations can now be cancelled
Locally aborting a computation (e.g. Ctrl+C) will now cause it to be
cancelled on the server as well.
- Rename ``crandas.query`` to ``crandas.command`` to be consistent with
the server-side implementation and to differentiate from the new
``crandas.queries`` module
- Add module ``crandas.queries`` providing a client-side implementation
of the task-oriented VDL query API, and use this for all queries
performed via ``vdl_query``. To perform queries, a block-then-poll
strategy is used: first, a blocking query with a timeout of 5
seconds is performed; if the result is not ready by then, status
update polls are done at a 1-second interval
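The block-then-poll strategy can be sketched generically with the standard library. This is not the ``crandas.queries`` implementation: ``block_then_poll`` and ``slow_query`` are hypothetical, a local thread stands in for the server-side task, and the demo uses much shorter intervals than the 5 seconds and 1 second mentioned above so that it finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def block_then_poll(future, block_timeout=5.0, poll_interval=1.0):
    """Block for up to block_timeout, then poll until the result is ready.

    Returns the result together with the number of polls that were needed.
    """
    polls = 0
    try:
        # Phase 1: a single blocking wait with a timeout.
        return future.result(timeout=block_timeout), polls
    except FutureTimeout:
        pass
    # Phase 2: periodic status checks until the task has finished.
    while not future.done():
        time.sleep(poll_interval)
        polls += 1
    return future.result(), polls

def slow_query():
    time.sleep(0.3)  # stand-in for a long-running computation
    return 42

with ThreadPoolExecutor(max_workers=1) as pool:
    result, polls = block_then_poll(
        pool.submit(slow_query), block_timeout=0.05, poll_interval=0.05
    )

print(result, polls)
```

Short queries return from the first phase without any polling, while long-running ones fall through to the periodic status checks.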
- All column types now support missing values
All ctypes now support a ``nullable`` flag, indicating that values
may be missing. It may also be specified using a question mark,
e.g. ``varchar?``.
- Progress reporting for long-running queries
Queries that take at least 5 seconds now result in a progress bar
being displayed that estimates the progress of the computation.
To enable this for Jupyter notebooks, note that crandas should be
installed with the ``notebook`` dependency flag, see below.
- Various memory improvements for both server and client
- Large data uploads and downloads are now automatically chunked
Uploads are processed in batches of size
``crandas.ctypes.ENCODING_CHUNK_SIZE``.
- Added a date column type
Dates can now be encoded using the ``date`` ctype.
- Dates are limited to the range 1901/01/01 - 2099/12/31 for leap year
reasons
- Ability to subtract two dates to get number of days and add days
to a date
- All comparison operators apply for date
- Created functions for ``year``, ``month``, ``day`` and ``weekday``
- Able to group over dates, merge and filter
- New ctype ``DateCtype`` converts strings (through
``pd.to_datetime``) and python dates (``datetime.date``,
``datetime64`` and ``pd.timestamp``) into crandas dates
- Helper subclass of ``CSeries`` ``_DT`` allows for pandas-style
calling of date retrieval functions (``col.dt.year``) *and*
standard calls (``col.year``).
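The date operations listed above have direct plaintext counterparts in Python's ``datetime`` module, sketched below. This is an analogue rather than crandas code; in particular, the weekday numbering shown is Python's (Monday is 0) and crandas may use a different convention. The final check illustrates one plausible reading of "leap year reasons", which is an interpretation and not stated in this changelog: within 1901 to 2099, a year is a leap year exactly when it is divisible by 4.

```python
import calendar
from datetime import date, timedelta

start = date(2024, 2, 1)
end = date(2024, 3, 1)

days_between = (end - start).days           # subtracting dates gives days
shifted = start + timedelta(days=29)        # adding days gives a date
is_later = end > start                      # comparison operators apply
parts = (end.year, end.month, end.day)      # component accessors
weekday = end.weekday()                     # Python convention: Monday == 0

# Within the supported range, the simple "divisible by 4" rule is exact.
simple_rule_holds = all(
    calendar.isleap(year) == (year % 4 == 0) for year in range(1901, 2100)
)
# Just outside the range, the rule breaks down: 1900 and 2100 are not leap years.
boundary_years = (calendar.isleap(1900), calendar.isleap(2100))

print(days_between, shifted, parts, weekday, simple_rule_holds, boundary_years)
```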
.. _crandas-21:
Crandas
~~~~~~~
- New dependencies: ``tqdm`` and ``pyformlang``
- New dependency flag: ``notebook``, for features related to Jupyter
notebooks. Use ``pip install crandas[notebook]`` to install these.
- The ``urllib3`` dependency is updated to ensure that
``assert_hostname = False`` works as expected
- Documentation updates
- Recording or loading a new script when there is already another
script active now no longer gives an error, but a warning message is
printed instead.
- Support ``with_threshold`` for aggregation
This adds support for
e.g. ``table["column"].with_threshold(10).sum()``. Before this
change, ``with_threshold()`` was only supported for filtering
operations, e.g. ``table[filter.with_threshold(5)]``, and not for
aggregation operations (min, max, sum, etc.).
Note that the alternative that worked before
``table["column"].sum(threshold=5)`` is still supported, for both
aggregation and filtering operations.
Minor change: supplying both with_threshold() and a threshold
argument now raises a ValueError instead of a TypeError when these
are different.
- Implement setter for ``base_path``
The crandas ``Session`` object now supports setting ``base_path`` to
either a string, a Path, or None. Retrieving the property will always
return a Path.
- Fix problem where calling size() on a groupby object would fail for
int32 columns
- Improved message for auto-determined bounds
- Collect all auto_bounds warnings from a data upload into a single
warning message
- Allow setting ``auto_bounds`` globally in ``crandas.base.session``