.. _changelog:

Changelog
###########

v1.8.0
-------

Major new features include:

- Support for bigger (96 bit) integers
- Progress bars for running queries and the possibility of cancelling running queries
- Memory usage improvements (client & server)
- Null (missing) value support for all column types
- Searching strings using regular expressions
- Added a date column type

New features
==============

- Support for columns with bigger (96 bit) integers

  Just like in the previous version, integers have the ctype ``int``. When specifying the ctype,
  minimum and maximum bounds for the values can be supplied using the ``min`` and ``max`` parameters,
  e.g. ``int[min=0, max=1000]``. Bounds (strictly) between -2^95 and 2^95 are now supported.

  For example, to upload a column ``"col": [1, 2, 3, 4]`` as an ``int`` use the following ``ctype_spec``:

  .. code:: python

      table = cd.DataFrame({"col": [1, 2, 3, 4]}, ctype={"col": "int[min=1,max=4]"})

  as before.

  To force the use of a particular modulus, the integer ctype accepts the keyword argument ``modulus``.
  For example, to force the use of large integers, one can run:

  .. code:: python

    import crandas as cd
    from crandas.moduli import moduli
    table = cd.DataFrame({"col": [1, 2, 3, 4]}, ctype={"col": f"int[min=1,max=4,modulus={moduli[128]}]"})

  
  Notes:

  * crandas will automatically switch to ``int[modulus={moduli[128]}]`` if the (derived) bounds do not fit in an ``int32``.
  * crandas will throw an error if the bounds do not fit in an ``int96``.

  We refer to 32-bit integer columns as F64 and to 96-bit integer columns as F128, because they are
  internally represented as 64-bit and 128-bit numbers, respectively, to account for a necessary
  security margin.
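
  The selection rule from the notes above can be sketched in plain Python. This is an illustration
  only and not crandas code: the helpers ``fits_strictly`` and ``pick_integer_width`` are
  hypothetical, and the sketch assumes that the ``int32`` range is bounded analogously to the
  96-bit case, i.e. strictly between -2^31 and 2^31.

  .. code:: python

    def fits_strictly(lo, hi, bits):
        """Return True if lo and hi lie strictly between -2^(bits-1) and 2^(bits-1)."""
        limit = 2 ** (bits - 1)
        return -limit < lo <= hi < limit

    def pick_integer_width(lo, hi):
        """Pick the smallest integer width (32 or 96 bits) that can hold the bounds."""
        if fits_strictly(lo, hi, 32):
            return 32  # assumed to correspond to an F64 column
        if fits_strictly(lo, hi, 96):
            return 96  # assumed to correspond to an F128 column
        raise ValueError("bounds do not fit in a 96-bit integer")

    print(pick_integer_width(1, 4))
    print(pick_integer_width(0, 2 ** 40))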

  **Supported features for large integers:**
  
  * Basic binary arithmetic and comparison operations (``+``, ``-``, ``*``, ``==``, ``<``, ``>``, ``<=``, ``>=``) between any two integer columns
  * Groupby and filter on large integers
  * Unary functions on large integer columns, such as ``mean(), var(), sum(), ...``
  * ``if_else`` where the 3 arguments ``guard``, ``ifval``, ``elseval`` may be any integer column
  * Conversion from 32-bit integer columns to large integer columns via ``astype`` and vice versa
  * Vertical concatenation of integer columns based on different moduli
  * Performing a join on columns based on different moduli
  
  Current limitations:

  * We do not yet support string conversion to large integers
  * ``json_to_val`` currently only supports integers up to int32
  * IntegerList is currently only defined over F64

- Searching strings using substrings and regular expressions

  To search a string column for a particular substring, use the ``CSeries.contains`` function:
  
  .. code:: python

    table = cd.DataFrame({"col": ["this", "is", "a", "text", "column"]})
    only_is_rows = table["col"].contains("is")
    table[only_is_rows].open()


  Regular expressions are also supported, using the new ``CSeries.fullmatch`` function:

  .. code:: python

    import crandas as cd
    import crandas.re
    table = cd.DataFrame({"col": ["this", "is", "a", "text", "column"]})
    starts_with_t = table["col"].fullmatch(cd.re.Re("t.*"))
    table[starts_with_t].open()


  Regular expressions support the following operations:

  * ``|``: union
  * ``*``: Kleene star (zero or more)
  * ``+``: one or more
  * ``?``: zero or one
  * ``.``: any character (note that this also matches non-printable characters)
  * ``(``, ``)``: regexp grouping
  * ``[...]``: set of characters (including character ranges, e.g., ``[A-Za-z]``)
  * ``\d``: digits (equivalent to ``[0-9]``)
  * ``\s``: whitespace (equivalent to ``[ \t\n\r\f\v]``)
  * ``\w``: alphanumeric and underscore (equivalent to ``[a-zA-Z0-9_]``)
  * ``(?1)``, ``(?2)``, ...: substring (given as additional argument to ``CSeries.fullmatch()``)

  Regular expressions are represented by the class ``crandas.re.Re``. It uses pyformlang's
  functionality under the hood.
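
  For readers familiar with Python's built-in ``re`` module, the two search styles can be mimicked
  on plaintext data as follows. This is a rough analogy only: it does not use crandas, the standard
  library accepts more syntax than the operations listed above, and ``re.DOTALL`` is passed on the
  assumption that it best approximates ``.`` matching any character.

  .. code:: python

    import re

    values = ["this", "is", "a", "text", "column"]

    # Substring search, analogous to CSeries.contains("is")
    contains_is = [v for v in values if "is" in v]

    # Whole-string match, analogous to CSeries.fullmatch with the pattern "t.*"
    pattern = re.compile(r"t.*", re.DOTALL)
    starts_with_t = [v for v in values if pattern.fullmatch(v)]

    print(contains_is)
    print(starts_with_t)

  Note that a full match requires the pattern to cover the entire string, which is why the
  pattern ends in ``.*`` to select strings that merely start with ``t``.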

- Efficient text operations for ASCII strings

  The ``varchar`` ctype now has an ASCII mode for increased efficiency with strings that contain
  only ASCII characters (no "special" characters; all codepoints <= 127). Before this change, we
  only supported general Unicode strings. Certain operations (in particular comparison, searching,
  and regular expression matching) are more efficient for ASCII strings.

  By default, crandas autodetects whether or not the more efficient ASCII mode can be used. This
  information (whether or not ASCII mode is used) becomes part of the public metadata of the column,
  and crandas will give a ``ColumnBoundDerivedWarning`` to indicate that the column metadata is
  derived from the data in the column, unless ``auto_bounds`` is set to True.
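
  The autodetection criterion can be pictured with a small plain-Python sketch. The helper
  ``detect_varchar_mode`` is hypothetical and only mirrors the "all codepoints <= 127" rule
  described above; it is not the crandas implementation.

  .. code:: python

    def detect_varchar_mode(values):
        """Return the ctype annotation that matches the given strings."""
        # str.isascii() is True exactly when every codepoint is <= 127
        if all(v.isascii() for v in values):
            return "varchar[ascii]"
        return "varchar[unicode]"

    print(detect_varchar_mode(["string"]))
    print(detect_varchar_mode(["stri\U0001F600ng"]))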

  Instead of auto-detection, it is also possible to explicitly specify the ctype ``varchar[ascii]`` or
  ``varchar[unicode]``, e.g.:

  .. code:: python

    import crandas as cd

    # ASCII autodetected: efficient operations available; warning given
    cdf = cd.DataFrame({"a": ["string"]})

    # Unicode autodetected: efficient operations not available; warning given
    cdf = cd.DataFrame({"a": ["stri\U0001F600ng"]})

    # ASCII annotated; efficient operations available; no warning given
    cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[ascii]"})

    # Unicode annotated; efficient operations not available; no warning given
    cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[unicode]"})


- Running computations can now be cancelled

  Locally aborting a computation (e.g. Ctrl+C) will now cause it to be cancelled on the server as
  well.

- All column types now support missing values

  All ctypes now support a ``nullable`` flag, indicating that values may be missing. It may also be
  specified using a question mark, e.g. ``varchar?``.

- Progress reporting for long-running queries

  Queries that take at least 5 seconds now result in a progress bar being displayed that estimates
  the progress of the computation.

  To enable this in Jupyter notebooks, crandas should be installed with the ``notebook`` dependency flag, see below.
  
    
- Various memory improvements for both server and client
- Large data uploads and downloads are now automatically chunked

  Uploads are processed in batches of size ``crandas.ctypes.ENCODING_CHUNK_SIZE``.
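
  Conceptually, chunking splits the rows into consecutive batches. The sketch below illustrates
  this in plain Python; the helper ``iter_chunks`` and the chunk size of 3 are made up for the
  example and do not reflect the actual value of ``crandas.ctypes.ENCODING_CHUNK_SIZE``.

  .. code:: python

    def iter_chunks(rows, chunk_size):
        """Yield consecutive slices of rows with at most chunk_size items each."""
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        for start in range(0, len(rows), chunk_size):
            yield rows[start:start + chunk_size]

    # Eight rows in batches of three: the last batch is shorter
    batches = list(iter_chunks(list(range(8)), chunk_size=3))
    print(batches)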

- Added a date column type

  * Dates can now be encoded using the ``date`` ctype.
  * Dates are limited to the range 1901/01/01 to 2099/12/31 for leap year reasons
  * Two dates can be subtracted to get the number of days between them, and days can be added to a date
  * All comparison operators can be applied to dates
  * Added functions to extract ``year``, ``month``, ``day`` and ``weekday``
  * Dates can be used in group by, merge and filter operations
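
  The date operations listed above have direct plaintext counterparts in Python's ``datetime``
  module, sketched below without crandas. One plausible reading of the leap year remark is that
  within 1901 to 2099 a year is a leap year exactly when it is divisible by 4, since 1900 and 2100
  are not leap years; the first assertion checks this. The weekday numbering shown is that of
  ``datetime`` (Monday is 0) and is an assumption with respect to crandas.

  .. code:: python

    import calendar
    from datetime import date, timedelta

    # Within the supported range, the simple "divisible by 4" rule is exact
    assert all(calendar.isleap(y) == (y % 4 == 0) for y in range(1901, 2100))

    start = date(2023, 1, 31)
    end = date(2023, 3, 1)

    # Subtracting two dates gives a number of days
    days_between = (end - start).days

    # Adding days to a date gives a new date
    later = start + timedelta(days=30)

    # Extracting year, month, day and weekday
    parts = (start.year, start.month, start.day, start.weekday())

    print(days_between, later, parts)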


- Support for ``with_threshold`` in aggregation operations
  
  This adds support for e.g. ``table["column"].with_threshold(10).sum()``. Before this change,
  ``with_threshold()`` was only supported for filtering operations, e.g.
  ``table[filter.with_threshold(5)]``, and not for aggregation operations (min, max, sum, etc.).
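
  As a plaintext analogy, a thresholded aggregation can be thought of as follows. The sketch
  assumes that the threshold is a minimum number of contributing rows below which the operation is
  refused; the helper ``sum_with_threshold`` is hypothetical and is not part of crandas.

  .. code:: python

    def sum_with_threshold(values, threshold):
        """Sum the values, but only if at least `threshold` values contribute."""
        if len(values) < threshold:
            raise ValueError(f"fewer than {threshold} values; refusing to aggregate")
        return sum(values)

    print(sum_with_threshold([4, 8, 15], threshold=3))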