Changelog
v1.8.0
Major new features include:
Support for bigger (96 bit) integers
Progress bars for running queries and the possibility of cancelling running queries
Memory usage improvements (client & server)
Null value (missing values) support for all column types
Searching strings using regular expressions
Added a date column type
New features
Support for columns with bigger (96 bit) integers
Just like in the previous version, integers have the ctype int. When specifying the ctype, minimum and maximum bounds for the values can be supplied using the min and max parameters, e.g. int[min=0, max=1000]. Bounds (strictly) between -2^95 and 2^95 are now supported.
For example, to upload a column "col": [1, 2, 3, 4] as an int, use the following ctype_spec, as before:
table = cd.DataFrame({"col": [1, 2, 3, 4]}, ctype={"col": "int[min=1,max=4]"})
To force usage of a particular modulus, the integer ctype accepts the keyword argument modulus. For example, to force usage of large integers one can run:
import crandas as cd
from crandas.moduli import moduli
table = cd.DataFrame({"col": [1, 2, 3, 4]}, ctype={"col": f"int[min=1,max=4,modulus={moduli[128]}]"})
Notes:
* crandas will automatically switch to int[modulus={moduli[128]}] if the (derived) bounds do not fit in an int32.
* crandas will throw an error if the bounds do not fit in an int96.
We refer to 32-bit integer columns as F64, and 96-bit integer columns as F128, because they are internally represented as 64-bit and 128-bit numbers, respectively, to account for a necessary security margin.
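The selection rule from the notes above can be sketched in plain Python. This is an illustration only, not crandas code: the helper name pick_field is hypothetical, and reading "fits in an int32" as the signed range -2^31 to 2^31 - 1 is an assumption that this changelog does not spell out.

```python
# Illustrative sketch only; pick_field is a hypothetical helper, not part of crandas.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1  # assumed meaning of "fits in an int32"
INT96_BOUND = 2**95  # bounds must lie strictly between -2^95 and 2^95


def pick_field(min_value: int, max_value: int) -> str:
    """Return "F64" or "F128" for the given column bounds, or raise if too large."""
    if min_value > max_value:
        raise ValueError("min must not exceed max")
    if INT32_MIN <= min_value and max_value <= INT32_MAX:
        return "F64"
    if -INT96_BOUND < min_value and max_value < INT96_BOUND:
        return "F128"
    raise ValueError("bounds do not fit in an int96")


print(pick_field(1, 4))      # small bounds stay in the 32-bit representation
print(pick_field(0, 2**40))  # larger bounds need the large-integer representation
```

With bounds such as int[min=1,max=4] the sketch stays in F64, which mirrors the statement that the large modulus is only used automatically when the bounds require it.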
Supported features for large integers:
Basic binary arithmetic (+, -, *, ==, <, >, <=, >=) between any two integer columns
Groupby and filter on large integers
Unary functions on large integer columns, such as mean(), var(), sum(), ...
if_else, where the 3 arguments guard, ifval, elseval may be any integer column
Conversion from 32-bit integer columns to large integer columns via astype and vice versa
Vertical concatenation of integer columns based on different moduli
Performing a join on columns based on different moduli
Current limitations:
String conversion to large integers is not yet supported
json_to_val currently only allows integers up to int32
IntegerList is currently only defined over F64
Searching strings and regular expressions
To search a string column for a particular substring, use the CSeries.contains function:
table = cd.DataFrame({"col": ["this", "is", "a", "text", "column"]})
only_is_rows = table["col"].contains("is")
table[only_is_rows].open()
Regular expressions are also supported, using the new CSeries.fullmatch function:
import crandas as cd
import crandas.re
table = cd.DataFrame({"col": ["this", "is", "a", "text", "column"]})
starts_with_t = table["col"].fullmatch(cd.re.Re("t.*"))
table[starts_with_t].open()
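The difference between substring search and full matching can be illustrated locally with Python's standard re module on the same example data. This is only a stand-in for the semantics, not crandas code, and Python's regular expression engine differs in details (for instance, Python's . does not match a newline by default).

```python
import re

words = ["this", "is", "a", "text", "column"]

# Substring search: keep the rows that contain "is" anywhere in the string.
contains_is = [w for w in words if "is" in w]

# Full match: the pattern has to cover the entire string, not just a part of it.
starts_with_t = [w for w in words if re.fullmatch(r"t.*", w)]

print(contains_is)    # ['this', 'is']
print(starts_with_t)  # ['this', 'text']
```

Because a full match must consume the whole string, a pattern such as t.* needs the trailing .* to select every string that starts with t.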
Regular expressions support the following operations:
|: union
*: Kleene star (zero or more)
+: one or more
?: zero or one
.: any character (note that this also matches non-printable characters)
(, ): regexp grouping
[...]: set of characters (including character ranges, e.g., [A-Za-z])
\\d: digits (equivalent to [0-9])
\\s: whitespace (equivalent to [\\\\ \\t\\n\\r\\f\\v])
\\w: alphanumeric and underscore (equivalent to [a-zA-Z0-9_])
(?1), (?2), …: substring (given as additional argument to CSeries.fullmatch())
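The shorthand classes listed above can be checked against their expanded forms with Python's standard re module. This is a sketch under assumptions: it uses Python's escaping and its re.ASCII flag rather than the crandas engine, and the helper name matched is hypothetical.

```python
import re
import string

# Probe characters: all printable ASCII characters, including whitespace.
probe = string.printable


def matched(pattern: str) -> set:
    """Return the probe characters that the pattern fully matches (ASCII rules)."""
    return {c for c in probe if re.fullmatch(pattern, c, flags=re.ASCII)}


same_digits = matched(r"\d") == matched(r"[0-9]")
same_word = matched(r"\w") == matched(r"[a-zA-Z0-9_]")
same_space = matched(r"\s") == matched(r"[ \t\n\r\f\v]")

print(same_digits, same_word, same_space)  # True True True
```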
Regular expressions are represented by the class crandas.re.Re, which uses pyformlang's functionality under the hood.
Efficient text operations for ASCII strings
The varchar ctype now has an ASCII mode for increased efficiency with strings that contain only ASCII characters (no "special" characters; all codepoints <= 127). Before this change, only general Unicode strings were supported. Certain operations (in particular comparison, searching, and regular expression matching) are more efficient for ASCII strings.
By default, crandas autodetects whether or not the more efficient ASCII mode can be used. This information (whether or not ASCII mode is used) becomes part of the public metadata of the column, and crandas will give a ColumnBoundDerivedWarning to indicate that the column metadata is derived from the data in the column, unless auto_bounds is set to True.
Instead of auto-detection, it is also possible to explicitly specify the ctype varchar[ascii] or varchar[unicode], e.g.:
import crandas as cd
# ASCII autodetected: efficient operations available; warning given
cdf = cd.DataFrame({"a": ["string"]})
# Unicode autodetected: efficient operations not available; warning given
cdf = cd.DataFrame({"a": ["stri\U0001F600ng"]})
# ASCII annotated; efficient operations available; no warning given
cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[ascii]"})
# Unicode annotated; efficient operations not available; no warning given
cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[unicode]"})
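The autodetection criterion (all codepoints <= 127) can be sketched in plain Python. The helper name detect_varchar_mode is hypothetical and this is not how crandas necessarily implements the check; it only shows which of the example strings above would qualify for ASCII mode.

```python
# Illustrative sketch only; detect_varchar_mode is a hypothetical helper.
def detect_varchar_mode(values):
    """Return "ascii" if every string has only codepoints <= 127, else "unicode"."""
    if all(ord(ch) <= 127 for s in values for ch in s):
        return "ascii"
    return "unicode"


print(detect_varchar_mode(["string"]))            # ascii
print(detect_varchar_mode(["stri\U0001F600ng"]))  # unicode
```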
Running computations can now be cancelled
Locally aborting a computation (e.g. Ctrl+C) will now cause it to be cancelled on the server as well.
All column types now support missing values
All ctypes now support a nullable flag, indicating that values may be missing. It may also be specified using a question mark, e.g. varchar?.
Progress reporting for long-running queries
Queries that take at least 5 seconds now result in a progress bar being displayed that estimates the progress of the computation.
To enable this for Jupyter notebooks, crandas should be installed with the notebook dependency flag, see below.
Various memory improvements for both server and client
Large data uploads and downloads are now automatically chunked
Uploads are processed in batches of size crandas.ctypes.ENCODING_CHUNK_SIZE.
Added a date column type
Dates can now be encoded using the date ctype.
Dates are limited to the range 1901/01/01 - 2099/12/31 for leap year reasons
Ability to subtract two dates to get the number of days, and to add days to a date
All comparison operators apply to dates
Added functions to extract year, month, day and weekday
Able to group over dates, merge and filter
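The changelog only says the date range is limited "for leap year reasons". One plausible reading, which is an assumption here, is that within 1901 to 2099 every fourth year is a leap year, so the century exceptions of the Gregorian calendar never apply. The sketch below checks that property and shows the date arithmetic with Python's standard datetime module, not with crandas.

```python
import calendar
from datetime import date, timedelta

# Within 1901..2099 the simplified rule "divisible by 4" agrees with the full rule.
simple_rule_holds = all(
    calendar.isleap(y) == (y % 4 == 0) for y in range(1901, 2100)
)
# The neighbouring century years 1900 and 2100 break the simplified rule.
centuries_break_rule = not calendar.isleap(1900) and not calendar.isleap(2100)

# Subtracting two dates gives a number of days; adding days gives a date again.
start, end = date(2024, 2, 28), date(2024, 3, 1)
days_between = (end - start).days

print(simple_rule_holds, centuries_break_rule)  # True True
print(days_between)                             # 2 (2024 is a leap year)
print(start + timedelta(days=days_between) == end)
```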
Support for with_threshold in aggregation operations
This adds support for e.g. table["column"].with_threshold(10).sum(). Before this change, with_threshold() was only supported for filtering operations, e.g. table[filter.with_threshold(5)], and not for aggregation operations (min, max, sum, etc.).
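The idea of a thresholded aggregation can be sketched in plain Python. This assumes that a threshold means the result is only produced when at least that many rows are involved. The function name sum_with_threshold and the error type are hypothetical, and the real check in crandas happens on the server rather than in client code like this.

```python
# Illustrative sketch only; sum_with_threshold is a hypothetical helper.
def sum_with_threshold(values, threshold):
    """Sum the values, refusing to produce a result over fewer than `threshold` rows."""
    if len(values) < threshold:
        raise ValueError(
            f"refusing to aggregate {len(values)} rows (threshold is {threshold})"
        )
    return sum(values)


print(sum_with_threshold(list(range(1, 13)), 10))  # 78
```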