Changelog
1.12.2
Crandas
Fix a bug where reducing the bounds of a column (e.g., using astype) breaks the server when it tries to switch to a smaller internal representation.
Allow fixed-point ctype specifications to define min and max as floats. For example:
    table = cd.DataFrame({"vals": [0.75, 1.25, 2.25]}, ctype={"vals": "fp[min=0.5,max=2.5]"})
Add cd.stats.contingency.crosstab for computing contingency tables.
Add crandas.CSeries.as_series to obtain a ReturnValue for the series.
Add crandas.stats.chi2_contingency for performing a chi-square test on a contingency table.
Add crandas.stats.contingency.expected_freq for computing expected frequencies.
Fix crandas.stats.ttest_ind() to work with unequal variances for larger data.
Fix crandas.stats.chisquare to provide a better error message when the input contains zeros.
Add the crandas.compat, crandas.crypto and crandas.stats packages to pyproject.toml via automatic package discovery.
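As a plain-Python illustration of what the contingency-table functions above compute, the sketch below derives the expected frequencies under independence (row total times column total, divided by the grand total) and the resulting chi-square statistic without continuity correction. This is the textbook definition applied to unencrypted data; that crandas.stats.contingency.expected_freq and crandas.stats.chi2_contingency follow the same definition is an assumption, and the function names below are hypothetical.

```python
def expected_frequencies(observed):
    """Expected cell counts if rows and columns were independent."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Pearson chi-square statistic, without continuity correction."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e  # divides by zero if an expected count is zero
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


observed = [[10, 20], [30, 40]]
expected = expected_frequencies(observed)
statistic = chi_square_statistic(observed)
```

The division by the expected count shows why zeros in the input need an explicit, readable error rather than a low-level failure.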
1.12.1
Crandas
Fix a bug in fixed-point multiplication that caused incorrect results in certain cases.
Fix an issue with printing the min and max values of a fixed-point ctype.
Fix a bug where, in some cases, a schema mismatch between design and production caused an exception rather than a user-friendly error message.
1.12.0
Crandas
Major performance improvements for comparisons and related operations like sort_values and groupby. The performance of these operations is improved by around a factor of 10.
Fix bugs in substitute in the following edge cases:
  Fix crash when the output_size parameter is set too high.
  Fix potential invalid substitutions when Unicode substitutions are applied to ASCII strings.
Add support for fixed-point arrays, including the array functions inner and vsum.
Add support for cd.cut on fixed-point values, bins and labels. For example:
    table = cd.DataFrame({"vals": [1.1, 2.1, 2.2, 3.3]}, ctype={"vals": "fp[precision=30]"})
    table = table.assign(cut=cd.cut(table["vals"], [1.5, 2.5], labels=[0.1, 0.2, 0.3], add_inf=True))
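To make the binning in the cd.cut example concrete, here is a minimal plain-Python sketch of the same idea on unencrypted values. It assumes that add_inf=True adds negative and positive infinity as outer bin edges (which would explain three labels for two bin edges) and that intervals are closed on the right, as in the pandas default; neither is confirmed by this changelog, and the helper name cut is only illustrative.

```python
import bisect
import math


def cut(values, bins, labels, add_inf=False):
    """Map each value to the label of the right-closed interval containing it."""
    edges = list(bins)
    if add_inf:
        # Assumed meaning of add_inf: extend the bins to cover the whole real line.
        edges = [-math.inf] + edges + [math.inf]
    if len(labels) != len(edges) - 1:
        raise ValueError("need exactly one label per interval")
    result = []
    for value in values:
        # Index i such that edges[i] < value <= edges[i + 1].
        index = bisect.bisect_left(edges, value) - 1
        if not 0 <= index < len(labels):
            raise ValueError(f"{value} falls outside all bins")
        result.append(labels[index])
    return result


binned = cut([1.1, 2.1, 2.2, 3.3], [1.5, 2.5], labels=[0.1, 0.2, 0.3], add_inf=True)
```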
Fix a bug that prevented uploading bytes columns with more than 100 000 values.
Add support for Base64 encoding and decoding with bytes_col.b64encode() and string_col.b64decode():
    table = cd.DataFrame({"base64": ["TWFu", "bGlnaHQgd29yaw=="]}, auto_bounds=True)
    table = table.assign(bytes=table["base64"].b64decode())
    # table["bytes"] is now [b"Man", b"light work"]
    table["bytes"].b64encode()  # equal to table["base64"]
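The values in the example above can be checked locally with Python's standard base64 module. This sketch works on plain, unencrypted data and only illustrates the encoding itself; it does not use crandas.

```python
import base64

encoded = ["TWFu", "bGlnaHQgd29yaw=="]

# Decode each Base64 string to raw bytes.
decoded = [base64.b64decode(text) for text in encoded]

# Encoding the bytes again gives back the original strings.
round_trip = [base64.b64encode(raw).decode("ascii") for raw in decoded]
```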
Add support for performing a groupby on more than 100 000 unique values.
Add support for computing the square root: cdf["col"].sqrt()
Add module crandas.stats for performing a number of statistical tests.
override_version_check can now be supplied to the Session constructor, so cd.connect("connection-file", override_version_check=False) now works. Additionally, the CRANDAS_OVERRIDE_VERSION_CHECK environment variable was moved to the cd.config framework.
Support mode and other query arguments for CSeries.sum(), e.g., so that the following now works without opening values:
    table = cd.DataFrame({"values": [1, 2, 3]})
    x = (table["values"] ** 2).sum(mode="regular")  # x is not opened
Improve support for using multiple sessions at the same time with the session query argument and the use of a session as a context handler.
BREAKING: A check is introduced to ensure that objects from different sessions are not mixed. (Previously, if multiple sessions to the same endpoint were used, it was possible to mix objects.)
1.11.0
Crandas
Preserve len(df) of pandas DataFrames without columns.
Support for the concatenation of strings. For example:
    table = cd.DataFrame({"first_name": ["John", "Jan"], "last_name": ["Doe", "Jansen"]}, auto_bounds=True)
    full_names = table["first_name"] + " " + table["last_name"]
    full_names.open()
Add support for upper(), similar to the existing lower(). In addition, it is now also possible to change the case of specific indices only:
    table = cd.DataFrame({"name": ["john", "Jansen"]}, auto_bounds=True)
    table["name"].upper([0]).open()  # Returns ["John", "Jansen"]
    table["name"].upper([1, 3, 5]).open()  # Returns ["jOhN", "JAnSeN"]
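The index-based behaviour of upper() can be mirrored on ordinary Python strings. The sketch below reproduces the results from the example above; the helper name upper_at is hypothetical, and the treatment of indices beyond the end of a string (silently ignored) is inferred from the "john" case with index 5 rather than stated in the changelog.

```python
def upper_at(text, indices=None):
    """Upper-case the whole string, or only the characters at the given indices."""
    if indices is None:
        return text.upper()
    selected = set(indices)
    # Indices past the end of the string simply never match.
    return "".join(ch.upper() if i in selected else ch for i, ch in enumerate(text))


names = ["john", "Jansen"]
first_letter = [upper_at(name, [0]) for name in names]
alternating = [upper_at(name, [1, 3, 5]) for name in names]
```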
crandas.tool.check_connection now cleans up tables after running a dummy computation.
Perform parquet uploads through read_parquet.
Improved implementation of cd.merge:
  Add support for right and one-to-many joins. Now, any combination of left/right/inner/outer and one-to-one/one-to-many/many-to-one joins is supported.
  Improved support for nullable columns. When performing a non-inner join, nullable columns are now output (previously, a non-nullable column with the value zero was returned in some cases).
  Deal with key columns with different column names consistently with pandas, e.g., when joining with left_on="a" and right_on="b", return two columns with the left and right key values, respectively.
  Add support for suffixes.
  Improved progress reporting for one-to-many joins.
  Allow trivial grouping-based join where the right table only has key columns.
BREAKING: The signature of the cd.merge function has not changed, but, because of the above changes, the resulting table may have a different set of columns and/or columns of different types than in previous versions. Moreover, the underlying VDL API command has been renamed and changed internally so that existing authorizations do not apply to the new merge. The old merge is available via crandas as cd.compat.merge_v1.
Fix bug in bytes columns when using non-ASCII characters. Opening such values could give incorrect results.
Add support for (fixed-point) division (/). For example:
    table = cd.DataFrame({"num": [1.2, 3.4, 5.6], "denom": [2.1, 4.3, 5.4]}, auto_bounds=True)
    table = table.assign(div=lambda x: x.num / x.denom)
    table = table.assign(rec_num=lambda x: 1 / x.num)  # computing the reciprocal of num
    table.open()
Add support for strip() on string columns, which removes leading and trailing spaces.
Add support for floor division (//).
Add support for RIPEMD-160:
    import crandas.crypto.hash as hash
    tab = cd.DataFrame({"a": [b"Test 1", b"Test 2"]}, auto_bounds=True)
    h = hash.RIPEMD_160
    h.digest(tab["a"]).open()
    # HMAC is also supported
    tab_key = cd.DataFrame({"key": [bytes.fromhex("0123456789abcdef0123456789abcdef01234567")]}, auto_bounds=True)
    hmac = hash.HMAC_RIPEMD_160(tab_key["key"].as_value())
    hmac.digest(tab["a"]).open()
Add support for AES encryption:
    import crandas.crypto.cipher as cipher
    table = cd.DataFrame({"a": [bytes.fromhex("00112233445566778899aabbccddeeff")] * 2}, ctype={"a": "bytes[16]"})
    tab_key = cd.DataFrame({"key": [bytes.fromhex("000102030405060708090a0b0c0d0e0f")]}, ctype={"key": "bytes[16]"})
    aes_128 = cipher.AES_128(tab_key["key"].as_value())
    aes_128.encrypt(table["a"]).open()
Add support for len and slicing on bytes columns: bytes_col.len() and bytes_col[16:32].
Add support for conversion to and from lowercase hex strings: string_col.hex_to_bytes() and bytes_col.bytes_to_hex().
Add support for encoding ASCII strings as bytes: ascii_col.encode().
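For readers less familiar with byte strings, the bytes-column operations listed above have direct counterparts on Python's built-in bytes type. The sketch below uses only the standard library on a single plain value; it is meant as an analogy, and it is an assumption that the crandas column methods behave the same way element by element.

```python
raw = bytes.fromhex("00112233445566778899aabbccddeeff")

length = len(raw)                   # counterpart of bytes_col.len()
window = raw[4:8]                   # counterpart of slicing, e.g. bytes_col[4:8]
as_hex = raw.hex()                  # counterpart of bytes_col.bytes_to_hex(): lowercase hex
from_hex = bytes.fromhex(as_hex)    # counterpart of string_col.hex_to_bytes()
encoded = "Test 1".encode("ascii")  # counterpart of ascii_col.encode()
```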
Add bitwise operations on bytes columns: AND (&), OR (|), XOR (^), NEGATE (~).
Add support for string substitution:
    table = cd.DataFrame({"a": ["PÆR", "á"]}, auto_bounds=True)
    table["a"].substitute({"a": ["á", "à", "ä"], "AE": ["Æ"]}, output_size=4).open()
Add support for filtering characters:
    table = cd.DataFrame({"a": ["Test string", "More"]}, auto_bounds=True)
    table["a"].filter_chars(["a", "e", "i", "o", "u"]).open()
Add support for reading stored analyst, approver, and server keys in PEM format.
Fix bug where uploading a series with only NULL values would give an error.
Fix bug where repr(cdf) and str(cdf) would not deal correctly with zero-row dataframes.
We move auto_bounds from a Session property to a configuration variable (using crandas.config.settings.auto_bounds). Setting this variable to True suppresses data-derived column bound warnings by default. Each session object now has a deprecated auto_bounds property that gets/sets the configuration variable.
BREAKING: This removes the possibility of a user having two concurrent sessions with different auto_bounds values set.
Allow passing pd.read_csv arguments (e.g., delimiter) as arguments to cd.read_csv.
Warn the user when calling crandas in a conditional context (e.g., an if statement) during script recording. See the documentation of the crandas.check_recording module for details.
Warn users to specify a validate argument when using merge during script recording. See the documentation of the crandas.check_recording module for details.
Allow specifying ctype as an argument to cd.Series.
Expose ctype and schema of a column as properties of the classes Col (cdf.columns.cols[ix]) and CSeriesColRef (cdf["col"]); and of a CDataFrame.
Add function CDataFrame.astype() that converts the type of individual columns (via the ctype parameter) or the full CDataFrame (via the schema parameter).
Add schema parameters to the upload_pandas_dataframe, read_csv, DataFrame and read_parquet functions. For the ctype parameter, warn if the corresponding column does not exist.
Add functions pandas_dataframe_schema and read_csv_schema that return the schema corresponding to a DataFrame or CSV file.
A server-side schema check for get_table is introduced. When get_table is used in a script, the schema of the resulting table is stored in the recorded script. When the script is used, a server-side check for adherence to the schema is performed.
BREAKING: using get_table in a script where the tables do not match between recording and using the script now produces an error; see the documentation of get_table for details.
Add tilde expansion to cd.base.Session.connect().
Improved error messages for: using get_table on a non-dummy handle in script recording; invalid arguments to cut, e.g., non-integer bins or labels; sending unauthorized queries where authorization is needed; invalid how argument to merge; use of None-like values in functions (e.g., x.if_else(y, None)); use of unknown ctypes (e.g., ctype={"a": "str"}); uploading fixed-point columns where integers may be intended (e.g., uploading pd.Series([1, 2, None])).
Fix bug where the use of a value placeholder (e.g., cdf.assign(b=lambda x: x.a + cd.placeholders.Any(1))) would in many cases not work.
1.10.2
This is a bugfix release.
Crandas
We update the pyformlang dependency to fix bugs in character ranges.
1.10.1
This is a bugfix release.
Crandas
We give better errors when receiving unexpected responses from the server.
We fix a performance regression of the groupby operation when it is performed on a single F64 column. It is now again as fast as in version 1.9.
1.10.0
The major new feature is expanded support for fixed-point columns.
Crandas
Expanded support for fixed-point columns:
  Fixed-point columns now support a larger range and precision (96 bits).
  Fixed-point columns now support various statistical functions (min(), max(), sum(), sum_squares(), mean(), var()).
  Support for arithmetic operations between two fixed-point columns, and between fixed-point and integer columns, is added. (NB: we do not yet support division; this will be added in a later release.)
  Support for concatenation of integer and fixed-point columns (resulting in a fixed-point column) is added.
  Support for join and filtering on fixed-point columns is added.
  Parsing of floats in column operations such as filters or assign is supported.
The new dropna function removes rows with any missing values from a CDataFrame.
The new save function can be used to save an object such as a CDataFrame. If persistence is enabled on the server, this means that the object is kept across server restarts. The save command may also be used to attach a name to a computed table, e.g., table.save(name="my_table").
The connection file and Session now both have an optional api_token property. This is sent to the server and may be used for authentication purposes.
The functions obj.remove() and cd.remove_objects() have been changed to provide more information in case non-existent object(s) are removed.
Support for division is added.
BREAKING: when removing multiple objects using cd.remove_objects(lst), the new behavior is to try to remove all objects even if errors are encountered. The old behavior was to abort on the first error. See the documentation for details.
1.9.2
1.9.1
Crandas
No changes.
1.9.0
Crandas
The Session object now has two settings modes, depending on whether a VDL connection file is used (recommended method), or whether the endpoint, certificate, and server public keys are specified manually (legacy method). These are reflected in the settings_mode attribute of the Session object.
When endpoint is set by the user, the Session is set to legacy mode; otherwise, the connection file method is assumed. When the user does not configure anything, the default is to load the default.vdlconn file, residing in the configuration folder (default: ~/.config/crandas, overridable by the CRANDAS_HOME environment variable). The name default.vdlconn can be overridden through the default_connection_file variable. If that file is not present, the configuration folder is scanned for files with the extension .vdlconn. If there is a single file, it is used. If there are multiple, an error is raised.
analyst_key is now a read-write property that returns the nacl SigningKey, and can be set to either a SigningKey, a filename, a path, or None. When set to None, the default key will be loaded. Both the default key file and the default relative path depend on the settings mode. For connection file mode, the default key file is analyst.sk; a path (a Path, or a string that includes a slash “/”) is resolved relative to the current working directory, while a filename (a string that does not include a slash) is assumed to reside in the configuration folder. For legacy mode, the default key file is clientsign.sk and the default relative path is the base_path (to maintain backwards compatibility).
Besides the Session object, which is used to configure the connection to the VDL, we introduce Dynaconf for user configuration of settings that are not directly related to the connection. The new method provides an easy way for the user to set variables, either using code, using environment variables, or using a settings file (default: settings.toml in the same configuration folder referred to above).
We make displaying progress bars configurable using the show_progress_bar and show_progress_bar_after (for the delay in seconds) variables.
To create the configuration folder and display it in the user’s file browser, the user can now call python -m crandas config.
We support the Any placeholder for get_table.
We support stepless mode in scripts, which can be manually enabled to remove script_step numbers from certain queries. This can be useful together with the Any placeholder, to have queries that can be executed a variable number of times.
Add a map_dummy_handles override in the call to get_table.
In CDataFrame.assign, we now support the use of column names that correspond to VDL query arguments (e.g. “name”, “bitlength”).
BREAKING: existing scripts that use these VDL query arguments will now give an error message explaining how these arguments should be specified. Existing authorizations are not affected.
Add support for the following operators in regular expressions:
  {n}: match exactly n times
  {min,}: match at least min times
  {,max}: match at most max times
  {min,max}: match at least min and at most max times
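The four repetition operators above have the same notation in Python's standard re module, so their meaning can be tried out locally on plain strings. The sketch below uses re.fullmatch as a stand-in; that crandas regular expressions agree with Python's re on these operators is an assumption, not something this changelog states.

```python
import re

patterns = {
    "exactly 3": r"a{3}",
    "at least 2": r"a{2,}",
    "at most 2": r"a{,2}",
    "between 2 and 3": r"a{2,3}",
}
candidates = ["", "a", "aa", "aaa", "aaaa"]

# For every pattern, keep the candidates that match in full.
matches = {
    name: [s for s in candidates if re.fullmatch(pattern, s)]
    for name, pattern in patterns.items()
}
```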
Support was added to disable HTTP Keep-Alive in connections to the VDL server. This can help solve connection stability issues. Keep-Alive can be disabled in the connection file by setting keepalive = false. The setting can be overridden by the user by using the keepalive parameter of crandas.connect.
Add a sort_values function to CDataFrame, which sorts the dataframe according to a column. Example:
    cdf = cd.DataFrame({"a": [3, 1, 4, 5, 2], "b": [1, 2, 3, 4, 5]}, auto_bounds=True)
    cdf = cdf.sort_values("a")
Currently, sorting on strings is not supported.
Add support for groupby on multiple columns and on all non-nullable column types.
For example, this is now possible:
    cdf = cd.DataFrame({"a": ["foo", "bar", "foo", "bar"], "b": [1, 1, 1, 2]}, auto_bounds=True)
    tab = cdf.groupby(["a", "b"]).as_table()
    sorted(zip(tab["a"].open(), tab["b"].open()))
The parameter name of the groupby is renamed from col to cols to reflect these changes. Currently, a maximum of around 100 000 unique values is supported. Above that, the groupby will fail and give an error message. Note that this is the number of unique values. The number of rows can be significantly higher as long as there are fewer than 100 000 different values in the groupby column(s). Furthermore, a consequence of the new implementation is that the output is not order-stable anymore but random.
Add k-nearest neighbors functionality. This allows the target value of a new data point to be predicted based on the existing data using its k nearest neighbors. Example:
    import crandas as cd
    from crandas.crlearn.neighbors import KNeighborsRegressor
    X_train = cd.DataFrame({"input": [0, 1, 2, 3]}, auto_bounds=True)
    y_train = cd.DataFrame({"output": [0, 0, 1, 1]}, auto_bounds=True)
    X_test = cd.DataFrame({"input": [1]}, auto_bounds=True)
    neigh = KNeighborsRegressor(n_neighbors=3)
    neigh.fit(X_train, y_train)
    neigh.predict_value(X_test)
For more information, see crandas.crlearn.neighbors.KNeighborsRegressor.
Add a new aggregator crandas.groupby.any that takes any value from the set of values and is faster than crandas.groupby.max / crandas.groupby.min.
In the HTTP connection to the VDL server, use retries for certain HTTP requests to improve robustness.
Add a created property to dataframes and other objects indicating the date and time when they were uploaded or computed.
Handle cancellation of a query by raising a QueryInterruptedError. This replaces the previous behaviour of returning None and printing “Computation cancelled”. In IPython, the “Computation cancelled” message is still shown.
In the progress bar for long-running computations, show “no estimate available yet” as long as progress is at 0% (instead of a more cryptic notation).
Add functionality to list uploads to the VDL. For more information, see crandas.stateobject.list_uploads and crandas.stateobject.get_upload_handles.
1.8.1
Crandas fixes
crandas.get_table() now ensures connect() is called first.
Fix upload and decoding of positive numbers of 64 bits. In crandas, trying to upload and download numbers in the range R = [2^{63}, 2^{64} - 1] would previously fail. We fix this issue by mimicking pandas behavior. That is, a number in the range R is returned as an np.uint64. Secondly, w.r.t. uploading, np.uint64, np.uint32, and np.uint16 are now recognized as integers.
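The range R is exactly the set of values that need an unsigned 64-bit type: they no longer fit in a signed 64-bit integer but still fit in 64 bits. The standard-library sketch below checks this at the boundaries; the helper name fits is hypothetical, and the snippet illustrates the number ranges only, not crandas' encoding.

```python
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def fits(value, signed):
    """Return True if value can be stored in 8 bytes with the given signedness."""
    try:
        value.to_bytes(8, "little", signed=signed)
    except OverflowError:
        return False
    return True


boundary_values = [INT64_MAX, 2**63, UINT64_MAX, 2**64]
# Maps each value to (fits in int64, fits in uint64).
report = {v: (fits(v, signed=True), fits(v, signed=False)) for v in boundary_values}
```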
1.8.0
Major new features include:
Support for bigger (96 bit) integers
Progress bars for running queries and the possibility of cancelling running queries
Memory usage improvements (client & server)
Null value (missing values) support for all column types
Searching strings using regular expressions
Added a date column type
New features
Support for columns with bigger (96 bit) integers
Just like in the previous version, integers have the ctype int. When specifying the ctype, minimum and maximum bounds for the values can be supplied using the min and max parameters, e.g. int[min=0, max=1000]. Bounds (strictly) between -2^95 and 2^95 are now supported.
For example, to upload a column "col": [1, 2, 3, 4] as an int, use the following ctype spec, as before:
    table = cd.DataFrame({"col": [1, 2, 3, 4]}, ctype={"col": "int[min=1,max=4]"})
To force usage of a particular modulus, the integer ctype accepts the keyword argument modulus, which can be set to one of the moduli that are hardcoded in crandas.moduli. For example, to force usage of large integers one can run:
    from crandas.moduli import moduli
    table = cd.DataFrame({"col": [1, 2, 3, 4]}, ctype={"col": f"int[min=1,max=4,modulus={moduli[128]}]"})
Notes:
  crandas will automatically switch to int[modulus={moduli[128]}] if the (derived) bounds do not fit in an int32.
  crandas will throw an error if the bounds do not fit in an int96.
We refer to 32-bit integer columns as F64, and 96-bit integer columns as F128, because they are internally represented as 64-bit and 128-bit numbers, respectively, to account for a necessary security margin.
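The selection rule described in the notes can be sketched as a small decision function. This is an illustration under stated assumptions, not crandas' implementation: "fits in an int32" is read as the usual range [-2^31, 2^31 - 1], the 96-bit limit is read as bounds strictly between -2^95 and 2^95 as stated above, and the function name choose_representation is hypothetical.

```python
def choose_representation(lower, upper):
    """Pick the internal representation name for integer bounds [lower, upper]."""
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound")
    # Assumed int32 range: bounds that fit here use the smaller representation.
    if -(2**31) <= lower and upper <= 2**31 - 1:
        return "F64"
    # Bounds strictly between -2^95 and 2^95 use the larger representation.
    if -(2**95) < lower and upper < 2**95:
        return "F128"
    raise ValueError("bounds do not fit in a 96-bit integer")


small = choose_representation(1, 4)
large = choose_representation(0, 2**31)
```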
Supported features for large integers:
  Basic binary arithmetic (+, -, *, ==, <, >, <=, >=) between any two integer columns
  Groupby and filter on large integers
  Unary functions on large integer columns, such as mean(), var(), sum(), ...
  if_else where the 3 arguments guard, ifval, elseval may be any integer column
  Conversion from 32-bit integer columns to large integer columns via astype and vice versa
  Vertical concatenation of integer columns based on different moduli
  Performing a join on columns based on different moduli
Current limitations:
  We do not yet support string conversion to large integers
  json_to_val only allows integers up to int32 for now
  IntegerList is only defined over F64 for now
Changes:
  base.py: deprecated session.modulus
  crandas.py: classes Col and ReturnValue now also present the modulus
  ctypes.py: added support to encode/decode integers of 128 bits; made ctype class decoding modulus dependent
  input.py: mask and unmask are now dependent on the modulus
  placeholders.py: class Masker now also contains a modulus
  NEW FILE moduli.py: containing the default moduli for F64 as well as F128.
Searching strings and regular expressions
To search a string column for a particular substring, use the CSeries.contains function:
    table = cd.DataFrame({"col": ["this", "is", "a", "text", "column"]})
    only_is_rows = table["col"].contains("is")
    table[only_is_rows].open()
Regular expressions are also supported, using the new CSeries.fullmatch function:
    import crandas as cd
    import crandas.re
    table = cd.DataFrame({"col": ["this", "is", "a", "text", "column"]})
    starts_with_t = table["col"].fullmatch(cd.re.Re("t.*"))
    table[starts_with_t].open()
Regular expressions support the following operations:
  |: union
  *: Kleene star (zero or more)
  +: one or more
  ?: zero or one
  .: any character (note that this also matches non-printable characters)
  (, ): regexp grouping
  [...]: set of characters (including character ranges, e.g., [A-Za-z])
  \\d: digits (equivalent to [0-9])
  \\s: whitespace (equivalent to [\\\\ \\t\\n\\r\\f\\v])
  \\w: alphanumeric and underscore (equivalent to [a-zA-Z0-9_])
  (?1), (?2), …: substring (given as additional argument to CSeries.fullmatch())
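The difference between a substring search and a full match can be tried out on plain Python strings. The sketch below uses the in operator and re.fullmatch from the standard library as stand-ins for CSeries.contains and CSeries.fullmatch; it is an analogy on unencrypted data, and small differences may exist (for instance, Python's "." does not match a newline by default, whereas the list above says "." matches any character).

```python
import re

column = ["this", "is", "a", "text", "column"]

# Substring search: keep rows that contain "is" anywhere.
contains_is = [s for s in column if "is" in s]

# Full match: the whole string must match the pattern "t.*".
starts_with_t = [s for s in column if re.fullmatch(r"t.*", s)]

# A character class shorthand: one or more digits and nothing else.
digits_only = [s for s in ["2024", "20a4", ""] if re.fullmatch(r"\d+", s)]
```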
Regular expressions are represented by the class crandas.re.Re. It uses pyformlang’s functionality under the hood.
Efficient text operations for ASCII strings
The varchar ctype now has an ASCII mode for increased efficiency with strings that contain only ASCII characters (no “special” characters; all codepoints <= 127). Before this change, we only supported general Unicode strings. Certain operations (in particular, comparison, searching, and regular expression matching) are more efficient for ASCII strings.
By default, crandas autodetects whether or not the more efficient ASCII mode can be used. This information (whether or not ASCII mode is used) becomes part of the public metadata of the column, and crandas will give a ColumnBoundDerivedWarning to indicate that the column metadata is derived from the data in the column, unless auto_bounds is set to True.
Instead of auto-detection, it is also possible to explicitly specify the ctype varchar[ascii] or varchar[unicode], e.g.:
    import crandas as cd
    # ASCII autodetected: efficient operations available; warning given
    cdf = cd.DataFrame({"a": ["string"]})
    # Unicode autodetected: efficient operations not available; warning given
    cdf = cd.DataFrame({"a": ["stri\U0001F600ng"]})
    # ASCII annotated; efficient operations available; no warning given
    cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[ascii]"})
    # Unicode annotated; efficient operations not available; no warning given
    cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[unicode]"})
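Before uploading, one can check locally which mode auto-detection is likely to choose. The sketch below applies the criterion stated above (all codepoints <= 127) with Python's str.isascii; the function name detect_varchar_mode is hypothetical, and that crandas' detection is exactly this check is an assumption.

```python
def detect_varchar_mode(strings):
    """Return the varchar ctype that fits the data: ASCII only if every string is ASCII."""
    if all(s.isascii() for s in strings):
        return "varchar[ascii]"
    return "varchar[unicode]"


plain = detect_varchar_mode(["string"])
with_emoji = detect_varchar_mode(["string", "stri\U0001F600ng"])
```

Passing the detected value explicitly as the ctype keeps the choice visible in the script instead of leaving it to be derived from the data.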
Running computations can now be cancelled
Locally aborting a computation (e.g. Ctrl+C) will now cause it to be cancelled on the server as well.
Rename crandas.query to crandas.command to be consistent with the server-side implementation and to differentiate it from the new crandas.queries module.
Add module crandas.queries, providing a client-side implementation of the task-oriented VDL query API, and use this for all queries performed via vdl_query. To perform queries, a block-then-poll strategy is used: first, a blocking query with a timeout of 5 seconds is performed; if the result is not ready by then, status update polls are done at a 1 second interval.
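The block-then-poll strategy can be sketched locally with a thread pool standing in for the server. This is a generic illustration of the pattern, not the crandas.queries code: the function block_then_poll and the worker slow_square are hypothetical, and the demo uses shortened timeouts so it finishes quickly (the defaults mirror the 5 second and 1 second values mentioned above).

```python
import concurrent.futures
import time


def block_then_poll(future, block_timeout=5.0, poll_interval=1.0):
    """Wait for future: block first, then poll. Returns (result, number_of_polls)."""
    try:
        # Phase 1: a single blocking wait with a timeout.
        return future.result(timeout=block_timeout), 0
    except concurrent.futures.TimeoutError:
        pass
    # Phase 2: the result was not ready in time, so check the status periodically.
    polls = 0
    while not future.done():
        polls += 1
        time.sleep(poll_interval)
    return future.result(), polls


def slow_square(x, delay):
    time.sleep(delay)
    return x * x


with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    fast = block_then_poll(pool.submit(slow_square, 3, 0.0), block_timeout=0.2, poll_interval=0.05)
    slow = block_then_poll(pool.submit(slow_square, 4, 0.5), block_timeout=0.1, poll_interval=0.05)
```

Short computations return from the blocking phase without any polling overhead, while long ones fall back to periodic status checks that can also drive a progress display.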
All column types now support missing values
All ctypes now support a nullable flag, indicating that values may be missing. It may also be specified using a question mark, e.g. varchar?.
Progress reporting for long-running queries
Queries that take at least 5 seconds now result in a progress bar being displayed that estimates the progress of the computation.
To enable this for Jupyter notebooks, note that crandas should be installed with the notebook dependency flag, see below.
Various memory improvements for both server and client
Large data uploads and downloads are now automatically chunked
Uploads are processed in batches of size crandas.ctypes.ENCODING_CHUNK_SIZE.
Added a date column type
Dates can now be encoded using the date ctype.
Dates are limited to the range 1901/01/01 - 2099/12/31 for leap year reasons
Ability to subtract two dates to get the number of days, and to add days to a date
All comparison operators apply to dates
Created functions for year, month, day and weekday
Able to group over dates, merge and filter
New ctype DateCtype converts strings (through pd.to_datetime) and python dates (datetime.date, datetime64 and pd.Timestamp) into crandas dates
Helper subclass of CSeries, _DT, allows for pandas-style calling of date retrieval functions (col.dt.year) and standard calls (col.year).
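The changelog does not spell out the "leap year reasons" behind the supported date range. One plausible reading, offered here as an assumption, is that between 1901 and 2099 every year divisible by 4 is a leap year, so the century exceptions (1900 and 2100) never have to be handled. The standard-library sketch below checks that property and shows the plain-Python counterparts of date subtraction and adding days.

```python
import calendar
from datetime import date, timedelta

supported_years = range(1901, 2100)

# Within the supported range, "divisible by 4" is the complete leap year rule.
simple_rule_holds = all(calendar.isleap(y) == (y % 4 == 0) for y in supported_years)

# Just outside the range, the simple rule breaks down.
exceptions_outside = [y for y in (1900, 2100) if y % 4 == 0 and not calendar.isleap(y)]

# Subtracting two dates gives a number of days; adding days gives a date.
days_between = (date(2024, 3, 1) - date(2024, 2, 1)).days
shifted = date(2023, 12, 31) + timedelta(days=1)
```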
Crandas
New dependencies: tqdm and pyformlang
New dependency flag: notebook, for features related to Jupyter notebooks. Use pip install crandas[notebook] to install these.
Dependency urllib3 is updated to ensure ‘assert_hostname = False’ works as expected
Documentation updates
Recording or loading a new script when another script is already active no longer gives an error; a warning message is printed instead.
Support with_threshold for aggregation
This adds support for e.g. table["column"].with_threshold(10).sum(). Before this change, with_threshold() was only supported for filtering operations, e.g. table[filter.with_threshold(5)], and not for aggregation operations (min, max, sum, etc.).
Note that the alternative that worked before, table["column"].sum(threshold=5), is still supported, for both aggregation and filtering operations.
Minor change: supplying both with_threshold() and a threshold argument now raises a ValueError instead of a TypeError when these are different.
Implement setter for base_path
The crandas Session object now supports setting base_path to either a string, a Path, or None. Retrieving the property will always return a Path.
Fix problem where calling size() on a groupby object would fail for int32 columns
Improved message for auto-determined bounds
Collect all auto_bounds warnings from a data upload into a single warning message
Allow setting auto_bounds globally in crandas.base.session