Working with binary data

Crandas can store and manipulate binary data as bytes. In general, strings are more flexible, but bytes are more efficient, and some operations are only possible on bytes. For example, bytes columns support bitwise operations and cryptographic functions such as encryption and hashing. Helper functions are also available for converting between bytes and strings.

A bytes column can be uploaded using the bytes datatype in Python. For example, all the following columns contain the same data:

import crandas as cd

# Three equivalent ways to build the same bytes values
byte_data = [b"AAP", b"Noot", b"Mies"]
encoded_data = ["AAP".encode(), "Noot".encode(), "Mies".encode()]
hex_data = [bytes.fromhex("414150"), bytes.fromhex("4e6f6f74"), bytes.fromhex("4d696573")]

df = cd.DataFrame({"bytes": byte_data, "encoded": encoded_data, "hex": hex_data})
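
As a quick local check that needs no engine connection, plain Python can confirm that the three lists hold identical values before they are uploaded. This is only a sketch of the equivalence and does not touch crandas:

```python
# Build the same three values from literals, encoded strings and hex strings
byte_data = [b"AAP", b"Noot", b"Mies"]
encoded_data = [s.encode() for s in ["AAP", "Noot", "Mies"]]
hex_data = [bytes.fromhex(h) for h in ["414150", "4e6f6f74", "4d696573"]]

# Chained comparison: true only if all three lists are element-wise equal
all_equal = byte_data == encoded_data == hex_data
print(all_equal)
```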

Bitwise operators

The bitwise operators &, |, ^ and ~ are supported on bytes columns. Each applies the corresponding operation to every underlying bit individually. For the binary operators (&, | and ^), both operands must have the same length.
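
To make the semantics concrete, the following plain-Python sketch mimics what these operators do to a single pair of byte strings. The helpers `bitwise` and `invert` are hypothetical names used for illustration only and are not part of crandas; on a bytes column, the operators would presumably be applied to the columns themselves.

```python
import operator

def bitwise(op, a: bytes, b: bytes) -> bytes:
    """Apply a binary bitwise operator byte by byte (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    return bytes(op(x, y) for x, y in zip(a, b))

def invert(a: bytes) -> bytes:
    """Flip every bit of a byte string (hypothetical helper)."""
    return bytes(x ^ 0xFF for x in a)

a = bytes.fromhex("f0f0")
b = bytes.fromhex("3c3c")

and_result = bitwise(operator.and_, a, b)  # 0xf0 & 0x3c = 0x30 per byte
or_result = bitwise(operator.or_, a, b)    # 0xf0 | 0x3c = 0xfc per byte
xor_result = bitwise(operator.xor, a, b)   # 0xf0 ^ 0x3c = 0xcc per byte
not_result = invert(a)                     # ~0xf0 = 0x0f per byte
```

The length check mirrors the same-length requirement on the binary operators; the unary ~ has no such constraint.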

Cryptographic functions

Beyond the basic operations, bytes columns can be used to perform encryption and hashing.

Encryption

To encrypt data directly using a block cipher, the crandas.crypto.cipher module can be used as follows:

import crandas.crypto.cipher as cipher

tab_key = cd.DataFrame({"key": [bytes.fromhex("000102030405060708090a0b0c0d0e0f")]})
tab_data = cd.DataFrame({"data": [bytes.fromhex("00112233445566778899aabbccddeeff"), bytes.fromhex("ffeeddccbbaa99887766554433221100")]})

# Encryption
aes_128 = cipher.AES_128(tab_key["key"].as_value()) # Currently AES-128, AES-192 and AES-256 are supported
aes_128.encrypt(tab_data["data"])

# Decryption is currently not supported
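
The three AES variants differ in key size: 16, 24 and 32 bytes for AES-128, AES-192 and AES-256 respectively, while the block size is always 16 bytes. The sketch below checks a key's length locally before it is uploaded. The helper `aes_variant` is a hypothetical name for illustration, and whether crandas itself validates key or data lengths in this way is an assumption not confirmed here.

```python
# Key sizes in bytes for the three AES variants (fixed by the AES standard)
AES_KEY_BYTES = {16: "AES-128", 24: "AES-192", 32: "AES-256"}

def aes_variant(key: bytes) -> str:
    """Hypothetical helper: name the AES variant matching a key's length."""
    try:
        return AES_KEY_BYTES[len(key)]
    except KeyError:
        raise ValueError(f"unsupported AES key length: {len(key)} bytes") from None

key = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
variant = aes_variant(key)  # the key is 16 bytes long
print(variant)
```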

Hashing

To hash data, either with a hash function directly or in HMAC mode, the crandas.crypto.hash module can be used:

import crandas.crypto.hash as hash

tab_key = cd.DataFrame({"key": [bytes.fromhex("bcb25f81807bcb5995c2f663eaeb02f1248de8f3")]})
tab_data = cd.DataFrame({"data": [b"John Doe", b"Foo Bar"]})

# Regular hash
h = hash.RIPEMD_160 # Currently only RIPEMD-160 is supported
h.digest(tab_data["data"])

# HMAC
h = hash.HMAC_RIPEMD_160(tab_key["key"].as_value())
res = h.digest(tab_data["data"])
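
The difference between a regular hash and an HMAC can be illustrated locally with Python's standard library. Note that this sketch swaps in SHA-256 for RIPEMD-160, because RIPEMD-160 is not guaranteed to be available in every hashlib build; it shows the general idea only and does not reproduce the values crandas would compute.

```python
import hashlib
import hmac

key = bytes.fromhex("bcb25f81807bcb5995c2f663eaeb02f1248de8f3")
other_key = bytes(20)  # a different 20-byte key (all zeros), for comparison
messages = [b"John Doe", b"Foo Bar"]

# Regular hash: the digest depends only on the message
plain = [hashlib.sha256(m).digest() for m in messages]

# HMAC: the digest depends on both the secret key and the message
tags = [hmac.new(key, m, hashlib.sha256).digest() for m in messages]
other_tags = [hmac.new(other_key, m, hashlib.sha256).digest() for m in messages]

# Same messages, different keys: the HMAC values differ
keys_matter = tags != other_tags
print(keys_matter)
```

Anyone can recompute a regular hash from the data alone, whereas reproducing an HMAC value also requires the key.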

Conversion from and to strings

Various conversions between bytes and strings are possible. ASCII strings can be encoded as bytes using string_col.encode(). Conversion from and to lowercase hexadecimal strings is done with string_col.from_hex() and bytes_col.to_hex(). Finally, base64 decoding and encoding are supported via string_col.b64decode() and bytes_col.b64encode().
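
For reference, the same conversions on a single value look as follows in plain Python. This sketch uses only the standard library, not the crandas column methods, and is meant to show what each conversion produces:

```python
import base64

text = "Noot"

raw = text.encode("ascii")          # ASCII string -> bytes
hex_text = raw.hex()                # bytes -> lowercase hex string, "4e6f6f74"
from_hex = bytes.fromhex(hex_text)  # hex string -> bytes

b64_text = base64.b64encode(raw).decode("ascii")  # bytes -> base64 string
from_b64 = base64.b64decode(b64_text)             # base64 string -> bytes
```

Both round trips return the original bytes, so `from_hex` and `from_b64` equal `raw`.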

While strings are generally more powerful than bytes, using bytes is considerably more efficient in both memory and computation. Whenever a table contains string data that does not require string-specific functionality, it may make sense to convert it to bytes before uploading it to the engine.

Now that we have seen what can be done with bytes, the next section covers how to work with dates in crandas.