Working with binary data
Crandas offers the ability to store and manipulate binary data as
bytes. In general, strings are more flexible, but bytes are more
efficient, and some operations are only possible on bytes. For example,
bytes columns support bitwise operations and cryptographic
functions such as encryption and hashing. Helper functions are also
available for converting between bytes and strings.
A bytes column can be uploaded using the bytes datatype in Python. For example, all the following columns contain the same data:
import crandas as cd
# Three equivalent ways of writing the same bytes values
byte_data = [b"AAP", b"Noot", b"Mies"]
encoded_data = ["AAP".encode(), "Noot".encode(), "Mies".encode()]
hex_data = [
    bytes.fromhex("414150"),
    bytes.fromhex("4e6f6f74"),
    bytes.fromhex("4d696573"),
]
df = cd.DataFrame({"bytes": byte_data, "encoded": encoded_data, "hex": hex_data})
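As a quick local check that needs no crandas connection, the three ways of writing the values above can be compared in plain Python. This sketch only illustrates that the representations are byte-for-byte identical before upload; it does not touch the engine.

```python
# Three ways of writing the same bytes values in plain Python.
byte_data = [b"AAP", b"Noot", b"Mies"]
encoded_data = [s.encode() for s in ["AAP", "Noot", "Mies"]]  # UTF-8 by default
hex_data = [bytes.fromhex(h) for h in ["414150", "4e6f6f74", "4d696573"]]

# Chained comparison: True only if all three lists hold identical values.
all_equal = byte_data == encoded_data == hex_data
print(all_equal)
```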
Bitwise operators
The bitwise operators &, |, ^ and ~ are supported on bytes
columns. They apply the specified operation to each underlying bit
individually. For the binary operators (&, | and ^), both operands must
have the same length.
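To make the per-bit semantics concrete, here is a minimal plain-Python sketch of what these operators compute for a single pair of values. The helper names `bitwise` and `invert` are hypothetical and exist only for this illustration; they are not part of crandas, which applies the operators to whole columns.

```python
from operator import and_, or_, xor


def bitwise(op, a: bytes, b: bytes) -> bytes:
    """Apply a binary bitwise operator to two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("both operands need to have the same length")
    return bytes(op(x, y) for x, y in zip(a, b))


def invert(a: bytes) -> bytes:
    """Flip every bit; masking with 0xFF keeps each result in 0..255."""
    return bytes(~x & 0xFF for x in a)


a = bytes.fromhex("f0f0")
b = bytes.fromhex("3c3c")

and_result = bitwise(and_, a, b)
or_result = bitwise(or_, a, b)
xor_result = bitwise(xor, a, b)
not_result = invert(a)
print(and_result.hex(), or_result.hex(), xor_result.hex(), not_result.hex())
```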
Cryptographic functions
Beyond the basic operations, bytes columns can be used to perform encryption and hashing.
Encryption
To encrypt data directly using a block cipher, use the
crypto.cipher module as follows:
import crandas.crypto.cipher as cipher
tab_key = cd.DataFrame({"key": [bytes.fromhex("000102030405060708090a0b0c0d0e0f")]})
tab_data = cd.DataFrame({"data": [bytes.fromhex("00112233445566778899aabbccddeeff"), bytes.fromhex("ffeeddccbbaa99887766554433221100")]})
# Encryption
aes_128 = cipher.AES_128(tab_key["key"].as_value()) # Currently AES-128, AES-192 and AES-256 are supported
aes_128.encrypt(tab_data["data"])
# Decryption is currently not supported
Hashing
To hash data, use the crypto.hash module:
import crandas.crypto.hash as hash
tab_data = cd.DataFrame({"data": [b"John Doe", b"Foo Bar"]})
# Regular hash. Currently supported:
# - SHA-3: SHA3-224, SHA3-256, SHA3-384, SHA3-512
# - RIPEMD-160
h = hash.SHA3_256
h.digest(tab_data["data"])
# Extendable-output functions (XOF), which allow for outputs of a custom length:
# - SHAKE: SHAKE128, SHAKE256
# - TurboSHAKE: TurboSHAKE128, TurboSHAKE256
h = hash.SHAKE128
h.digest(tab_data["data"], 32)
h.digest(tab_data["data"], 64)
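The difference between a regular hash and an XOF can be tried locally with Python's standard `hashlib`, as a sketch. Note the assumptions: `hashlib` measures the requested SHAKE output length in bytes, and the text above does not state which unit the second argument of the crandas `digest` call uses, nor is this meant as a statement about how crandas computes its results.

```python
import hashlib

data = [b"John Doe", b"Foo Bar"]

# Regular hash: SHA3-256 always produces a 32-byte digest.
sha3_digests = [hashlib.sha3_256(m).digest() for m in data]

# XOF: SHAKE128 produces as many bytes as requested.
shake_32 = [hashlib.shake_128(m).digest(32) for m in data]
shake_64 = [hashlib.shake_128(m).digest(64) for m in data]

for short, long in zip(shake_32, shake_64):
    # A longer XOF output extends the shorter one for the same input.
    print(long[:32] == short)
```

Because a shorter XOF output is a prefix of a longer one, outputs of different lengths for the same input should not be treated as independent values.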
Keyed hashes are also supported. For the modern hash functions SHA-3 and (Turbo)SHAKE, the key can simply be prepended to the message:
tab_key = cd.DataFrame({"key": [bytes.fromhex("bcb25f81807bcb5995c2f663eaeb02f1248de8f3")]})
tab_data = cd.DataFrame({"data": [b"John Doe", b"Foo Bar"]})
# Keyed hash: prepend the key to the message (suitable for SHA-3 and (Turbo)SHAKE)
h = hash.TurboSHAKE128
h.digest(tab_key["key"].as_value() + tab_data["data"], 32)
h.digest(tab_key["key"].as_value() + tab_data["data"], 64)
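The key-prepending construction can be sketched locally with `hashlib`. Since TurboSHAKE is not available in the Python standard library, this example swaps in SHAKE128 as a stand-in; the helper name `keyed_shake128` is hypothetical, and the output length is in bytes as defined by `hashlib`.

```python
import hashlib

key = bytes.fromhex("bcb25f81807bcb5995c2f663eaeb02f1248de8f3")
data = [b"John Doe", b"Foo Bar"]


def keyed_shake128(key: bytes, message: bytes, length: int) -> bytes:
    """Keyed hash by prepending the key to the message (SHAKE128 stand-in)."""
    return hashlib.shake_128(key + message).digest(length)


tags = [keyed_shake128(key, m, 32) for m in data]

# The same messages under a different (all-zero) key give unrelated tags.
other_key = bytes(len(key))
other_tags = [keyed_shake128(other_key, m, 32) for m in data]
print(tags != other_tags)
```

With a fixed-length key, as here, the boundary between key and message is unambiguous; with variable-length keys one would normally also encode the key length.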
However, for older hash functions like RIPEMD-160, prepending the key is insecure because such functions are vulnerable to length-extension attacks, and the HMAC mode should be used instead. The HMAC mode is also supported for SHA-3, although it is less efficient than prepending the key to the message:
tab_key = cd.DataFrame({"key": [bytes.fromhex("bcb25f81807bcb5995c2f663eaeb02f1248de8f3")]})
tab_data = cd.DataFrame({"data": [b"John Doe", b"Foo Bar"]})
# HMAC necessary for RIPEMD-160
h = hash.HMAC_RIPEMD_160(tab_key["key"].as_value())
res = h.digest(tab_data["data"])
# HMAC can be used with SHA-3, but this is less efficient than prepending the key
h = hash.HMAC_SHA3_256(tab_key["key"].as_value())
res = h.digest(tab_data["data"])
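For comparison, HMAC can be computed locally with Python's standard `hmac` module. This sketch uses SHA3-256 only, because RIPEMD-160 is not guaranteed to be available in every `hashlib` build; the `verify` helper is hypothetical and shown only to illustrate how a tag would be checked.

```python
import hashlib
import hmac

key = bytes.fromhex("bcb25f81807bcb5995c2f663eaeb02f1248de8f3")
data = [b"John Doe", b"Foo Bar"]

# One HMAC-SHA3-256 tag per message.
tags = [hmac.new(key, m, hashlib.sha3_256).digest() for m in data]


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = hmac.new(key, message, hashlib.sha3_256).digest()
    return hmac.compare_digest(expected, tag)


print(verify(key, data[0], tags[0]))
print(verify(key, b"John Doe!", tags[0]))
```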
Conversion from and to strings
Various conversions between strings and bytes are possible. Strings
can be encoded as bytes using UTF-8 with string_col.encode().
Conversion to and from lowercase hexadecimal strings is done with
string_col.hex_to_bytes() and bytes_col.bytes_to_hex().
Finally, base64 decoding and encoding are supported with
string_col.b64decode() and bytes_col.b64encode().
While strings are generally more powerful than bytes, using bytes is considerably more efficient in both memory and computation. Whenever a table contains string data that does not require string-specific functionality, it may make sense to convert it to bytes before uploading it to the engine.
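The conversions described in this section have direct plain-Python counterparts, which can be handy for preparing data before upload. The sketch below uses only the standard library on a single value and is not the crandas column API.

```python
import base64

text = "Noot"

# UTF-8 encoding: string -> bytes
as_bytes = text.encode("utf-8")

# Hexadecimal: bytes -> lowercase hex string -> bytes
hex_string = as_bytes.hex()
from_hex = bytes.fromhex(hex_string)

# Base64: bytes -> base64 string -> bytes
b64_string = base64.b64encode(as_bytes).decode("ascii")
from_b64 = base64.b64decode(b64_string)

print(hex_string, b64_string)
```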
Generating random bytes
We can generate secret random bytes using the randbytes() function.
If you want to generate random integers instead, see the corresponding section of the documentation.
from crandas.random import randbytes
# Generate a table with a column of 10 random 16-byte strings
table["bytes"] = randbytes(16, num_rows=10).as_table()
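For local experimentation, a comparable list of random values can be produced with Python's standard `secrets` module. This is only a rough analogue under one important caveat: the values below are ordinary plaintext bytes on your own machine, whereas randbytes() generates secret values inside the engine.

```python
import secrets

NUM_ROWS = 10
NUM_BYTES = 16

# Ten random 16-byte values from the OS cryptographic random source.
rows = [secrets.token_bytes(NUM_BYTES) for _ in range(NUM_ROWS)]

for value in rows:
    print(value.hex())
```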
Now that we have seen how to work with bytes, the next section covers working with dates in crandas.