Working with text data

crandas provides functionality for working directly with string columns. This section covers how strings work in the VDL and what crandas can do with them. It then looks at a more efficient alternative for cases where string-specific functions are not needed.

Note

Currently we only support UTF-8 encoding for strings. This is the most common string encoding, but always make sure that your data is correctly encoded before uploading it to the VDL.
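
One way to check the encoding before uploading is to try decoding the raw bytes locally. The sketch below uses only the Python standard library; the helper name `is_valid_utf8` is hypothetical and is not part of crandas.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` decodes cleanly as UTF-8.

    Hypothetical local helper, not a crandas function.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text encoded in two different ways
utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("latin-1")

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False: the Latin-1 byte for "é" is not valid UTF-8
```

A check like this is typically run on the raw file contents, before they are turned into Python strings and passed on.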

When working with strings in crandas, the maximum length of the strings is important. All strings are padded to the maximum length, so every string in a column is stored at that length. crandas will derive the maximum length from the data, but you can also specify it when creating a CDataFrame.

Warning

If you specify a maximum length that is smaller than the length of your data, your data will be truncated to that length, potentially losing information.

To specify the maximum length of a string column, you can do the following:

import crandas as cd

date_data = ["2015-10-10", "2021-25-01", "2020-01-02", "2021-12-33"]

# The "year" column is limited to 4 characters, so its values are truncated
df = cd.DataFrame({"date": date_data, "year": date_data}, ctype={"year": "varchar[4]"})

In the above example, both columns are created from the same data, but the year column only contains the first four characters of each string (which conveniently happens to be the year).
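
Before choosing a maximum length, it can help to preview locally what would be kept and which values would lose characters. The following is a minimal plain-Python sketch that does not call crandas. It assumes, as described above, that truncation keeps the first characters of each string; the variable names are illustrative only.

```python
date_data = ["2015-10-10", "2021-25-01", "2020-01-02", "2021-12-33"]
max_length = 4  # the length we intend to use for the column

# Plain-Python preview of keeping only the first `max_length` characters
preview = [s[:max_length] for s in date_data]

# Values that are longer than `max_length` and would therefore lose characters
truncated = [s for s in date_data if len(s) > max_length]

print(preview)  # ['2015', '2021', '2020', '2021']
print(len(truncated), "of", len(date_data), "values would be truncated")
```

Here every value is longer than four characters, so all of them are shortened. That is intended in this example, but the same check can catch accidental truncation in other columns.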

String manipulation

Besides truncating strings, crandas can convert strings to lowercase with CSeries.lower(). This is useful when combining data from different sources, because it ensures that values differing only in capitalization are compatible.

data = cd.DataFrame({"col": ["AAP", "Noot", "Mies", "VuUR", "Vis"]})

data = data.assign(lower_col=data["col"].lower())
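
To see why lowercasing helps when combining sources, consider the following plain-Python sketch. It uses Python's built-in `str.lower()` on local lists rather than crandas, and the two source lists are made up for illustration.

```python
# Two hypothetical sources that spell the same values with different capitalization
source_a = ["AAP", "Noot", "Mies"]
source_b = ["aap", "NOOT", "vis"]

# Comparing the raw values finds no overlap
exact_matches = set(source_a) & set(source_b)

# Comparing the lowercased values finds the values the sources share
normalized_matches = {s.lower() for s in source_a} & {s.lower() for s in source_b}

print(sorted(exact_matches))       # []
print(sorted(normalized_matches))  # ['aap', 'noot']
```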

Strings and bytes

When a table contains string data that does not need any string functionality, it can make sense to convert it to bytes before uploading it to the VDL, as byte encoding is considerably more efficient.

To specify a list of bytes rather than strings, you would do the following:

string_data = ["AAP", "Noot", "Mies", "VuUR", "Vis"]

# This converts data to bytes
byte_data = [s.encode() for s in string_data]

df = cd.DataFrame({"strings": string_data, "bytes": byte_data})
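
As a small plain-Python illustration of the conversion itself (no crandas involved): `str.encode()` uses UTF-8 by default, and the resulting bytes can be decoded back to the original text if it is needed again later. Naming the encoding explicitly, as in this sketch, makes the round trip easier to follow.

```python
string_data = ["AAP", "Noot", "Mies", "VuUR", "Vis"]

# Encode each string to bytes, naming the encoding explicitly
byte_data = [s.encode("utf-8") for s in string_data]

# Decoding with the same encoding recovers the original strings
decoded = [b.decode("utf-8") for b in byte_data]

print(byte_data[0])            # b'AAP'
print(decoded == string_data)  # True
```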

While strings are more natural to work with, using byte columns makes a difference in speed and efficiency when you upload large tables.

Now that we have seen what we can do with strings, the next section looks at working with missing data.