.. _strings: Working with text data ####################### Crandas provides functionalities that work directly with string columns. In this section we will see how strings work in the VDL and what things we can do in crandas with them. Furthermore, we will explore more efficient alternatives that can be used when string-specific functions are not necessary. When working with strings in crandas, the maximum length of the strings is important. All strings are padded to the maximum length (i.e. all strings have the max length). Crandas will derive the maximum length from the data, but you can also specify it when creating a :class:`.CDataFrame`. .. warning:: If you specify a maximum length that is smaller than a your data, your data will be truncated to that length, potentially losing information. To specify the length of the string, you can do the following .. code:: python import crandas as cd # The first two characters represent the country id_data = ["MX6821389", "NL2950357", "NL5348942", "DE1104572"] df = cd.DataFrame( { "ids": id_data, "country": id_data}, ctype={"country": "varchar[2]"}) In the above example, both columns seemingly have the same data, but the ``id_names`` column, only contains the first two characters of the strings. .. _string_ascii: String encoding and efficiency ------------------------------ Crandas supports two types of strings: unicode and ASCII (meaning that all characters in the string have ``0<=ord(ch)<=127``). Certain operations (in particular, comparison, searching, and regular expression matching), are more efficient for ASCII strings. .. versionadded:: 1.8 Support for ASCII. By default, crandas autodetects whether or not the more efficient ASCII mode can be used. This information (whether or not ASCII mode is used) becomes part of the public metadata of the column, and crandas will give a ``ColumnBoundDerivedWarning`` to indicate that the column metadata is derived from the data in the column. See :ref:`type_detection` for more information. Instead of deriving the mode from the data, it is also possible to explicitly specify the ctype ``varchar[ascii]`` or ``varchar[unicode]``, e.g.: .. code:: python import crandas as cd # ASCII autodetected: efficient operations available; warning given cdf = cd.DataFrame({"a": ["string"]}) # Unicode autodetected: efficient operations not available; warning given cdf = cd.DataFrame({"a": ["stri\U0001F600ng"]}) # ASCII annotated; efficient operations available; no warning given cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[ascii]"}) # Unicode annotated; efficient operations not available; no warning given cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[unicode]"}) String manipulation ------------------- Besides truncating strings, crandas offers the possibility of converting a string into lowercase by using :meth:`.CSeries.lower`. This function will convert any string into lowercase, which can be useful when combining data from different sources, to ensure compatibility. .. code:: python data = cd.DataFrame({"col": ["AAP", "Noot", "Mies", "VuUR", "Vis"]}) data = data.assign(lower_col = data["col"].lower()) .. _matching: String search and regular expressions ------------------------------------- Crandas supports searching in text data via the :meth:`.CSeries.contains` function, that checks whether a given substring exists in a given string, and via the :meth:`.CSeries.fullmatch` function, that checks whether the string matches a given regular expression. For searching, the substring to search for can be given as a string (in which case the servers learn which string is searched for); or as a :class:`.crandas.CSeries` (in which case this information will remain hidden). In the latter case, the function :meth:`.CSeriesColRef.as_value` can be used to interpret a single-row column as a value that can be used in function definitions, e.g.: .. code:: python import crandas as cd cdf = cd.DataFrame({"haystack": ["abcdefghijk", "lmnopqrstuvw"]}) # Search; substring becomes known to servers needle = "vwx" cdf.assign(found=cdf["haystack"].contains(needle)).open() # Search; substring remains hidden from servers but same for all rows cneedle = cd.DataFrame({"needle": [needle]})["needle"].as_value() cdf.assign(needle=cneedle, found=lambda x: x.haystack.contains(x.needle)).open() # Private search (different substring per row) cneedles = cd.DataFrame({"needle": ["abcde", "klm"]}) cdf.assign(needle=cneedles["needle"], found=lambda x: x.haystack.contains(x.needle)).open() For regular expression matching, a regular expression can be given as documented in :class:`.re.Re`. The regular expression can contain references to other columns to require that the (the respetive row of the) given column should occur in the regular expression at the given place: .. code:: python import crandas as cd import crandas.re # Simple example cdf = cd.DataFrame({"a": ["has a!", "has bb!", "has xx?!"], "b": ["", "b", "xxx"], "c": ["a", "c", "c"]}) re = crandas.re.Re(r"has (a|bb).*") cdf["a"].fullmatch(re, cdf["b"], cdf["c"]).as_table().open() # Example with substrings via (?1) cdf = cd.DataFrame({"a": ["has a!", "has bb!", "has xx?!"], "b": ["", "bb", "xxx"], "c": ["a", "c", "c"]}) re = crandas.re.Re(r"has ((?1)|(?2)).*") cdf["a"].fullmatch(re, cdf["b"], cdf["c"]).as_table().open() Strings and bytes ---------------------------- Whenever there is string data in a table that does not require string functionalities, it might make sense to convert it to bytes before uploading it to the VDL, as byte encoding is considerably more efficient. To specify a list of bytes rather than strings, you would do the following: .. code:: python string_data = ["AAP", "Noot", "Mies", "VuUR", "Vis"] # This converts data to bytes byte_data = [s.encode() for s in string_data] df = cd.DataFrame({"strings": string_data, "bytes": byte_data}) While strings are more natural, when you upload large databases, using byte columns will make a difference in speed and efficiency. After learning the things we can do with strings, we will learn how to work with dates in the crandas in the :ref:`next section`.