.. _strings:

Working with text data
#######################

Crandas provides functionalities to work directly with string columns (i.e. columns containing text). In this section we will see how to work with text and what functionalities are available. Furthermore, we will explore more efficient alternatives that can be used when string-specific functions are not necessary.

When working with strings in crandas, the maximum length of the strings is important.
Because the length of a string is kept private, all strings are stored on the server as padded strings of the maximum length, and this maximum length is hence known to all parties.
When uploading a string column, it is recommended to specify this length manually depending on your data model, to avoid revealing the actual maximum length of your strings [1]_.
If you don't, crandas will automatically derive the maximum length from your data.

To specify the length of the string, you can do the following:

.. code:: python

    import crandas as cd

    # 'strings' is a column with maximum length 6
    table = cd.DataFrame({"strings": ["a", "bunch", "of", "words"]}, ctype={"strings": "varchar[6]"})

    # here, 'strings' is a column with maximum length 5 based on the data;
    # crandas will warn the user that the maximum length is revealed
    automatic = cd.DataFrame({"strings": ["a", "bunch", "of", "words"]})

.. warning::

    If you specify a maximum length that is smaller than a your data, your data will be truncated to that length, potentially losing information.

.. [1] The maximum length of your strings may reveal confidential information. For example, suppose another party has uploaded a list of names of persons, and you have a person of interest with a long name that may or may not be present in the dataset. If the length of this name is longer than the length of all person names, you know that this person is not present in the dataset. To avoid this issue, prior to uploading data, the other party should consider a reasonable maximum length of names of persons, and supply this length manually.


.. _string_ascii:

String encoding and efficiency
------------------------------

Crandas supports two types of strings: unicode and ASCII (meaning that all characters in the string have ``0 <= ord(ch) <= 127``).
Certain operations (in particular, comparison, searching, and regular expression matching), are more efficient for ASCII strings.

.. versionadded:: 1.8

    Support for ASCII.

By default, crandas autodetects whether or not the more efficient ASCII mode can be used.
This information (whether or not ASCII mode is used) becomes part of the public metadata of the column,
and crandas will give a ``ColumnBoundDerivedWarning`` to indicate that the column metadata is derived
from the data in the column. See :ref:`type_detection` for more information.

Instead of deriving the mode from the data, it is also possible to explicitly specify the ctype ``varchar[ascii]`` or ``varchar[unicode]``, e.g.:

.. code:: python

    import crandas as cd

    # ASCII autodetected: efficient operations available; warning given
    cdf = cd.DataFrame({"a": ["string"]})

    # Unicode autodetected: efficient operations not available; warning given
    cdf = cd.DataFrame({"a": ["stri\U0001F600ng"]})

    # ASCII annotated; efficient operations available; no warning given
    cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[ascii]"})

    # Unicode annotated; efficient operations not available; no warning given
    cdf = cd.DataFrame({"a": ["string"]}, ctype={"a": "varchar[unicode]"})

String manipulation
-------------------

Besides truncating strings, crandas offers the possibility of converting a string into lowercase or uppercase by using :meth:`.CSeries.lower` and :meth:`.CSeries.upper` respectively.
These functions will convert any string into lowercase/uppercase, which can be useful when combining data from different sources, to ensure compatibility.
Additionally, it is possible to transform characters in a particular location by using the ``indices`` argument.

.. code:: python

    import crandas as cd

    data = cd.DataFrame({"col": ["UPPER", "Capitalized", "MiXEd"]})

    data = data.assign(lower_col = data["col"].lower())
    data = data.assign(upper_col = data["col"].upper(indices=[1,3]))

It is also possible to substitute characters for other characters or even longer strings by using the :meth:`.Cseries.substitute` method.

.. code:: python

    table = cd.DataFrame({"a": ["PÆR", "á"]}, auto_bounds=True)
    # In this example, every instance of "á", "à" or "ä" is replaced by "a"
    table = table.assign(no_accents = table["a"].substitute({"a": ["á", "à", "ä"], "AE": ["Æ"]}, output_size=4))

    # We can also substitute one character for multiple, but we then need to determine the new string length
    # In this case, we know that there is only one appearance of Æ per string, so we make the size one longer
    table = table.assign(ae = table["no_accents"].substitute({"AE": ["Æ"]}, output_size=4))


In order to strip the leading and trailing spaces crandas offers the functionality :meth:`.Cseries.strip`.

.. code:: python

    import crandas as cd

    data = cd.DataFrame({"col": ["  left and right ", "    left", "right   ", "no effect"]})

    data = data.assign(stripped = data["col"].strip())

Searching for strings
---------------------

A common operation is to check whether a string is present in a table.
If the strings are unique, you can do a join.
This is likely the fastest method, and it keeps the search string private.

.. code:: python

    import crandas as cd

    table = cd.DataFrame({"s": ["a", "bunch", "of", "strings"]})
    search = cd.DataFrame({"needle": ["strings"]})

    # This table will either be empty or contain a single matching row
    cd.merge(table, search, left_on="s", right_on="needle")

This also works to do *private set intersection*, i.e. check which strings are present in two tables.

.. code:: python

    left = cd.DataFrame({"s": ["a", "bunch", "of", "strings"]})
    right = cd.DataFrame({"s": ["strings", "bunch"]})

    # This table will contain all strings that are present in both tables
    cd.merge(left, right, on="s")

If the strings are not unique, you can make the strings unique by using :meth:`CSeries.groupby`.

.. code:: python

    table = cd.DataFrame({"s": ["to", "be", "or", "not", "to", "be"]})
    unique_strings = table.groupby("s").as_table()
    search = cd.DataFrame({"needle": ["to"]})
    cd.merge(unique_strings, search, left_on="s", right_on="needle")

This does not work if ``table`` contains other columns that you also want in the search result.
In that case, if the single public search string, you can simply use the equality operator:

.. code:: python

    table = cd.DataFrame({"s": ["to", "be", "or", "not", "to", "be"]})
    # the string to be searched for will be public, as well as the number of matches
    matching = table[table["s"] == "not"]

This approach works a bit differently from the join.
Although the join is usually faster, the simple equality search may be preferable.
To use the latter approach while keeping the search string private, you can upload the search string as a table:

.. code:: python

    table = cd.DataFrame({"s": ["to", "be", "or", "not", "to", "be"]})
    search = cd.DataFrame({"needle": ["not"]})

    # this will still reveal the number of matches
    matching = table[table["s"] == search["needle"].as_value()]

To use this search approach, rather than the join on the unique values, together with keeping the number of matches private, can work like this:

.. code:: python

    table = cd.DataFrame({"s": ["to", "be", "or", "not", "to", "be"]})
    search = cd.DataFrame({"needle": ["to"]})
    matches = table["s"] == search["needle"].as_value()
    # the result is a table with either a 1 (at least 1 match found) or 0 (no match)
    (matches.as_table()[""].sum(mode="regular") > 0).open()


.. _matching:
Substring search and regular expressions
----------------------------------------

Crandas supports searching within text cells with the
:meth:`.CSeries.contains` function, that checks whether a given substring
exists in a given string, and via the :meth:`.CSeries.fullmatch` function, that
checks whether the string matches a given regular expression.

For searching, the substring to search for can be given as a string (in which case
the servers learn which string is searched for); or as a :class:`.crandas.CSeries`
(in which case this information will remain hidden). In the latter case, the
function :meth:`.CSeriesColRef.as_value` can be used to interpret a single-row
column as a value that can be used in function definitions, e.g.:

.. code:: python

    import crandas as cd

    cdf = cd.DataFrame({"haystack": ["abcdefghijk", "lmnopqrstuvw"]})

    # Search; substring becomes known to servers
    needle = "vwx"
    cdf.assign(found=cdf["haystack"].contains(needle)).open()

    # Search; substring remains hidden from servers but same for all rows
    cneedle = cd.DataFrame({"needle": [needle]})["needle"].as_value()
    cdf.assign(needle=cneedle, found=lambda x: x.haystack.contains(x.needle)).open()

    # Private search (different substring per row)
    cneedles = cd.DataFrame({"needle": ["abcde", "klm"]})
    cdf.assign(needle=cneedles["needle"], found=lambda x: x.haystack.contains(x.needle)).open()

For regular expression matching, a regular expression can be given as documented
in :class:`.re.Re`. The regular expression can contain references to other
columns to require that the (the respetive row of the) given column should occur
in the regular expression at the given place:

.. code:: python

    import crandas as cd
    import crandas.re

    # Simple example
    cdf = cd.DataFrame({"a": ["has a!", "has bb!", "has xx?!"],
                        "b": ["",  "b",  "xxx"],
                        "c": ["a", "c", "c"]})

    re = crandas.re.Re(r"has (a|bb).*")
    cdf["a"].fullmatch(re, cdf["b"], cdf["c"]).as_table().open()

    # Example with substrings via (?1)

    cdf = cd.DataFrame({"a": ["has a!", "has bb!", "has xx?!"],
                        "b": ["",  "bb",  "xxx"],
                        "c": ["a", "c", "c"]})

    re = crandas.re.Re(r"has ((?1)|(?2)).*")
    cdf["a"].fullmatch(re, cdf["b"], cdf["c"]).as_table().open()


Strings and bytes
----------------------------
Whenever there is string data in a table that does not require string functionalities, it might make sense to convert it to bytes before uploading it to the engine, as byte encoding is considerably more efficient.

To specify a list of bytes rather than strings, you would do the following:

.. code:: python

    string_data = ["AAP", "Noot", "Mies", "VuUR", "Vis"]

    # This converts data to bytes
    byte_data = [s.encode() for s in string_data]

    df = cd.DataFrame({"strings": string_data, "bytes": byte_data})

While strings are more natural, when you upload large databases, using byte columns will make a difference in speed and efficiency.

After learning the things we can do with strings, we will learn how to work with dates in the crandas in the :ref:`next section<datetime>`.