.. meta::
    :description: Approximate string matching and fuzzy string search on encrypted data with crandas.
    :keywords: crandas, Roseman Labs, User Guide, String similarity, String matching, edit distance

.. _fuzzy_string_matching:

Fuzzy string matching
#####################

Crandas provides functionality for computing the similarity of strings, allowing you to match and filter strings that are approximately the same but not necessarily identical. For example, it may be necessary to find whether a name occurs within a dataset, even in the presence of typos, spelling variants, or formatting differences, where exact equality and substring search are often too strict.

This section describes two functionalities in crandas that can be used for this purpose, namely:

- **Levenshtein distance**
- **Jaro-Winkler similarity**

Levenshtein distance
====================

The Levenshtein distance measures the number of character insertions, deletions, and substitutions required to transform one string into the other. To compute the Levenshtein distance between a public string and a column containing private strings, the following can be used:

.. code:: python

    import crandas as cd
    from crandas import string_metrics

    table = cd.DataFrame({"names": ["martha", "louise"]}, auto_bounds=True)
    distance = string_metrics.levenshtein_distance("marta", table["names"])

Opening the result will give the following table:

.. code:: python

    >>> distance.open()
    0    1
    1    6
    Name: , dtype: int64

As expected, this shows that the distance between "marta" and "martha" is only 1, accounting for the deleted character 'h'. The distance to "louise", on the other hand, is significantly larger.

It is also possible to compute the Levenshtein distance between strings of two private columns, using the following:

.. code:: python

    table1 = cd.DataFrame({"names": ["martha", "louise"]}, auto_bounds=True)
    table2 = cd.DataFrame({"names": ["marta", "louisse"]}, auto_bounds=True)
    distance = string_metrics.levenshtein_distance(table1["names"], table2["names"])

Opening the result will give the following table:

.. code:: python

    >>> distance.open()
    0    1
    1    1
    Name: , dtype: int64

Jaro-Winkler similarity
=======================

The Jaro-Winkler similarity measures the similarity of two strings based on several aspects, including swapped characters and a common prefix. It is a number between 0 and 1, normalized by the string lengths: a value close to 0 indicates a low similarity, and a value close to 1 indicates a near-match. To compute the Jaro-Winkler similarity between a public string and a column containing private strings, the following code can be used:

.. code:: python

    from crandas import string_metrics

    table = cd.DataFrame({"names": ["martha", "louise"]}, auto_bounds=True)
    similarity = string_metrics.jaro_winkler_similarity("marta", table["names"])

Opening the result will give the following table:

.. code:: python

    >>> similarity.open()
    0    0.966666
    1    0.000000
    Name: , dtype: float64

This shows a high similarity between "marta" and "martha", as expected, while the similarity with "louise" is 0.

It is also possible to compute the Jaro-Winkler similarity of strings from two private columns, using the following code:

.. code:: python

    table1 = cd.DataFrame({"names": ["martha", "louise"]}, auto_bounds=True)
    table2 = cd.DataFrame({"names": ["marta", "louisse"]}, auto_bounds=True)
    similarity = string_metrics.jaro_winkler_similarity(table1["names"], table2["names"])

Opening the result will give the following table:

.. code:: python

    >>> similarity.open()
    0    0.966666
    1    0.971428
    Name: , dtype: float64

In addition to the Jaro-Winkler similarity, crandas also provides the Jaro similarity.
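For intuition about the scores above, the similarities can be reproduced with a plain-Python reference sketch of the Jaro and Jaro-Winkler metrics. This is purely illustrative of how the metrics are defined, using the textbook formulation with scaling factor ``p = 0.1`` and a four-character prefix cap; crandas computes these metrics securely on encrypted data, and its exact parameters may differ:

.. code:: python

    # Reference (insecure, plain-Python) implementation of the Jaro and
    # Jaro-Winkler similarities, for illustration only: crandas evaluates
    # these metrics without revealing the input strings.
    def jaro(s1: str, s2: str) -> float:
        if s1 == s2:
            return 1.0
        # Characters match if equal and at most `window` positions apart
        window = max(0, max(len(s1), len(s2)) // 2 - 1)
        used = [False] * len(s2)
        matched1 = []
        for i, c in enumerate(s1):
            for j in range(max(0, i - window), min(len(s2), i + window + 1)):
                if not used[j] and s2[j] == c:
                    used[j] = True
                    matched1.append(c)
                    break
        matched2 = [c for j, c in enumerate(s2) if used[j]]
        m = len(matched1)
        if m == 0:
            return 0.0
        # Half the number of matched characters appearing in a different order
        t = sum(a != b for a, b in zip(matched1, matched2)) // 2
        return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

    def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
        # Winkler's modification: boost the score for a common prefix (max 4)
        j = jaro(s1, s2)
        prefix = 0
        for a, b in zip(s1[:4], s2[:4]):
            if a != b:
                break
            prefix += 1
        return j + prefix * p * (1 - j)

    print(round(jaro_winkler("marta", "martha"), 6))    # 0.966667
    print(round(jaro_winkler("louise", "louisse"), 6))  # 0.971429
    print(jaro_winkler("marta", "louise"))              # 0.0

Up to rounding, these are the same values that appear in the opened tables above.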
The Jaro similarity is simpler than the Jaro-Winkler similarity, as the former does not account for a shared prefix. In practice, the Jaro-Winkler similarity is more commonly used. The Jaro similarity can be accessed through the :func:`jaro_similarity<.crandas.string_metrics.jaro_similarity>` function, which works in a manner analogous to :func:`jaro_winkler_similarity<.crandas.string_metrics.jaro_winkler_similarity>`.

The :func:`edit_distance<.crandas.string_metrics.edit_distance>` convenience function dispatches to the specific functions for the Levenshtein distance, Jaro similarity, and Jaro-Winkler similarity, depending on the ``distance_type`` parameter.

Approximate searches
====================

The Levenshtein distance and Jaro-Winkler similarity can be used for approximate string matching, including counting the number of matches, as well as filtering.

Public search string
--------------------

For instance, to determine whether a publicly known string occurs in a private dataset using the Levenshtein distance:

.. code:: python

    from crandas import string_metrics

    table = cd.DataFrame({"names": ["ellis", "mikael", "mary"]}, auto_bounds=True)

    # For improved performance, the score_cutoff parameter can be used. Any
    # strings with a higher distance than the cutoff are assigned a distance
    # of 4.
    table = table.assign(
        distance=string_metrics.levenshtein_distance(
            "michael", table["names"], score_cutoff=3
        )
    )

To count the number of strings with a distance of at most 3:

.. code:: python

    >>> (table["distance"] <= 3).sum()
    1

To filter records with a distance of at most 3:

.. code:: python

    >>> table[table["distance"] <= 3].open()
        names  distance
    0  mikael         2

A similar result can be achieved using the Jaro-Winkler similarity.

Private search string
---------------------

Both the Levenshtein and Jaro-Winkler methods can also be used to match a private search string against a private dataset.
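To see why exactly one record passed the public-search filter above, the capped distances can be checked with a plain-Python sketch of the Levenshtein metric. This is illustrative only: crandas evaluates the metric on encrypted data, and the capping at 4 here mirrors the ``score_cutoff`` behaviour described in the code comment above:

.. code:: python

    # Textbook dynamic-programming implementation of the Levenshtein
    # distance, shown only to illustrate the metric itself.
    def levenshtein(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    cutoff = 3
    names = ["ellis", "mikael", "mary"]
    # Distances above the cutoff are capped at cutoff + 1 = 4
    capped = [min(levenshtein("michael", n), cutoff + 1) for n in names]
    print(capped)                            # [4, 2, 4]
    print(sum(d <= cutoff for d in capped))  # 1

Only "mikael", at distance 2, falls within the cutoff, matching the count and filter results shown above.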
Such a private search string might be the result of another crandas operation, such as a join. For instance, using the Jaro-Winkler similarity, the following code can be used:

.. code:: python

    from crandas import string_metrics

    private_string = cd.DataFrame({"name": ["michael"]}, auto_bounds=True)
    table = cd.DataFrame({"names": ["ellis", "mikael", "mary"]}, auto_bounds=True)

    table = table.assign(
        similarity=string_metrics.jaro_winkler_similarity(
            table["names"], private_string["name"].as_value()
        )
    )

To count the number of strings with a similarity of at least 0.85:

.. code:: python

    >>> (table["similarity"] >= 0.85).sum()
    1

To filter records with a similarity of at least 0.85:

.. code:: python

    >>> table[table["similarity"] >= 0.85].open()
        names  similarity
    0  mikael    0.879365

A similar result can be achieved using the Levenshtein distance.
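For intuition, the similarity score of 0.879365 for "michael" and "mikael" can be derived by hand, assuming the textbook definition of the metric with the standard scaling factor :math:`p = 0.1` and a four-character prefix cap (the exact parameters used by crandas may differ). The two strings share :math:`m = 5` matched characters (m, i, a, e, l) with no transpositions (:math:`t = 0`), so the Jaro similarity is

.. math::

    \mathrm{Jaro} = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)
                  = \frac{1}{3}\left(\frac{5}{7} + \frac{5}{6} + \frac{5}{5}\right) \approx 0.849206.

With a common prefix of length :math:`\ell = 2` ("mi"), the Jaro-Winkler similarity becomes

.. math::

    \mathrm{JW} = \mathrm{Jaro} + \ell \, p \, (1 - \mathrm{Jaro})
                = 0.849206 + 0.2 \times 0.150794 \approx 0.879365,

matching the value in the opened table above.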