Fuzzy string matching¶
Crandas provides functionalities for computing the similarity of strings, which allow you to match and filter strings which are approximately the same, but not necessarily identical. For example, it may be necessary to find whether a name occurs within a dataset, even in the presence of typos, spelling variants or formatting differences. Exact equality and substring search are often too strict for purpose.
This section describes two functionalities in crandas which can be used for this purpose, namely:
Levenshtein distance
Jaro-Winkler similarity
Levenshtein distance¶
The Levenshtein distance measures the number of character insertions, deletions and substitutions required to transform one string into the other.
To compute the Levenshtein distance between a public string and a column containing private strings, the following can be used:
import crandas as cd
from crandas import string_metrics
table = cd.DataFrame({"names": ["martha", "louise"]}, auto_bounds=True)
distance = string_metrics.levenshtein_distance("marta", table["names"])
Opening the result will give the following table:
>>> distance.open()
0 1
1 6
Name: , dtype: int64
As expected, this shows that the distance between “marta” and “martha” is only 1, for the deleted character ‘h’. On the other hand, the distance to “louise” is significantly larger.
It is also possible to compute the Levenshtein distance between strings of two private columns, using the following:
table1 = cd.DataFrame({"names": ["martha", "louise"]}, auto_bounds=True)
table2 = cd.DataFrame({"names": ["marta", "louisse"]}, auto_bounds=True)
distance = string_metrics.levenshtein_distance(table1["names"], table2["names"])
Opening the result will give the following table:
>>> distance.open()
0 1
1 1
Name: , dtype: int64
Jaro-Winkler similarity¶
The Jaro-Winkler similarity measures the similarity of two strings based on several aspects, including swapped characters and a common prefix. It is a number between 0 and 1, with a number close to 0 indicating a low similarity, and a number close to 1 indicating a near-match. The similarity is normalized based on the string lengths.
To compute the Jaro-Winkler similarity between a public string and a column containing private strings, the following code can be used:
from crandas import string_metrics
table = cd.DataFrame({"names": ["martha", "louise"]}, auto_bounds=True)
similarity = string_metrics.jaro_winkler_similarity("marta", table["names"])
Opening the result will give the following table:
>>> similarity.open()
0 0.966666
1 0.000000
Name: , dtype: float64
This shows a high similarity between “marta” and “martha”, as expected. Meanwhile, the similarity with “louise” is 0.
It is also possible to compute the Jaro-Winkler similarity of strings from two private columns, using the following code:
table1 = cd.DataFrame({"names": ["martha", "louise"]}, auto_bounds=True)
table2 = cd.DataFrame({"names": ["marta", "louisse"]}, auto_bounds=True)
similarity = string_metrics.jaro_winkler_similarity(table1["names"], table2["names"])
Opening the result will give the following table:
>>> similarity.open()
0 0.966666
1 0.971428
Name: , dtype: float64
In addition to the Jaro-Winkler similarity, crandas also provides the Jaro similarity.
The Jaro similarity is simpler than the Jaro-Winkler similarity, as the former does not account for a shared prefix.
In practice, the Jaro-Winkler similarity is more commonly used.
The Jaro similarity functionality can be accessed through the jaro_similarity function, which works in a manner analogous to jaro_winkler_similarity
The edit distance function edit_distance provides a convenience function that dispatches to the specific functions for Levenshtein distance, Jaro similarity, and Jaro-Winkler similarity, depending on the distance_type parameter.
Approximate searches¶
The Levenshtein distance and Jaro-Winkler similarity can be used for approximate string matching, including counting the number of matches, as well as filtering.
Public search string¶
For instance, to determine whether a publicly known string occurs in a private dataset using the Levenshtein distance:
from crandas import string_metrics
table = cd.DataFrame({"names": ["ellis", "mikael", "mary"]}, auto_bounds=True)
# For improved performance, the score_cutoff parameter can be used. Any
# strings with a higher distance than the cutoff are assigned a distance of 4.
table = table.assign(
distance=string_metrics.levenshtein_distance(
"michael", table["names"], score_cutoff=3
)
)
To count the number of strings with distance at most 3:
>>> (table["distance"] <= 3).sum()
1
To filter records with a distance at most 3:
>>> table[table["distance"] <= 3].open()
names distance
0 mikael 2
A similar result can be achieved using the Jaro-Winkler similarity.
Private search string¶
Both Levenshtein and Jaro-Winkler methods can also be used to match a private search string against a private dataset. Such a private search string might be the result of another crandas operation, such as a join.
For instance, using the Jaro-Winkler similarity, the following code can be used:
from crandas import string_metrics
private_string = cd.DataFrame({"name": ["michael"]}, auto_bounds=True)
table = cd.DataFrame({"names": ["ellis", "mikael", "mary"]}, auto_bounds=True)
table = table.assign(
similarity=string_metrics.jaro_winkler_similarity(
table["names"], private_string["name"].as_value()
)
)
To count the number of strings with a similarity of at least 0.85:
>>> (table["similarity"] >= 0.85).sum()
1
To filter records with a similarity of at least 0.85:
>>> table[table["similarity"] >= 0.85].open()
names similarity
0 mikael 0.879365
A similar result can be achieved using the Levenshtein distance.