crandas.string_metrics
String metrics for approximate string matching, string comparison, and fuzzy string searching.
edit_distance(left, right, distance_type='levenshtein', **type_opts)
Computes the edit distance between two string columns.
By default, the Levenshtein edit distance is computed, using the function `levenshtein_distance`.
| PARAMETER | DESCRIPTION |
|---|---|
| `left` | The string columns to compare. |
| `right` | The string columns to compare. |
| `distance_type` | The type of edit distance to compute. Currently, the types |
| `**type_opts` | Additional options for the edit distance type. |

| RETURNS | DESCRIPTION |
|---|---|
| `CSeries` | CSeries with the edit distances. |
jaro_distance(left, right)
Computes the Jaro distance between two string columns.
The Jaro distance is defined as $1 - \text{jaro similarity}$.
For more information about the Jaro similarity/distance, see here
| PARAMETER | DESCRIPTION |
|---|---|
| `left` | The string columns to compare. |
| `right` | The string columns to compare. |

| RETURNS | DESCRIPTION |
|---|---|
| `CSeries` | CSeries with the Jaro distances (fixed-point numbers in [0, 1]). |
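
For orientation, the commonly cited textbook definition of the Jaro similarity, and the distance derived from it, can be written as below. The symbols $s_1$, $s_2$, $m$ and $t$ are introduced here for illustration and do not appear in the crandas documentation; whether crandas follows exactly these conventions (for example for empty strings) is an assumption.

$$
\text{sim}_J(s_1, s_2) =
\begin{cases}
0 & \text{if } m = 0,\\
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise,}
\end{cases}
\qquad
d_J(s_1, s_2) = 1 - \text{sim}_J(s_1, s_2)
$$

Here $m$ is the number of matching characters (equal characters whose positions differ by at most $\lfloor \max(|s_1|, |s_2|)/2 \rfloor - 1$), and $t$ is half the number of matched characters that appear in a different order in the two strings.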
jaro_similarity(left, right)
Computes the Jaro similarity between two string columns.
For more information about the Jaro similarity/distance, see here
| PARAMETER | DESCRIPTION |
|---|---|
| `left` | The string columns to compare. |
| `right` | The string columns to compare. |

| RETURNS | DESCRIPTION |
|---|---|
| `CSeries` | CSeries with the Jaro similarities (fixed-point numbers in [0, 1]). |
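
To make the metric concrete, here is a minimal plaintext sketch of the standard Jaro similarity for a single pair of Python strings. It is not the crandas implementation: crandas operates on columns and returns fixed-point values, and the helper name `jaro_similarity_plain` and its edge-case conventions (two identical strings, including two empty ones, score 1.0) are assumptions made for this illustration.

```python
def jaro_similarity_plain(s1: str, s2: str) -> float:
    """Standard Jaro similarity of two plain Python strings (illustrative only)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Two equal characters count as a match only if their positions
    # differ by at most this window.
    window = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: half the number of matched characters that line up
    # with a different character in the other string's matched sequence.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


if __name__ == "__main__":
    sim = jaro_similarity_plain("MARTHA", "MARHTA")
    print(f"similarity={sim:.4f} distance={1 - sim:.4f}")
```

The corresponding Jaro distance is one minus the returned value.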
jaro_winkler_distance(left, right, prefix_length=4, prefix_weight=0.1)
Computes the Jaro-Winkler distance between two string columns.
The Jaro-Winkler distance is defined as $1 - \text{jaro winkler similarity}$.
For more information about the Jaro-Winkler similarity/distance, see here
| PARAMETER | DESCRIPTION |
|---|---|
| `left` | The string columns to compare. |
| `right` | The string columns to compare. |
| `prefix_length` | Integer denoting the length of the string prefix to take into account. This needs to be a non-negative integer. |
| `prefix_weight` | Weight assigned to matching string prefixes. This should be a float between 0 and $\frac{1}{\text{prefix\_length}}$. |

| RETURNS | DESCRIPTION |
|---|---|
| `CSeries` | CSeries with the Jaro-Winkler distances (fixed-point numbers in [0, 1]). |
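
The bound on `prefix_weight` can be read off the standard Jaro-Winkler formula, sketched below. The symbols $\ell$ and $p$ are introduced here for illustration, and it is an assumption that crandas follows this textbook definition.

$$
\text{sim}_{JW} = \text{sim}_J + \ell \, p \, (1 - \text{sim}_J),
\qquad
d_{JW} = 1 - \text{sim}_{JW}
$$

Here $\text{sim}_J$ is the Jaro similarity, $\ell$ is the length of the common prefix of the two strings, capped at `prefix_length`, and $p$ is `prefix_weight`. Because $\ell \le \text{prefix\_length}$ and $p \le \frac{1}{\text{prefix\_length}}$, the product $\ell p$ is at most 1, so $\text{sim}_{JW}$ stays within $[0, 1]$.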
jaro_winkler_similarity(left, right, prefix_length=4, prefix_weight=0.1)
Computes the Jaro-Winkler similarity between two string columns.
For more information about the Jaro-Winkler similarity/distance, see here
| PARAMETER | DESCRIPTION |
|---|---|
| `left` | The string columns to compare. |
| `right` | The string columns to compare. |
| `prefix_length` | Integer denoting the length of the string prefix to take into account. This needs to be a non-negative integer. |
| `prefix_weight` | Weight assigned to matching string prefixes. This should be a float between 0 and $\frac{1}{\text{prefix\_length}}$. |

| RETURNS | DESCRIPTION |
|---|---|
| `CSeries` | CSeries with the Jaro-Winkler similarities (fixed-point numbers in [0, 1]). |
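
The role of `prefix_length` and `prefix_weight` can be illustrated with a small plaintext sketch of the standard Winkler adjustment, applied to a Jaro similarity that has already been computed. This is not the crandas implementation; the helper name `winkler_adjust` and the choice to pass the Jaro similarity in as an argument are assumptions made to keep the example short.

```python
def winkler_adjust(
    jaro_sim: float,
    left: str,
    right: str,
    prefix_length: int = 4,
    prefix_weight: float = 0.1,
) -> float:
    """Apply the standard Winkler prefix bonus to a given Jaro similarity."""
    if prefix_length < 0:
        raise ValueError("prefix_length must be a non-negative integer")
    if prefix_weight < 0:
        raise ValueError("prefix_weight must be non-negative")
    if prefix_length > 0 and prefix_weight > 1 / prefix_length:
        raise ValueError("prefix_weight must be at most 1 / prefix_length")

    # Length of the common prefix, capped at prefix_length.
    common = 0
    for a, b in zip(left[:prefix_length], right[:prefix_length]):
        if a != b:
            break
        common += 1

    # Boost the similarity in proportion to the shared prefix.
    return jaro_sim + common * prefix_weight * (1.0 - jaro_sim)


if __name__ == "__main__":
    # 17/18 is the standard Jaro similarity of "MARTHA" and "MARHTA".
    sim = winkler_adjust(17 / 18, "MARTHA", "MARHTA")
    print(f"similarity={sim:.4f} distance={1 - sim:.4f}")
```

With `prefix_length=0` the adjustment has no effect and the Jaro similarity is returned unchanged.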
levenshtein_distance(left, right, score_cutoff=None)
Computes the Levenshtein edit distance between two string columns.
The Levenshtein edit distance is the minimum number of character insertions, deletions, and substitutions required to transform one string into the other.
| PARAMETER | DESCRIPTION |
|---|---|
| `left` | The string columns to compare. |
| `right` | The string columns to compare. |
| `score_cutoff` | Maximum edit distance to consider. If the edit distance is larger than |

| RETURNS | DESCRIPTION |
|---|---|
| `CSeries` | CSeries with the edit distances (integers). |
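
As a reference point for the definition above, the following is a minimal plaintext sketch of the Levenshtein distance for a single pair of Python strings, using the classic dynamic-programming recurrence. It is not the crandas implementation (which works on columns), the helper name `levenshtein_plain` is hypothetical, and the `score_cutoff` option is deliberately left out.

```python
def levenshtein_plain(left: str, right: str) -> int:
    """Levenshtein edit distance between two plain Python strings."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(left) < len(right):
        left, right = right, left

    # previous[j] is the distance between the processed prefix of `left`
    # and the first j characters of `right`.
    previous = list(range(len(right) + 1))
    for i, a in enumerate(left, start=1):
        current = [i]
        for j, b in enumerate(right, start=1):
            current.append(
                min(
                    previous[j] + 1,             # deletion
                    current[j - 1] + 1,          # insertion
                    previous[j - 1] + (a != b),  # substitution (free if equal)
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein_plain("kitten", "sitting"))
```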