crandas.string_metrics

String metrics for approximate string matching and comparison, and for fuzzy string searching.

edit_distance(left, right, distance_type='levenshtein', **type_opts)

Computes the edit distance between two string columns.

By default, the Levenshtein edit distance is computed using the function levenshtein_distance.

PARAMETER DESCRIPTION
left

The string columns to compare.

TYPE: CSeries or str

right

The string columns to compare.

TYPE: CSeries or str

distance_type

The type of edit distance to compute. Currently, the types levenshtein, jaro and jaro-winkler are supported.

TYPE: str DEFAULT: 'levenshtein'

**type_opts

Additional options for the edit distance type.

DEFAULT: {}

RETURNS DESCRIPTION
CSeries

CSeries with the edit distances.

jaro_distance(left, right)

Computes the Jaro distance between two string columns.

The Jaro distance is defined as $1 - \text{jaro similarity}$.

For more information about the Jaro similarity/distance, see here

PARAMETER DESCRIPTION
left

The string columns to compare.

TYPE: CSeries or str

right

The string columns to compare.

TYPE: CSeries or str

RETURNS DESCRIPTION
CSeries

CSeries with the Jaro distances (fixed-point numbers in [0, 1]).

jaro_similarity(left, right)

Computes the Jaro similarity between two string columns.

For more information about the Jaro similarity/distance, see here

PARAMETER DESCRIPTION
left

The string columns to compare.

TYPE: CSeries or str

right

The string columns to compare.

TYPE: CSeries or str

RETURNS DESCRIPTION
CSeries

CSeries with the Jaro similarities (fixed-point numbers in [0, 1]).
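
To make the relationship between the Jaro similarity and the Jaro distance concrete, here is a minimal plain-Python sketch of the textbook definition for a single pair of strings. It is an illustration only: it does not use crandas, it works on ordinary `str` values rather than `CSeries`, and it returns floats rather than fixed-point numbers. The function names mirror the ones documented above, but the bodies are assumptions and not the library's implementation.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Textbook Jaro similarity between two plain Python strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters count as a match only if their positions
    # differ by at most this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are counted
    # in pairs as transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


def jaro_distance(s1: str, s2: str) -> float:
    """Jaro distance, defined as 1 minus the Jaro similarity."""
    return 1.0 - jaro_similarity(s1, s2)
```

For example, "MARTHA" and "MARHTA" share six matching characters with one transposition, so the sketch computes a similarity of 17/18 and a distance of 1/18.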

jaro_winkler_distance(left, right, prefix_length=4, prefix_weight=0.1)

Computes the Jaro-Winkler distance between two string columns.

The Jaro-Winkler distance is defined as $1 - \text{jaro winkler similarity}$.

For more information about the Jaro-Winkler similarity/distance, see here

PARAMETER DESCRIPTION
left

The string columns to compare.

TYPE: CSeries or str

right

The string columns to compare.

TYPE: CSeries or str

prefix_length

Length of the string prefix to take into account. Must be a non-negative integer.

TYPE: int DEFAULT: 4

prefix_weight

Weight assigned to a matching string prefix. Should be a float between 0 and $\frac{1}{\text{prefix length}}$.

TYPE: float DEFAULT: 0.1

RETURNS DESCRIPTION
CSeries

CSeries with the Jaro-Winkler distances (fixed-point numbers in [0, 1]).

jaro_winkler_similarity(left, right, prefix_length=4, prefix_weight=0.1)

Computes the Jaro-Winkler similarity between two string columns.

For more information about the Jaro-Winkler similarity/distance, see here

PARAMETER DESCRIPTION
left

The string columns to compare.

TYPE: CSeries or str

right

The string columns to compare.

TYPE: CSeries or str

prefix_length

Length of the string prefix to take into account. Must be a non-negative integer.

TYPE: int DEFAULT: 4

prefix_weight

Weight assigned to a matching string prefix. Should be a float between 0 and $\frac{1}{\text{prefix length}}$.

TYPE: float DEFAULT: 0.1

RETURNS DESCRIPTION
CSeries

CSeries with the Jaro-Winkler similarities (fixed-point numbers in [0, 1]).
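
The role of prefix_length and prefix_weight is easiest to see in the standard Winkler adjustment, $\text{sim}_w = \text{sim}_j + \ell \, p \, (1 - \text{sim}_j)$, where $\text{sim}_j$ is the Jaro similarity, $\ell$ is the length of the common prefix (capped at prefix_length) and $p$ is prefix_weight. Keeping $p \le \frac{1}{\text{prefix length}}$ guarantees $\ell \, p \le 1$, so the result stays in [0, 1]. The sketch below applies this adjustment to an already computed Jaro similarity. The helper name `winkler_adjust` is hypothetical, the code works on plain Python strings rather than `CSeries`, and it applies the prefix bonus unconditionally; whether crandas behaves identically in every detail is an assumption.

```python
def winkler_adjust(
    jaro_sim: float,
    s1: str,
    s2: str,
    prefix_length: int = 4,
    prefix_weight: float = 0.1,
) -> float:
    """Apply the Winkler prefix bonus to a Jaro similarity (hypothetical helper)."""
    if prefix_length < 0:
        raise ValueError("prefix_length must be a non-negative integer")
    if prefix_length > 0 and not 0 <= prefix_weight <= 1 / prefix_length:
        raise ValueError("prefix_weight must lie in [0, 1 / prefix_length]")
    # Count the leading characters that agree, looking at no more than
    # prefix_length characters of each string.
    common = 0
    for c1, c2 in zip(s1[:prefix_length], s2[:prefix_length]):
        if c1 != c2:
            break
        common += 1
    # Move the score toward 1 in proportion to the shared prefix.
    return jaro_sim + common * prefix_weight * (1 - jaro_sim)
```

With the defaults, "MARTHA" and "MARHTA" (Jaro similarity 17/18, common prefix "MAR") give 17/18 + 3 × 0.1 × 1/18 ≈ 0.9611. The corresponding Jaro-Winkler distance is 1 minus this value.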

levenshtein_distance(left, right, score_cutoff=None)

Computes the Levenshtein edit distance between two string columns.

The Levenshtein edit distance is the minimum number of character insertions, deletions and substitutions required to transform one string into the other.

PARAMETER DESCRIPTION
left

The string columns to compare.

TYPE: CSeries or str

right

The string columns to compare.

TYPE: CSeries or str

score_cutoff

Maximum edit distance to consider. If the edit distance is larger than score_cutoff, score_cutoff + 1 is returned. If None, no cutoff is applied. A lower value improves performance.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
CSeries

CSeries with the edit distances (integers).
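
As an illustration of the definition and of the score_cutoff rule, here is a minimal plain-Python sketch for a single pair of strings. It is not the crandas implementation: it operates on ordinary `str` values instead of `CSeries`, and the function name `levenshtein` is chosen for this example only.

```python
from typing import Optional


def levenshtein(a: str, b: str, score_cutoff: Optional[int] = None) -> int:
    """Plain-Python reference for the Levenshtein edit distance."""
    # prev[j] is the distance between the already processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    distance = prev[-1]
    # Documented cutoff rule: any distance above the cutoff is reported
    # as score_cutoff + 1.
    if score_cutoff is not None and distance > score_cutoff:
        return score_cutoff + 1
    return distance
```

For "kitten" and "sitting" the sketch returns 3; with score_cutoff=1 it returns 2, that is, score_cutoff + 1. The sketch only reproduces the returned value. It always fills the full table and therefore does not show the performance benefit that a lower cutoff is documented to give.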