crandas.string_metrics

String metrics for approximate string matching, string comparison, and fuzzy string searching.

crandas.string_metrics.edit_distance(left, right, distance_type='levenshtein', **type_opts) -> CSeries

Computes the edit distance between two string columns.

By default, the Levenshtein edit distance is computed, using levenshtein_distance().

Parameters:
  • left (CSeries or str) – The first operand: a string column or a single string.

  • right (CSeries or str) – The second operand: a string column or a single string.

  • distance_type (str, optional) – The type of edit distance to compute. Currently, the types 'levenshtein', 'jaro' and 'jaro-winkler' are supported.

  • **type_opts – Additional options for the edit distance type.

Returns:

CSeries with the edit distances.

Return type:

CSeries

crandas.string_metrics.jaro_distance(left, right) -> CSeries

Computes the Jaro distance between two string columns.

The Jaro distance is defined as 1 - jaro_similarity.

For more information about the Jaro similarity/distance, see: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Parameters:
  • left (CSeries or str) – The first operand: a string column or a single string.

  • right (CSeries or str) – The second operand: a string column or a single string.

Returns:

CSeries with the Jaro distances (fixed-point numbers in [0, 1]).

Return type:

CSeries

crandas.string_metrics.jaro_similarity(left, right) -> CSeries

Computes the Jaro similarity between two string columns.

For more information about the Jaro similarity/distance, see: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Parameters:
  • left (CSeries or str) – The first operand: a string column or a single string.

  • right (CSeries or str) – The second operand: a string column or a single string.

Returns:

CSeries with the Jaro similarities (fixed-point numbers in [0, 1]).

Return type:

CSeries
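
To make the definition concrete, here is a minimal plaintext sketch of the Jaro similarity in pure Python, following the formula on the Wikipedia page linked above. It is not the crandas implementation: the name jaro_similarity_plain is hypothetical, and the handling of empty strings (two empty strings score 1, one empty string scores 0) is an assumption of this sketch.

```python
def jaro_similarity_plain(a: str, b: str) -> float:
    """Plaintext Jaro similarity of two Python strings (illustrative only)."""
    if a == b:
        return 1.0  # assumption: identical strings, including two empty ones, score 1
    if not a or not b:
        return 0.0

    # Characters count as matching only if they are equal and at most
    # `window` positions apart.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order,
    # counted in pairs.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2

    return (
        matches / len(a)
        + matches / len(b)
        + (matches - transpositions) / matches
    ) / 3
```

For "MARTHA" and "MARHTA" all six characters match and one pair is transposed, so the sketch gives (1 + 1 + 5/6) / 3 = 17/18. The corresponding Jaro distance is 1 minus this value.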

crandas.string_metrics.jaro_winkler_distance(left, right, prefix_length=4, prefix_weight=0.1) -> CSeries

Computes the Jaro-Winkler distance between two string columns.

The Jaro-Winkler distance is defined as 1 - jaro_winkler_similarity.

For more information about the Jaro-Winkler similarity/distance, see: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Parameters:
  • left (CSeries or str) – The first operand: a string column or a single string.

  • right (CSeries or str) – The second operand: a string column or a single string.

  • prefix_length (int) – Length of the string prefix to take into account. Must be a non-negative integer.

  • prefix_weight (float) – Weight given to matching string prefixes. Should be a float between 0 and 1 / prefix_length.

Returns:

CSeries with the Jaro-Winkler distances (fixed-point numbers in [0, 1]).

Return type:

CSeries

crandas.string_metrics.jaro_winkler_similarity(left, right, prefix_length=4, prefix_weight=0.1) -> CSeries

Computes the Jaro-Winkler similarity between two string columns.

For more information about the Jaro-Winkler similarity/distance, see: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Parameters:
  • left (CSeries or str) – The first operand: a string column or a single string.

  • right (CSeries or str) – The second operand: a string column or a single string.

  • prefix_length (int) – Length of the string prefix to take into account. Must be a non-negative integer.

  • prefix_weight (float) – Weight given to matching string prefixes. Should be a float between 0 and 1 / prefix_length.

Returns:

CSeries with the Jaro-Winkler similarities (fixed-point numbers in [0, 1]).

Return type:

CSeries
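
The role of prefix_length and prefix_weight can be illustrated with a small plaintext sketch of the Winkler adjustment as given on the Wikipedia page linked above: the Jaro similarity is increased in proportion to the length of the common prefix. The helper name jaro_winkler_from_jaro is hypothetical, it takes an already computed Jaro similarity as input, and whether crandas applies exactly this formula is an assumption.

```python
def jaro_winkler_from_jaro(sim_j: float, a: str, b: str,
                           prefix_length: int = 4,
                           prefix_weight: float = 0.1) -> float:
    """Apply the Winkler prefix bonus to a Jaro similarity (illustrative only)."""
    if prefix_length < 0:
        raise ValueError("prefix_length must be non-negative")
    if prefix_length > 0 and not 0 <= prefix_weight <= 1 / prefix_length:
        raise ValueError("prefix_weight must lie in [0, 1 / prefix_length]")

    # Length of the common prefix, capped at prefix_length.
    common = 0
    for ca, cb in zip(a[:prefix_length], b[:prefix_length]):
        if ca != cb:
            break
        common += 1

    # Since common * prefix_weight <= 1, the result never exceeds 1.
    return sim_j + common * prefix_weight * (1.0 - sim_j)
```

With the default parameters, "MARTHA" and "MARHTA" share the prefix "MAR", so a Jaro similarity of 17/18 becomes 17/18 + 3 * 0.1 * (1/18). The bound on prefix_weight keeps the adjusted similarity inside [0, 1]; the Jaro-Winkler distance is 1 minus the adjusted similarity.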

crandas.string_metrics.levenshtein_distance(left, right, score_cutoff=None) -> CSeries

Computes the Levenshtein edit distance between two string columns.

The Levenshtein edit distance is the minimum number of character insertions, deletions and substitutions required to transform one string into the other.

Parameters:
  • left (CSeries or str) – The first operand: a string column or a single string.

  • right (CSeries or str) – The second operand: a string column or a single string.

  • score_cutoff (int, optional) – Maximum edit distance to consider. If the edit distance is larger than score_cutoff, score_cutoff + 1 is returned. If None, no cutoff is applied. A lower value improves performance.

Returns:

CSeries with the edit distances (integers).

Return type:

CSeries
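
The score_cutoff semantics described above can be sketched on plain Python strings as follows. This is an illustrative dynamic-programming version, not the crandas implementation; the function name levenshtein_plain is hypothetical, and the early exits only hint at why a lower cutoff can reduce work.

```python
def levenshtein_plain(a: str, b: str, score_cutoff=None) -> int:
    """Plaintext Levenshtein distance with an optional cutoff (illustrative only)."""
    # Keep the shorter string in the inner loop so rows stay small.
    if len(a) < len(b):
        a, b = b, a

    # The distance is at least the difference in lengths.
    if score_cutoff is not None and len(a) - len(b) > score_cutoff:
        return score_cutoff + 1

    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        # Row minima never decrease, so once a whole row exceeds the
        # cutoff the final distance must exceed it as well.
        if score_cutoff is not None and min(current) > score_cutoff:
            return score_cutoff + 1
        previous = current

    distance = previous[-1]
    if score_cutoff is not None and distance > score_cutoff:
        return score_cutoff + 1
    return distance
```

For example, "kitten" and "sitting" have distance 3; with score_cutoff=1 the sketch returns 2, that is, score_cutoff + 1, signalling only that the distance exceeds the cutoff.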