Fuzzy matching

This tutorial demonstrates how to leverage standard crandas functionality, such as filtering and vector columns, to implement a basic form of fuzzy string matching (a fuzzy inner join). While crandas does not offer fuzzy string matching out of the box, this tutorial shows how you can build it yourself with a bit of creativity.

The fuzzy matching operation works by comparing the word vectors of the names in the two CDataFrames. If the inner product of the vectors is greater than a certain threshold, the names are considered a match.
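
To make this rule concrete, here is a tiny plain-numpy sketch using made-up integer vectors (the real vectors come from a fastText model below); the names vec_a, vec_b and vec_c are purely illustrative:

import numpy as np

# Toy illustration of the matching rule (not part of the tutorial code).
# Vectors are rescaled so their L2 norm is roughly scaling_factor (100),
# so the inner product of two near-identical names approaches 100 * 100.
scaling_factor = 100
threshold = int(0.85 * scaling_factor * scaling_factor)  # 8500

vec_a = np.array([60, 80, 0])   # hypothetical vector for a name
vec_b = np.array([57, 82, 0])   # hypothetical vector for a small misspelling
vec_c = np.array([0, 100, 0])   # hypothetical vector for an unrelated name

print(np.inner(vec_a, vec_b) > threshold)  # True  -> treated as a match
print(np.inner(vec_a, vec_c) > threshold)  # False -> not a match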

First, ensure you have the fasttext library installed. If not, execute the command pip install fasttext-wheel in the terminal.

Warning

Be aware that running this in a Jupyter environment might lead to memory issues, so it is recommended to run this locally, or to reach out to Roseman Labs if you run into problems.

import fasttext
import numpy as np
import pandas as pd
import crandas as cd

# Uncomment the following four lines and adjust the values to set these session variables if you are not in a Jupyter environment hosted by Roseman Labs

# from crandas.base import session
# from pathlib import Path

# session.base_path = Path('base/path/to/vdl/secrets')
# session.endpoint = 'https://localhost:9820/api/v1'

# Show which version of crandas is being used (needs to be v1.7.0+ for this tutorial)
cd.base.session.version

Next, we’ll define two functions, load_model and get_word_vector.

The load_model function will load a pre-trained model from the specified file. If the file does not exist or does not contain a valid model, it will train a new model using the specified parameters. Once the model is trained, it is saved for future use.

The get_word_vector function will return the word vector of a given word. In binary mode, it returns an integer list representing the sign of each element in the vector. In integer mode, it rescales the vector and returns an integer list.

def load_model(model_filename):
    try:
        model = fasttext.load_model(model_filename)
    except ValueError:
        # train the model
        # parameters taken from: https://blog.reachsumit.com/posts/2020/07/spell-checker-fasttext/
        model = fasttext.train_unsupervised('../../data/fuzzy_data/wordlist.txt', wordNgrams=1, minn=1, maxn=2, dim=300, ws=8, neg=8, epoch=4, minCount=1, bucket=900000)
        # cache the trained model
        model.save_model(model_filename)
    return model

def get_word_vector(word):
    # evaluate the model and rescale & convert the vector to integer elements
    v = model.get_word_vector(word)
    if binary_mode:
        # coarser approach: keep only the sign of each element
        return np.sign(v).astype(int).tolist()
    else:
        l2norm = np.linalg.norm(v)
        return (scaling_factor * v / l2norm).astype(int).tolist()
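
Once a model is available, you can sanity-check the vectors locally. For example (illustrative only; it assumes the imports above, with the globals binary_mode and scaling_factor set as in the main script below, and the exact numbers depend on the trained model):

# Sanity check (illustrative): similar names should produce similar vectors.
model = load_model(model_filename='bigmodel.bin')
binary_mode = False
scaling_factor = 100

v1 = get_word_vector("Vimes, Samuel")
v2 = get_word_vector("Simes, Vamuel")
print(len(v1))           # 300 entries, one per model dimension
print(np.inner(v1, v2))  # should exceed the threshold (8500 here) for names this similar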

Finally, the main part of the script defines the mode of operation (binary or integer), loads the model, defines two dataframes with similar but slightly different names (e.g. Samuel Vimes / Vamuel Simes), uploads these as CDataFrames to the VDL, and applies the fuzzy matching operation.

After matching, the script performs an inner join on the matching keys and displays the result.

if __name__ == "__main__":

    binary_mode = False
    scaling_factor = 100
    # scaling factor should not be too big, otherwise there can be an overflow, and we need to reserve 'headroom' in the field for the comparison

    if binary_mode:
        print('using binary mode (rounds every entry from a word vector to {-1,1})')
        threshold = 0 # to be determined...
    else:
        print('using integer mode (rounds every entry from a word vector to a bounded signed integer)')
        threshold = int(.85 * scaling_factor * scaling_factor)
        # .85 seems to work well
        # higher values make the model more 'picky' (fewer matches, more misses)
        # lower values make the model less 'picky' (more matches, more false positives)

    print('============================')
    print("Loading model...")
    model = load_model(model_filename='bigmodel.bin')
    print('============================\n\n')

    # Define source dataframes a and b
    df_a = pd.DataFrame(
        {
            "Name": ["Jansen, Piet", "Vimes, Samuel", "Ironfoundersson, Carrot", "Nobbs, Cecil Wormsborough St. John",],
            "Rank": ["Soldier", "Commander", "Captain", "Corporal"],
            "Age": [33, 52, 24, 39],
        })

    df_b = pd.DataFrame(
        {
            "Name": ["Simes, Vamuel", "Ironfounderson, Carot", "Nobbs, Cecil Wormsborough St John",],
            "Favoritefood": ["Spaghetti", "Sprouts", "Pizza"],
            "Shoesize": [30, 40, 50],
        })

    # Upload dataframes a, b
    a_data = cd.upload_pandas_dataframe(df_a, name="a", auto_bounds=True)
    b_data = cd.upload_pandas_dataframe(df_b, name="b", auto_bounds=True)

    # Set fuzzy matching key (column name to fuzzy match on)
    match_key = "Name"
    # Create columns with metadata: indices and word vectors for each dataframe
    a_extra = cd.DataFrame({
        "ida": cd.Series(range(len(df_a[match_key]))),
        "veca": cd.Series(df_a[match_key].apply(get_word_vector))}, auto_bounds=True)
    b_extra = cd.DataFrame({
        "idb": cd.Series(range(len(df_b[match_key]))),
        "vecb": cd.Series(df_b[match_key].apply(get_word_vector))},auto_bounds=True)

    # Append metadata to original dataframes
    a = cd.concat([a_data, a_extra], axis=1)
    b = cd.concat([b_data, b_extra], axis=1)

    allbs = []
    input_a_length = a.shape[0]
    for i in range(input_a_length):
        id_to_use = a[i:i+1]["ida"].as_value()
        # value to compare: "veca" column of the ith row
        value_to_compare = a[i:i+1]["veca"].as_value()
        # compute new "out" column which is 1 if the inner product is greater than
        # the threshold, and 0 otherwise
        bf = b.assign(out=b["vecb"].inner(value_to_compare) > threshold)[["idb","out"]]
        bf = bf.assign(ida=id_to_use)
        allbs.append(bf)
        print(f"Fuzzy match all pairs: {round(100*i/(input_a_length+1))}%", end="\r")
    allbs = cd.concat(allbs)

    allbs_filtered = allbs[allbs["out"]==1][["ida","idb"]]

    print('============================')
    print("Join on fuzzy matched keys and show result")
    a_prime = cd.merge(a, allbs_filtered, how="inner", left_on='ida', right_on='ida')
    merged = cd.merge(a_prime, b, how="inner", left_on='idb', right_on='idb')
    print(merged.open())
    print('============================\n\n')

Running the script produces output like the following:

using integer mode (rounds every entry from a word vector to a bounded signed integer)
============================
Loading model...
============================


============================
Join on fuzzy matched keys and show result
                               Name_x       Rank  Age  ida
0  Nobbs, Cecil Wormsborough St. John   Corporal   39    3
1             Ironfoundersson, Carrot    Captain   24    2
2                       Vimes, Samuel  Commander   52    1

                                                veca  idb
0  [11, -2, -20, 5, 2, -12, 4, 0, 1, 3, -17, 6, 0...    2
1  [9, -4, -17, 6, -2, -8, 1, -4, -3, 0, -15, 0, ...    1
2  [10, 0, -21, 5, -4, -14, 3, 0, -2, 5, -19, 3, ...    0

                              Name_y Favoritefood  Shoesize
0  Nobbs, Cecil Wormsborough St John        Pizza        50
1              Ironfounderson, Carot      Sprouts        40
2                      Simes, Vamuel    Spaghetti        30

                                                vecb
0  [10, -2, -18, 5, 1, -12, 3, 0, 0, 2, -17, 5, 0...
1  [11, -3, -19, 6, -1, -9, 2, -3, -3, 0, -16, 0,...
2  [9, 0, -20, 6, -5, -14, 4, 0, -2, 5, -19, 4, -...
============================
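
Because the loop above compares every row of a with every row of b, the number of secure comparisons grows with len(a) * len(b), so the threshold is worth tuning before running the join on larger tables. A minimal sketch, assuming the model and get_word_vector defined above, that inspects inner products locally on plaintext names (outside the VDL):

# Sketch: inspect inner products of known pairs to pick a threshold
# before running the secure computation; uses plaintext names only.
pairs = [
    ("Vimes, Samuel", "Simes, Vamuel"),            # should match
    ("Vimes, Samuel", "Ironfoundersson, Carrot"),  # should not match
]
for left, right in pairs:
    score = np.inner(get_word_vector(left), get_word_vector(right))
    print(f"{left!r} vs {right!r}: inner product = {score}")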