Fuzzy matching ================ This tutorial demonstrates how to leverage standard crandas functionality, such as filtering and vector columns, to implement a basic form of fuzzy string matching (a fuzzy inner join). While crandas does not inherently offer fuzzy string matching capabilities, this tutorial will show you how to implement this feature with creativity. The fuzzy matching operation works by comparing the word vectors of the names in the two :class:`CDataFrames<.CDataFrame>`. If the inner product of the vectors is greater than a certain threshold, the names are considered a match. First, ensure you have the ``fasttext`` library installed. If not, execute the command ``pip install fasttext-wheel`` in the terminal. .. warning:: Be aware that running this in a Jupyter environment might encounter memory issues, so it's recommended to either run this locally or reach out to Roseman Labs if you encounter issues. .. code:: python import fasttext import numpy as np import pandas as pd import crandas as cd # Uncomment the following 4 commented lines and change their value to set these session variables if not in a jupyter hosted by Roseman Labs # from crandas.base import session # from pathlib import Path # session.base_path = Path('base/path/to/vdl/secrets') # session.endpoint = 'https://localhost:9820/api/v1' # Show which version of crandas is being used (needs to be v1.7.0+ for this tutorial) cd.base.session.version Next, we'll define two functions, ``load_model`` and ``get_word_vector``. The ``load_model`` function will load a pre-trained model from the specified file. If the file does not exist or an invalid model is provided, it will train a new model using specified parameters. Once the model is trained, it is saved for future use. The ``get_word_vector`` function will return the word vector of a given word. In binary mode, it returns an integer list representing the sign of each element in the vector. In integer mode, it rescales the vector and returns an integer list. .. code:: python def load_model(model_filename): try: model = fasttext.load_model(model_filename) except ValueError: # train the model # parameters taken from: https://blog.reachsumit.com/posts/2020/07/spell-checker-fasttext/ model = fasttext.train_unsupervised('../../data/fuzzy_data/wordlist.txt', wordNgrams=1, minn=1, maxn=2, dim=300, ws=8, neg=8, epoch=4, minCount=1, bucket=900000) # cache the trained model model.save_model(model_filename) return model .. code:: python def get_word_vector(word): # evaluate the model and rescale & convert the vector to integer elements v = model.get_word_vector(word) if binary_mode: # courser approach return np.sign(v).astype(int).tolist() else: l2norm = np.linalg.norm(v) return (scaling_factor * v / l2norm).astype(int).tolist() Finally, the main part of the script defines the mode of operation (binary or integer), loads the model, defines two dataframes with similar but slightly different names (e.g. Samuel Vimes /Vamuel Simes), uploads these :class:`.CDataFrame` to the VDL, and applies the fuzzy matching operation. After matching, the script performs an inner join on the matching keys and displays the result. .. code:: python if __name__ == "__main__": binary_mode = False scaling_factor = 100 # scaling factor should not be too big, otherwise there can be an overflow, and we need to reserve 'headroom' in the field for the comparison if binary_mode: print('using binary mode (rounds every entry from a word vector to {-1,1})') threshold = 0 # to be determined... else: print('using integer mode (rounds every entry from a word vector to a bounded signed integer)') threshold = int(.85 * scaling_factor * scaling_factor) # .85 seems to work well # higher values makes the model more 'picky' (results in fewer matches, more missed) # lower values will make the model less 'picky' (result in more matches, more false positives) print('============================') print("Loading model...") model = load_model(model_filename='bigmodel.bin') print('============================\n\n') # Define source dataframes a and b df_a = pd.DataFrame( { "Name": ["Jansen, Piet", "Vimes, Samuel", "Ironfoundersson, Carrot", "Nobbs, Cecil Wormsborough St. John",], "Rank": ["Soldier", "Commander", "Captain", "Corporal"], "Age": [33, 52, 24, 39], }) df_b = pd.DataFrame( { "Name": ["Simes, Vamuel", "Ironfounderson, Carot", "Nobbs, Cecil Wormsborough St John",], "Favoritefood": ["Spaghetti", "Sprouts", "Pizza"], "Shoesize": [30, 40, 50], }) # Upload dataframes a, b a_data = cd.upload_pandas_dataframe(df_a, name="a",auto_bounds=True) b_data = cd.upload_pandas_dataframe(df_b, name="b", auto_bounds=True) # Set fuzzy matching key (column name to fuzzy match on) match_key = "Name" # Create columns with metadata: indices and word vectors for each dataframe a_extra = cd.DataFrame({ "ida": cd.Series(range(len(df_a[match_key]))), "veca": cd.Series(df_a[match_key].apply(get_word_vector))}, auto_bounds=True) b_extra = cd.DataFrame({ "idb": cd.Series(range(len(df_b[match_key]))), "vecb": cd.Series(df_b[match_key].apply(get_word_vector))},auto_bounds=True) # Append metadata to original dataframes a = cd.concat([a_data, a_extra], axis=1) b = cd.concat([b_data, b_extra], axis=1) allbs = [] input_a_length = a.shape[0] for i in range(input_a_length): id_to_use=a[i:i+1]["ida"].as_value() # value to compare: "vec" column of ith row value_to_compare=a[i:i+1]["veca"].as_value() # compute new "out" column which is 1 if the inner product is smaller than # the threshold, and 0 otherwise bf = b.assign(out=b["vecb"].inner(value_to_compare) > threshold)[["idb","out"]] bf = bf.assign(ida=id_to_use) allbs.append(bf) print(f"Fuzzy match all pairs: {round(100*i/(input_a_length+1))}%", end="\r") allbs = cd.concat(allbs) allbs_filtered = allbs[allbs["out"]==1][["ida","idb"]] print('============================') print("Join on fuzzy matched keys and show result") a_prime = cd.merge(a, allbs_filtered, how="inner", left_on='ida', right_on='ida') merged = cd.merge(a_prime, b, how="inner", left_on='idb', right_on='idb') print(merged.open()) print('============================\n\n') .. parsed-literal:: using integer mode (rounds every entry from a word vector to a bounded signed integer) ============================ Loading model... ============================ ============================ Join on fuzzy matched keys and show result Name_x Rank Age ida \ 0 Nobbs, Cecil Wormsborough St. John Corporal 39 3 1 Ironfoundersson, Carrot Captain 24 2 2 Vimes, Samuel Commander 52 1 veca idb \ 0 [11, -2, -20, 5, 2, -12, 4, 0, 1, 3, -17, 6, 0... 2 1 [9, -4, -17, 6, -2, -8, 1, -4, -3, 0, -15, 0, ... 1 2 [10, 0, -21, 5, -4, -14, 3, 0, -2, 5, -19, 3, ... 0 Name_y Favoritefood Shoesize \ 0 Nobbs, Cecil Wormsborough St John Pizza 50 1 Ironfounderson, Carot Sprouts 40 2 Simes, Vamuel Spaghetti 30 vecb 0 [10, -2, -18, 5, 1, -12, 3, 0, 0, 2, -17, 5, 0... 1 [11, -3, -19, 6, -1, -9, 2, -3, -3, 0, -16, 0,... 2 [9, 0, -20, 6, -5, -14, 4, 0, -2, 5, -19, 4, -... ============================