Fuzzy matching¶
This tutorial demonstrates how to leverage standard crandas functionality, such as filtering and vector columns, to implement a basic form of fuzzy string matching (a fuzzy inner join). While crandas does not inherently offer fuzzy string matching capabilities, this tutorial will show you how to implement this feature with creativity.
The fuzzy matching operation works by comparing the word vectors of the names in the two CDataFrames
. If the inner product of the vectors is greater than a certain threshold, the names are considered a match.
First, ensure you have the fasttext
library installed. If not, execute the command pip install fasttext-wheel
in the terminal.
Warning
Be aware that running this in a Jupyter environment might encounter memory issues, so it’s recommended to either run this locally or reach out to Roseman Labs if you encounter issues.
import fasttext
import numpy as np
import pandas as pd
import crandas as cd
# Uncomment the following 4 commented lines and change their value to set these session variables if not in a jupyter hosted by Roseman Labs
# from crandas.base import session
# from pathlib import Path
# session.base_path = Path('base/path/to/vdl/secrets')
# session.endpoint = 'https://localhost:9820/api/v1'
# Show which version of crandas is being used (needs to be v1.7.0+ for this tutorial)
cd.base.session.version
Next, we’ll define two functions, load_model
and get_word_vector
.
The load_model
function will load a pre-trained model from the specified file. If the file does not exist or an invalid model is provided, it will train a new model using specified parameters. Once the model is trained, it is saved for future use.
The get_word_vector
function will return the word vector of a given word. In binary mode, it returns an integer list representing the sign of each element in the vector. In integer mode, it rescales the vector and returns an integer list.
def load_model(model_filename):
try:
model = fasttext.load_model(model_filename)
except ValueError:
# train the model
# parameters taken from: https://blog.reachsumit.com/posts/2020/07/spell-checker-fasttext/
model = fasttext.train_unsupervised('../../data/fuzzy_data/wordlist.txt', wordNgrams=1, minn=1, maxn=2, dim=300, ws=8, neg=8, epoch=4, minCount=1, bucket=900000)
# cache the trained model
model.save_model(model_filename)
return model
def get_word_vector(word):
# evaluate the model and rescale & convert the vector to integer elements
v = model.get_word_vector(word)
if binary_mode:
# courser approach
return np.sign(v).astype(int).tolist()
else:
l2norm = np.linalg.norm(v)
return (scaling_factor * v / l2norm).astype(int).tolist()
Finally, the main part of the script defines the mode of operation (binary or integer), loads the model, defines two dataframes with similar but slightly different names (e.g. Samuel Vimes /Vamuel Simes), uploads these CDataFrame
to the VDL, and applies the fuzzy matching operation.
After matching, the script performs an inner join on the matching keys and displays the result.
if __name__ == "__main__":
binary_mode = False
scaling_factor = 100
# scaling factor should not be too big, otherwise there can be an overflow, and we need to reserve 'headroom' in the field for the comparison
if binary_mode:
print('using binary mode (rounds every entry from a word vector to {-1,1})')
threshold = 0 # to be determined...
else:
print('using integer mode (rounds every entry from a word vector to a bounded signed integer)')
threshold = int(.85 * scaling_factor * scaling_factor)
# .85 seems to work well
# higher values makes the model more 'picky' (results in fewer matches, more missed)
# lower values will make the model less 'picky' (result in more matches, more false positives)
print('============================')
print("Loading model...")
model = load_model(model_filename='bigmodel.bin')
print('============================\n\n')
# Define source dataframes a and b
df_a = pd.DataFrame(
{
"Name": ["Jansen, Piet", "Vimes, Samuel", "Ironfoundersson, Carrot", "Nobbs, Cecil Wormsborough St. John",],
"Rank": ["Soldier", "Commander", "Captain", "Corporal"],
"Age": [33, 52, 24, 39],
})
df_b = pd.DataFrame(
{
"Name": ["Simes, Vamuel", "Ironfounderson, Carot", "Nobbs, Cecil Wormsborough St John",],
"Favoritefood": ["Spaghetti", "Sprouts", "Pizza"],
"Shoesize": [30, 40, 50],
})
# Upload dataframes a, b
a_data = cd.upload_pandas_dataframe(df_a, name="a",auto_bounds=True)
b_data = cd.upload_pandas_dataframe(df_b, name="b", auto_bounds=True)
# Set fuzzy matching key (column name to fuzzy match on)
match_key = "Name"
# Create columns with metadata: indices and word vectors for each dataframe
a_extra = cd.DataFrame({
"ida": cd.Series(range(len(df_a[match_key]))),
"veca": cd.Series(df_a[match_key].apply(get_word_vector))}, auto_bounds=True)
b_extra = cd.DataFrame({
"idb": cd.Series(range(len(df_b[match_key]))),
"vecb": cd.Series(df_b[match_key].apply(get_word_vector))},auto_bounds=True)
# Append metadata to original dataframes
a = cd.concat([a_data, a_extra], axis=1)
b = cd.concat([b_data, b_extra], axis=1)
allbs = []
input_a_length = a.shape[0]
for i in range(input_a_length):
id_to_use=a[i:i+1]["ida"].as_value()
# value to compare: "vec" column of ith row
value_to_compare=a[i:i+1]["veca"].as_value()
# compute new "out" column which is 1 if the inner product is smaller than
# the threshold, and 0 otherwise
bf = b.assign(out=b["vecb"].inner(value_to_compare) > threshold)[["idb","out"]]
bf = bf.assign(ida=id_to_use)
allbs.append(bf)
print(f"Fuzzy match all pairs: {round(100*i/(input_a_length+1))}%", end="\r")
allbs = cd.concat(allbs)
allbs_filtered = allbs[allbs["out"]==1][["ida","idb"]]
print('============================')
print("Join on fuzzy matched keys and show result")
a_prime = cd.merge(a, allbs_filtered, how="inner", left_on='ida', right_on='ida')
merged = cd.merge(a_prime, b, how="inner", left_on='idb', right_on='idb')
print(merged.open())
print('============================\n\n')
using integer mode (rounds every entry from a word vector to a bounded signed integer) ============================ Loading model... ============================ ============================ Join on fuzzy matched keys and show result Name_x Rank Age ida 0 Nobbs, Cecil Wormsborough St. John Corporal 39 3 1 Ironfoundersson, Carrot Captain 24 2 2 Vimes, Samuel Commander 52 1 veca idb 0 [11, -2, -20, 5, 2, -12, 4, 0, 1, 3, -17, 6, 0... 2 1 [9, -4, -17, 6, -2, -8, 1, -4, -3, 0, -15, 0, ... 1 2 [10, 0, -21, 5, -4, -14, 3, 0, -2, 5, -19, 3, ... 0 Name_y Favoritefood Shoesize 0 Nobbs, Cecil Wormsborough St John Pizza 50 1 Ironfounderson, Carot Sprouts 40 2 Simes, Vamuel Spaghetti 30 vecb 0 [10, -2, -18, 5, 1, -12, 3, 0, 0, 2, -17, 5, 0... 1 [11, -3, -19, 6, -1, -9, 2, -3, -3, 0, -16, 0,... 2 [9, 0, -20, 6, -5, -14, 4, 0, -2, 5, -19, 4, -... ============================