Fuzzy matching 
================
    

This tutorial demonstrates how to leverage standard crandas functionality, such as filtering and vector columns, to implement a basic form of fuzzy string matching (a fuzzy inner join). While crandas does not inherently offer fuzzy string matching capabilities, this tutorial will show you how to implement this feature with creativity.

The fuzzy matching operation works by comparing the word vectors of the names in the two :class:`CDataFrames<.CDataFrame>`. If the inner product of the vectors is greater than a certain threshold, the names are considered a match.

First, ensure you have the ``fasttext`` library installed. If not, execute the command ``pip install fasttext-wheel`` in the terminal. 

.. warning::

    Be aware that running this in a Jupyter environment might encounter memory issues, so it's recommended to either run this locally or reach out to Roseman Labs if you encounter issues.
    

.. code:: python
    
    import fasttext
    import numpy as np
    import pandas as pd
    import crandas as cd

    # Uncomment the following 4 commented lines and change their value to set these session variables if not in a jupyter hosted by Roseman Labs
    
    # from crandas.base import session
    # from pathlib import Path
    
    # session.base_path = Path('base/path/to/vdl/secrets') 
    # session.endpoint = 'https://localhost:9820/api/v1'
    
    # Show which version of crandas is being used (needs to be v1.7.0+ for this tutorial)
    cd.base.session.version

Next, we'll define two functions, ``load_model`` and ``get_word_vector``. 

The ``load_model`` function will load a pre-trained model from the specified file. If the file does not exist or an invalid model is provided, it will train a new model using specified parameters. Once the model is trained, it is saved for future use.

The ``get_word_vector`` function will return the word vector of a given word. In binary mode, it returns an integer list representing the sign of each element in the vector. In integer mode, it rescales the vector and returns an integer list.

.. code:: python

    def load_model(model_filename):
        try:
            model = fasttext.load_model(model_filename)
        except ValueError:
            # train the model
            # parameters taken from: https://blog.reachsumit.com/posts/2020/07/spell-checker-fasttext/
            model = fasttext.train_unsupervised('../../data/fuzzy_data/wordlist.txt', wordNgrams=1, minn=1, maxn=2, dim=300, ws=8, neg=8, epoch=4, minCount=1, bucket=900000)
            # cache the trained model
            model.save_model(model_filename)
        return model

.. code:: python

    def get_word_vector(word):
        # evaluate the model and rescale & convert the vector to integer elements
        v = model.get_word_vector(word)
        if binary_mode:
            # courser approach
            return np.sign(v).astype(int).tolist()
        else:
            l2norm = np.linalg.norm(v)
            return (scaling_factor * v / l2norm).astype(int).tolist()

Finally, the main part of the script defines the mode of operation (binary or integer), loads the model, defines two dataframes with similar but slightly different names (e.g. Samuel Vimes /Vamuel Simes), uploads these :class:`.CDataFrame` to the VDL, and applies the fuzzy matching operation.

After matching, the script performs an inner join on the matching keys and displays the result.

.. code:: python

    if __name__ == "__main__":
    
        binary_mode = False
        scaling_factor = 100
        # scaling factor should not be too big, otherwise there can be an overflow, and we need to reserve 'headroom' in the field for the comparison
    
        if binary_mode:
            print('using binary mode (rounds every entry from a word vector to {-1,1})')
            threshold = 0 # to be determined...
        else:
            print('using integer mode (rounds every entry from a word vector to a bounded signed integer)')
            threshold = int(.85 * scaling_factor * scaling_factor)
            # .85 seems to work well
            # higher values makes the model more 'picky' (results in fewer matches, more missed)
            # lower values will make the model less 'picky' (result in more matches, more false positives)
    
        print('============================')
        print("Loading model...")
        model = load_model(model_filename='bigmodel.bin')
        print('============================\n\n')
    
        # Define source dataframes a and b
        df_a = pd.DataFrame(
            {
                "Name": ["Jansen, Piet", "Vimes, Samuel", "Ironfoundersson, Carrot", "Nobbs, Cecil Wormsborough St. John",],
                "Rank": ["Soldier", "Commander", "Captain", "Corporal"],
                "Age": [33, 52, 24, 39],
            })
    
        df_b = pd.DataFrame(
            {
                "Name": ["Simes, Vamuel", "Ironfounderson, Carot", "Nobbs, Cecil Wormsborough St John",],
                "Favoritefood": ["Spaghetti", "Sprouts", "Pizza"],
                "Shoesize": [30, 40, 50],
            })
    
        # Upload dataframes a, b
        a_data = cd.upload_pandas_dataframe(df_a, name="a",auto_bounds=True)
        b_data = cd.upload_pandas_dataframe(df_b, name="b", auto_bounds=True)
    
        # Set fuzzy matching key (column name to fuzzy match on)
        match_key = "Name"
        # Create columns with metadata: indices and word vectors for each dataframe
        a_extra = cd.DataFrame({
            "ida": cd.Series(range(len(df_a[match_key]))),
            "veca": cd.Series(df_a[match_key].apply(get_word_vector))}, auto_bounds=True)
        b_extra = cd.DataFrame({
            "idb": cd.Series(range(len(df_b[match_key]))),
            "vecb": cd.Series(df_b[match_key].apply(get_word_vector))},auto_bounds=True)
    
        # Append metadata to original dataframes
        a = cd.concat([a_data, a_extra], axis=1)
        b = cd.concat([b_data, b_extra], axis=1)
    
        allbs = []
        input_a_length = a.shape[0]
        for i in range(input_a_length):
            id_to_use=a[i:i+1]["ida"].as_value()
            # value to compare: "vec" column of ith row
            value_to_compare=a[i:i+1]["veca"].as_value()
            # compute new "out" column which is 1 if the inner product is smaller than
            # the threshold, and 0 otherwise
            bf = b.assign(out=b["vecb"].inner(value_to_compare) > threshold)[["idb","out"]]
            bf = bf.assign(ida=id_to_use)
            allbs.append(bf)
            print(f"Fuzzy match all pairs: {round(100*i/(input_a_length+1))}%", end="\r")
        allbs = cd.concat(allbs)
    
        allbs_filtered = allbs[allbs["out"]==1][["ida","idb"]]
    
        print('============================')
        print("Join on fuzzy matched keys and show result")
        a_prime = cd.merge(a, allbs_filtered, how="inner", left_on='ida', right_on='ida')
        merged = cd.merge(a_prime, b, how="inner", left_on='idb', right_on='idb')
        print(merged.open())
        print('============================\n\n')


.. parsed-literal::

    using integer mode (rounds every entry from a word vector to a bounded signed integer)
    ============================
    Loading model...
    ============================
    
    
    ============================
    Join on fuzzy matched keys and show result
                                   Name_x       Rank  Age  ida  \
    0  Nobbs, Cecil Wormsborough St. John   Corporal   39    3   
    1             Ironfoundersson, Carrot    Captain   24    2   
    2                       Vimes, Samuel  Commander   52    1   
    
                                                    veca  idb  \
    0  [11, -2, -20, 5, 2, -12, 4, 0, 1, 3, -17, 6, 0...    2   
    1  [9, -4, -17, 6, -2, -8, 1, -4, -3, 0, -15, 0, ...    1   
    2  [10, 0, -21, 5, -4, -14, 3, 0, -2, 5, -19, 3, ...    0   
    
                                  Name_y Favoritefood  Shoesize  \
    0  Nobbs, Cecil Wormsborough St John        Pizza        50   
    1              Ironfounderson, Carot      Sprouts        40   
    2                      Simes, Vamuel    Spaghetti        30   
    
                                                    vecb  
    0  [10, -2, -18, 5, 1, -12, 3, 0, 0, 2, -17, 5, 0...  
    1  [11, -3, -19, 6, -1, -9, 2, -3, -3, 0, -16, 0,...  
    2  [9, 0, -20, 6, -5, -14, 4, 0, -2, 5, -19, 4, -...  
    ============================