This package is a Python implementation of the R fuzzylink package, which implements the probabilistic record linkage procedure described in Ornstein (2025). At its core, fuzzylink-py merges two pandas data frames; unlike traditional join methods, however, it supports "fuzzy matching" on one field.
You can install fuzzylink-py using pip by running:
pip install fuzzylink-py
Using fuzzylink-py requires two models: the first acquires text embeddings to capture the semantic meaning of your fuzzy strings, and the second generates Y/N labels to identify matches. Fuzzylink-py supports locally running OpenAI-compatible models with valid port numbers (and API keys if needed). However, if you wish to use closed-source models, you will need to set up API access.
Sign up for a developer account with OpenAI and create an API key if you do not already have one. OpenAI requires you to purchase credits before use; $5 is a safe starting amount. Once you have your API key, you can either pass it as a parameter when running fuzzylink-py, or enter the following in your terminal:
export OPENAI_API_KEY="EnterYourUniqueKeyHere"
You can also use models from Mistral; set up a developer account and generate an API key. If you do not wish to pass this key to every call to fuzzylink-py, enter the following in your terminal:
export MISTRAL_API_KEY="EnterYourUniqueKeyHere"
Google/Gemini models are also supported. Once you have a developer account set up, create your API key. Again, you can enter the following command in your terminal if you do not wish to pass the key explicitly in your Python code:
export GOOGLE_API_KEY="EnterYourUniqueKeyHere"
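When a key is exported this way, fuzzylink-py picks it up from the environment instead of requiring an explicit argument. As a quick sanity check (plain standard library, nothing package-specific), you can confirm from Python that the variables are visible:

```python
import os

# check which provider keys are visible to Python; fuzzylink-py falls back
# to these environment variables when no explicit api key argument is passed
for var in ("OPENAI_API_KEY", "MISTRAL_API_KEY", "GOOGLE_API_KEY"):
    print(var, "is set" if os.environ.get(var) else "is NOT set")
```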
Let's look at an example where we want to merge two hypothetical data frames (data_frame1 and data_frame2) containing information about people; we want to block on age and fuzzy match on name.
Assuming you are running OpenAI models for embeddings and label generation, this is the bare-minimum code you would need:
from fuzzylink import FuzzyLink
# create a FuzzyLink instance
linker = FuzzyLink(
    merge_on="name", # column to fuzzy match on
    block_on=["age"], # column(s) to block on
    record_type="person", # the type of record being matched
    embedding_model="text-embedding-3-small", # specific embedding model to use
    embedding_provider="OpenAI", # the provider for the embedding model
    labeling_model="o4-mini", # the specific LLM to use for generating labels
    labeling_provider="OpenAI" # the provider for the labeling model
)
# match the records from data_frame1 with those from data_frame2 (and train a reusable model)
matched_records = linker.train(
    dfA=data_frame1, # the left data frame
    dfB=data_frame2, # the right data frame
    embedding_api_key="YourUniqueKey", # the OpenAI API key if not set as an environment variable
    labeling_api_key="YourUniqueKey" # the OpenAI API key if not set as an environment variable
)
You could achieve the same thing using locally running models like so:
from fuzzylink import FuzzyLink
# create a FuzzyLink instance
linker = FuzzyLink(
    merge_on="name", # column to fuzzy match on
    block_on=["age"], # column(s) to block on
    record_type="person", # the type of record being matched
    embedding_model="Local", # use a locally running model for embeddings
    labeling_model="Local", # use a locally running model for label generation
)
# match the records from data_frame1 with those from data_frame2 (and train a reusable model for this data)
matched_records = linker.train(
    dfA=data_frame1, # the left data frame
    dfB=data_frame2, # the right data frame
    embedding_api_key="YourUniqueKey", # the key for the locally running model (if needed)
    embedding_port=8080,
    labeling_api_key="YourUniqueKey", # the key for the locally running model (if needed)
    labeling_port=8081
)
So, what did this do?
When we initialize the FuzzyLink instance, we can make some important decisions about how our probabilistic model will be trained and how our data frames will be merged.
- `merge_on="name"` establishes the column which will be fuzzy matched on (this column must exist in any data frames that this FuzzyLink object operates on)
- `block_on=["age"]` indicates that the data frames should be blocked by the "age" column. While optional, establishing good blocking columns will significantly speed up this procedure and is highly recommended.
- `record_type="person"` provides additional context to the LLM about what the entities being matched are, to help improve label generation accuracy
- `embedding_model` and `labeling_model` (along with `embedding_provider` and `labeling_provider`) specify which models are used for acquiring embeddings and generating labels to train the probabilistic model
By specifying just these parameters, we accept some default behavior (more details about parameters can be found in the Advanced Customization and Additional Functions section), including:
- The probabilistic model used will be a sklearn LogisticRegression model with `max_iter=1000`, `C=np.inf`, and `tol=1e-8` (can be set with `classification_model`)
- The similarity scores used to train the probabilistic model will be cosine similarity of the fuzzy string embeddings and Jaro-Winkler string similarity (additional scores can be added with `additional_operations`)
- The dimensionality of the embeddings will be 256 (can be set with `embedding_dimension`)
- A total of 5000 labels will be generated using the `labeling_model` (can be set with `max_labels`)
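For intuition about the first default feature, cosine similarity between two embedding vectors can be sketched with plain numpy. This mirrors the idea, not the package's internal code:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# orthogonal-ish toy vectors share one of two active components
print(round(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]), 3))  # 0.5
```

Strings with similar meanings receive embeddings with high cosine similarity, which is what makes this a useful training feature for the classifier.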
After this instance has been created (and saved as linker), we can use it. Calling linker.train() accomplishes two main goals: (1) it trains a probabilistic model to identify matches for this specific type of data, and in the process (2) it merges the two input data frames as specified. The parameters passed to this function call specify the data frames and how to connect to the models. Without specifying anything further here, the following defaults are kept:
- The join type between the left and right input data frames will be an inner join (only matched records will be kept; can be set with `how`)
- The only columns in the final, merged data frame will be those from the original input data frames; columns containing additional information (i.e. cosine similarity scores, match probabilities, etc.) will be discarded (can be changed by setting `include_details=True`)
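The join semantics here are the same as in an ordinary pandas merge. A toy exact-match illustration using plain pandas (not fuzzylink-py itself) shows the difference between the default inner join and a left join:

```python
import pandas as pd

left = pd.DataFrame({"name": ["Jon", "Amy"], "age": [30, 41]})
right = pd.DataFrame({"name": ["Jon", "Bea"], "city": ["NYC", "LA"]})

# "inner" keeps only matched rows; "left" keeps every row from the left frame
inner = left.merge(right, on="name", how="inner")
left_join = left.merge(right, on="name", how="left")
print(len(inner), len(left_join))  # 1 2
```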
So now, you can do whatever you want with matched_records! Additionally, you can save the generated embeddings (which can be used to save compute power in future runs if desired):
import json

# get a dictionary of string:embedding pairs for all unique strings in either fuzzy column
embeddings = linker.embeddings_
# save these embeddings if desired
with open("/path/to/save/directory/embeddings.json", "w") as f:
    json.dump(embeddings, f, indent=4)
Similarly, you can save the LLM labels that were generated so they can be reused:
# get a dictionary of (str1, str2): 0/1
labels = linker.labels_
# save labels directly to a json file if desired
linker.save_labels("/path/to/save/directory/labels.json")
To use saved embeddings/labels:
import json

# load the saved embeddings from a file
with open("/path/to/save/directory/embeddings.json", "r") as file:
    STARTING_EMBEDDINGS = json.load(file)
# load the saved labels from a file
STARTING_LABELS = FuzzyLink.load_labels("/path/to/save/directory/labels.json")
linker = FuzzyLink(
    merge_on="name",
    block_on=["last", "place", "cntyname"],
    embedding_model="local",
    record_type="person",
    labeling_model="local",
    starting_embeddings=STARTING_EMBEDDINGS,
    starting_labels=STARTING_LABELS,
)
And, most importantly, you can use your trained FuzzyLink object on more data frames without going through this whole training process again:
matched_records2 = linker.match(
    dfA=data_frame3,
    dfB=data_frame4,
    embedding_api_key="YourUniqueKey",
    embedding_port=8080, # if running a local model
)
If you're curious, you can also see the threshold that was found to determine a match:
print(f"Records with a match probability greater than {linker.threshold_} are considered a match")
If you use this package in your research, please cite the original paper:
Ornstein JT. Probabilistic Record Linkage Using Pretrained Text Embeddings. Political Analysis. Published online 2025:1-12. doi:10.1017/pan.2025.10016
BibTeX:
@article{Ornstein_2025,
title={Probabilistic Record Linkage Using Pretrained Text Embeddings},
DOI={10.1017/pan.2025.10016},
journal={Political Analysis},
author={Ornstein, Joseph T.},
year={2025},
pages={1–12}
}
You can also print the citation from within Python:
FuzzyLink.cite()
The constructor that is called when creating an instance of the FuzzyLink class. This is where the majority of parameters used to dictate training are specified.
- `merge_on` - the column which is being fuzzy matched on. This column must exist in all data frames used.
- `embedding_model` - the specific model to be used to retrieve embeddings (if closed-source), or `Local` if using a locally running model (must specify port in `train()` and `match()`).
- `embedding_provider` - the provider for the embedding model. If using a locally running `embedding_model`, you can denote this as `Local` or omit it. Currently supported closed-source providers: `OPENAI`, `MISTRAL`, `GOOGLE`, and `GEMINI`.
- `labeling_model` - the specific LLM to be used to generate match labels (if closed-source), or `Local` if using a locally running model (must specify port in `train()` and `match()`).
- `labeling_provider` - the provider for the label generation model. If using a locally running `labeling_model`, you can denote this as `Local` or omit it. Currently supported closed-source providers: `OPENAI`, `MISTRAL`, `GOOGLE`, and `GEMINI`.
- `block_on` - a list of column names to block on. These columns must exist in all data frames used. While not required, blocking will significantly reduce run time and computational costs (including LLM usage costs).
- `embedding_dimension` - the dimensionality of the embedding vector used to calculate cosine similarity scores. Defaults to `256`. Supported dimensions are `128`, `256`, `512`, `768`, `1024`, `1536`, and `3072`.
- `record_type` - the type of record being matched (e.g. person, corporation, etc.). Defaults to "entity."
- `labeling_context` - any additional instructions to give the label generation LLM to help it identify whether record pairs are a match. This could be relevant context, data origins, matching instructions, etc. The prompt passed to the LLM to identify matches will be structured like:

f"""Decide if the following two names refer to the same {record_type}. {labeling_context}
Think carefully. Respond with 1 if they refer to the same {record_type} or with 0 if they do not.
Name A: {str1}. Name B: {str2}."""

- `starting_labels` - if you have pairs which you know are or are not a match (or labels saved from a previous run), you can provide these as starting labels here. These labels are in the format dict[tuple(str, str): 0/1] and will not count towards `max_labels`.
- `starting_embeddings` - if you have any starting embeddings, you can provide those here in the form dict[str: list[float]]. The length of each embedding must match `embedding_dimension`.
- `additional_operations` - any additional, user-defined functions to be used as input parameters to train the probabilistic model. By default, embedding cosine similarity (`param0`) and Jaro-Winkler string similarity (`param1`) are always computed. User-defined operations are appended as `param2`, `param3`, etc. These functions can operate on any columns from the left data frame (denoted `OriginalColumnName_dfA` if a shared column) and any columns from the right data frame (denoted `OriginalColumnName_dfB` if a shared column) and should output a single value. Here is an example of an additional similarity function:
def example_function(df):
    return df.apply(
        lambda row: max(row["name_dfA"], row["name_dfB"]), axis=1
    )
- `classification_model` - the probabilistic model that will be trained and used to predict matches. Defaults to a sklearn LogisticRegression model with `max_iter=1000`, `C=np.inf`, and `tol=1e-8`.
- `max_labels` - the total number of new labels that will be generated by the LLM. This can be useful for working within budget constraints (though the cost of embedding retrieval should be taken into account as well). Defaults to 5,000.
- `initial_train_size` - the number of record pairs to be labeled before beginning the training loop. Defaults to 500.
- `learning_batch_size` - the maximum number of record pairs to be labeled in each iteration of the training loop. Defaults to 100.
- `convergence_threshold` - the training loop will exit once the mean of the maximum match probability change over 5 training iterations no longer changes by more than this value. However, labeling will continue outside of the training loop until `max_labels` is reached. Defaults to 0.01.
- `sampling_distribution` - determines how new pairs are selected to add to the training data, specified as a continuous scipy.stats distribution. Defaults to a Gaussian kernel centered at 0.5 with a standard deviation of 0.2.
- `train_size` - the proportion of data to be used for training the probabilistic model. Defaults to 1.0 (i.e. using all the data for training and withholding none for testing). Set this lower if you wish to use a subset of the data for training.
- `random_seed` - the seed used for all points of randomness throughout this process; defaults to 42. While a seed is used for reproducibility whenever possible, there can still be some inherent randomness when working with models.
- `embedding_batch_size` - the number of strings that will be embedded in one call to the embedding model. Increasing this number will speed up embedding retrieval, but could cause crashes depending on model limits; check with your specific model/provider. Defaults to 100.
- `max_workers` - the number of concurrent requests that will be made to the labeling model. Increasing this number will speed up the labeling process, but could cause crashes depending on model usage limits; check with your specific model/provider. Defaults to 15 if using a local model, 10 if using an OpenAI model, and 5 otherwise.
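To see what an `additional_operations`-style function receives and returns, here is a hypothetical extra feature (absolute difference in name length, a name chosen for this sketch) applied to a toy pair table using the `_dfA`/`_dfB` suffix convention described above:

```python
import pandas as pd

# hypothetical additional_operations-style feature: absolute difference
# in string length between the two fuzzy-match columns
def name_length_diff(df):
    return df.apply(
        lambda row: abs(len(row["name_dfA"]) - len(row["name_dfB"])), axis=1
    )

pairs = pd.DataFrame({
    "name_dfA": ["Jon Smith", "A. Lee"],
    "name_dfB": ["Jonathan Smith", "Amy Lee"],
})
print(name_length_diff(pairs).tolist())  # [5, 1]
```

Each such function takes the candidate-pair data frame and returns one value per row, which becomes an extra training feature alongside the default similarity scores.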
The function which trains the probabilistic model while simultaneously merging two data frames.
- `dfA` - the "left" data frame.
- `dfB` - the "right" data frame.
- `how` - specifies how the two data frames will be merged. Options are: "inner" (just keep matched pairs), "outer" (matched pairs plus all records from `dfA` and `dfB`), "left" (matched pairs and all records from `dfA`), and "right" (matched pairs and all records from `dfB`). Defaults to an inner join.
- `embedding_api_key` - the API key to access the embedding model (if not set as an environment variable).
- `embedding_port` - if using a locally running embedding model, the port number on which the model is running. Defaults to 8080.
- `labeling_api_key` - the API key to access the labeling model (if not set as an environment variable).
- `labeling_port` - if using a locally running labeling LLM, the port number on which the model is running. Defaults to 8081.
- `include_details` - a flag to determine whether additional columns are kept in the final merged data frame. These include diagnostic columns like `_fl_exact_match`, `_fl_label`, `_fl_match_prob`, and similarity features. Defaults to `False`.
Uses the trained probabilistic model to merge two data frames.
- `dfA` - the "left" data frame.
- `dfB` - the "right" data frame.
- `how` - specifies how the two data frames will be merged. Options are: "inner" (just keep matched pairs), "outer" (matched pairs plus all records from `dfA` and `dfB`), "left" (matched pairs and all records from `dfA`), and "right" (matched pairs and all records from `dfB`). Defaults to an inner join.
- `embedding_api_key` - the API key to access the embedding model (if not set as an environment variable).
- `embedding_port` - if using a locally running embedding model, the port number on which the model is running. Defaults to 8080.
- `block_on` - a list of column names to block on. If none are provided but blocking columns were used when initializing the object/training, the original blocking columns will be used.
- `include_details` - a flag to determine whether additional columns are kept in the final merged data frame. These include diagnostic columns like `_fl_exact_match`, `_fl_label`, `_fl_match_prob`, and similarity features. Defaults to `False`.
This is an internally used function which creates the blocks for dfA and dfB. However, if you wish to explore potential blocking variables before running fuzzylink-py, you can call this function directly. Blocks are determined by the intersection of the unique blocking column values of dfA and dfB.
- `dfA` - the "left" data frame.
- `dfB` - the "right" data frame.
- `block_on` - a list of column names to block on.
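As a rough illustration of that intersection logic in plain pandas (the package's actual implementation may differ), blocking on "age" would produce one block per shared value:

```python
import pandas as pd

dfA = pd.DataFrame({"name": ["Jon", "Amy", "Bea"], "age": [30, 41, 55]})
dfB = pd.DataFrame({"name": ["Jonathan", "Amy L."], "age": [30, 41]})

# a block exists only for values present in BOTH frames' blocking column;
# age 55 appears only in dfA, so it forms no block
shared = sorted(set(dfA["age"]).intersection(dfB["age"]))
blocks = {v: (dfA[dfA["age"] == v], dfB[dfB["age"] == v]) for v in shared}
print(shared)  # [30, 41]
```

Only record pairs within the same block are ever compared, which is why good blocking columns cut both run time and LLM labeling costs.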
Creates an untrained copy of a FuzzyLink object with the same configuration. Any constructor parameter can be overridden by passing it as a keyword argument. The cloned object also inherits the original's cached embeddings and labels, so prior work is preserved while the trained classifier is not.
- `**overrides` - any `FuzzyLink.__init__()` parameter to change in the clone. Parameters not specified are copied from the original object.
# train a linker on one dataset
linker = FuzzyLink(merge_on="name", block_on=["state"], ...)
linker.train(dfA=df1, dfB=df2)
# create a copy with a different blocking column, reusing cached embeddings and labels
linker2 = linker.clone(block_on=["city"])
linker2.train(dfA=df3, dfB=df4)