This package is a Python implementation of the R fuzzylink package, which implements the probabilistic record linkage procedure described in Ornstein (2025). At its core, fuzzylink-py merges two pandas data frames; unlike traditional join methods, however, it supports "fuzzy matching" on one field.
You can install fuzzylink-py using pip by running:
pip install fuzzylink-py
Using fuzzylink-py requires two models: the first acquires text embeddings to capture the semantic meaning of your fuzzy strings, and the second generates Y/N labels to identify matches. Fuzzylink-py supports locally running OpenAI-compatible models with valid port numbers (and API keys if needed). However, if you wish to use closed-source models, you will need to set up API access.
Sign up for a developer account with OpenAI and create an API key if you do not already have one. OpenAI requires you to purchase credits before use; $5 is a safe starting amount. Once you have your API key, you can either pass it as a parameter when running fuzzylink-py, or enter the following in your terminal:
export OPENAI_API_KEY="EnterYourUniqueKeyHere"
You can also use models from Mistral; set up a developer account and generate an API key. If you do not wish to pass this key to every call to fuzzylink-py, enter the following in your terminal:
export MISTRAL_API_KEY="EnterYourUniqueKeyHere"
Google/Gemini models are also supported. Once you have a developer account set up, create your API key. Again, you can enter the following command in your terminal if you do not wish to pass the key explicitly in your Python code:
export GOOGLE_API_KEY="EnterYourUniqueKeyHere"
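When a key is exported this way, fuzzylink-py picks it up from the environment instead of requiring an explicit argument. As a quick sanity check (plain standard library, nothing package-specific), you can confirm from Python that the variables are visible:

```python
import os

# check which provider keys are visible to Python; fuzzylink-py falls back
# to these environment variables when no explicit api key argument is passed
for var in ("OPENAI_API_KEY", "MISTRAL_API_KEY", "GOOGLE_API_KEY"):
    print(var, "is set" if os.environ.get(var) else "is NOT set")
```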
Let's look at an example where we want to merge two hypothetical data frames (data_frame1 and data_frame2) containing information about people; we want to block on age and fuzzy match on name.
Assuming you are running OpenAI models for embeddings and label generation, this is the bare-minimum code you would need:
from fuzzylink import FuzzyLink
# create a FuzzyLink instance
linker = FuzzyLink(
    merge_on="name", # column to fuzzy match on
    block_on=["age"], # column(s) to block on
    record_type="person", # the type of record being matched
    embedding_model="text-embedding-3-small", # specific embedding model to use
    embedding_provider="OpenAI", # the provider for the embedding model
    labeling_model="o4-mini", # the specific LLM to use for generating labels
    labeling_provider="OpenAI" # the provider for the labeling model
)
# match the records from data_frame1 with those from data_frame2 (and train a reusable model)
matched_records = linker.train(
    dfA=data_frame1, # the left data frame
    dfB=data_frame2, # the right data frame
    embedding_api_key="YourUniqueKey", # the OpenAI API key if not set as an environment variable
    labeling_api_key="YourUniqueKey" # the OpenAI API key if not set as an environment variable
)
You could achieve the same thing using locally running models like so:
from fuzzylink import FuzzyLink
# create a FuzzyLink instance
linker = FuzzyLink(
    merge_on="name", # column to fuzzy match on
    block_on=["age"], # column(s) to block on
    record_type="person", # the type of record being matched
    embedding_model="Local", # use a locally running model for embeddings
    labeling_model="Local", # use a locally running model for label generation
)
# match the records from data_frame1 with those from data_frame2 (and train a reusable model for this data)
matched_records = linker.train(
    dfA=data_frame1, # the left data frame
    dfB=data_frame2, # the right data frame
    embedding_api_key="YourUniqueKey", # the key for the locally running model (if needed)
    embedding_port=8080,
    labeling_api_key="YourUniqueKey", # the key for the locally running model (if needed)
    labeling_port=8081
)
So, what did this do?
When we initialize the FuzzyLink instance, we can make some important decisions about how our probabilistic model will be trained and how our data frames will be merged.
- `merge_on="name"` establishes the column which will be fuzzy matched on (this column must exist in any data frames that this FuzzyLink object operates on)
- `block_on=["age"]` indicates that the data frames should be blocked by the "age" column. While optional, establishing good blocking columns will significantly speed up this procedure and is highly recommended.
- `record_type="person"` provides additional context to the LLM about what the entities being matched are, to help improve label generation accuracy
- `embedding_model` and `labeling_model` (along with `embedding_provider` and `labeling_provider`) specify which models are used for acquiring embeddings and generating labels to train the probabilistic model
By specifying just these parameters, we accept some default behavior (more details about parameters can be found in the Advanced Customization and Additional Functions section), including:
- The probabilistic model used will be a sklearn LogisticRegression model with `max_iter=1000`, `C=np.inf`, and `tol=1e-8` (can be set with `classification_model`)
- The similarity scores used to train the probabilistic model will be cosine similarity of the fuzzy string embeddings and Jaro-Winkler string similarity (additional scores can be added with `additional_operations`)
- The dimensionality of the embeddings will be 256 (can be set with `embedding_dimension`)
- A total of 5000 labels will be generated using the `labeling_model` (can be set with `max_labels`)
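For intuition about the first default feature, cosine similarity between two embedding vectors can be sketched with plain numpy. This mirrors the idea, not the package's internal code:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# orthogonal-ish toy vectors share one of two active components
print(round(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]), 3))  # 0.5
```

Strings with similar meanings receive embeddings with high cosine similarity, which is what makes this a useful training feature for the classifier.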
After this instance has been created (and saved as linker), we can use it. Calling linker.train() accomplishes two main goals: (1) it trains a probabilistic model to identify matches for this specific type of data, and in the process (2) it merges the two input data frames as specified. The parameters passed to this function call specify the data frames and how to connect to the models. Without specifying anything further here, the following defaults are kept:
- The join type between the left and right input data frames will be an inner join (only matched records will be kept; can be set with `how`)
- The only columns in the final, merged data frame will be those from the original input data frames; columns containing additional information (i.e. cosine similarity scores, match probabilities, etc.) will be discarded (can be changed by setting `include_details=True`)
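The join semantics here are the same as in an ordinary pandas merge. A toy exact-match illustration using plain pandas (not fuzzylink-py itself) shows the difference between the default inner join and a left join:

```python
import pandas as pd

left = pd.DataFrame({"name": ["Jon", "Amy"], "age": [30, 41]})
right = pd.DataFrame({"name": ["Jon", "Bea"], "city": ["NYC", "LA"]})

# "inner" keeps only matched rows; "left" keeps every row from the left frame
inner = left.merge(right, on="name", how="inner")
left_join = left.merge(right, on="name", how="left")
print(len(inner), len(left_join))  # 1 2
```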
So now, you can do whatever you want with matched_records! Additionally, you can save the generated embeddings (which can be used to save compute power in future runs if desired):
import json

# get a dictionary of string:embedding pairs for all unique strings in either fuzzy column
embeddings = linker.embeddings_
# save these embeddings if desired
with open("/path/to/save/directory/embeddings.json", "w") as f:
    json.dump(embeddings, f, indent=4)
Similarly, you can save the LLM labels that were generated so they can be reused:
# get a dictionary of (str1, str2): 0/1
labels = linker.labels_
# save labels directly to a json file if desired
linker.save_labels("/path/to/save/directory/labels.json")
To use saved embeddings/labels:
import json

# load the saved embeddings from a file
with open("/path/to/save/directory/embeddings.json", "r") as file:
    STARTING_EMBEDDINGS = json.load(file)
# load the saved labels from a file
STARTING_LABELS = FuzzyLink.load_labels("/path/to/save/directory/labels.json")
linker = FuzzyLink(
    merge_on="name",
    block_on=["last", "place", "cntyname"],
    embedding_model="local",
    record_type="person",
    labeling_model="local",
    starting_embeddings=STARTING_EMBEDDINGS,
    starting_labels=STARTING_LABELS,
)
And, most importantly, you can use your trained FuzzyLink object on more data frames without going through this whole training process again:
matched_records2 = linker.match(
    dfA=data_frame3,
    dfB=data_frame4,
    embedding_api_key="YourUniqueKey",
    embedding_port=8080, # if running a local model
)
If you're curious, you can also see the threshold that was found to determine a match:
print(f"Records with a match probability greater than {linker.threshold_} are considered a match")
If you use this package in your research, please cite the original paper:
Ornstein JT. Probabilistic Record Linkage Using Pretrained Text Embeddings. Political Analysis. Published online 2025:1-12. doi:10.1017/pan.2025.10016
BibTeX:
@article{Ornstein_2025,
title={Probabilistic Record Linkage Using Pretrained Text Embeddings},
DOI={10.1017/pan.2025.10016},
journal={Political Analysis},
author={Ornstein, Joseph T.},
year={2025},
pages={1–12}
}
You can also print the citation from within Python:
FuzzyLink.cite()
The constructor that is called when creating an instance of the FuzzyLink class. This is where the majority of parameters used to dictate training are specified.
- `merge_on` - the column which is being fuzzy matched on. This column must exist in all data frames used.
- `embedding_model` - the specific model to be used to retrieve embeddings (if closed-source), or `Local` if using a locally running model (must specify port in `train()` and `match()`).
- `embedding_provider` - the provider for the embedding model. If using a locally running `embedding_model`, you can denote this as `Local` or omit it. Currently supported closed-source providers: `OPENAI`, `MISTRAL`, `GOOGLE`, and `GEMINI`.
- `labeling_model` - the specific LLM to be used to generate match labels (if closed-source), or `Local` if using a locally running model (must specify port in `train()` and `match()`).
- `labeling_provider` - the provider for the label generation model. If using a locally running `labeling_model`, you can denote this as `Local` or omit it. Currently supported closed-source providers: `OPENAI`, `MISTRAL`, `GOOGLE`, and `GEMINI`.
- `block_on` - a list of column names to block on. These columns must exist in all data frames used. While not required, blocking will significantly reduce run time and computational costs (including LLM usage costs).
- `embedding_dimension` - the dimensionality of the embedding vector used to calculate cosine similarity scores. Defaults to `256`. Supported dimensions are `128`, `256`, `512`, `768`, `1024`, `1536`, and `3072`.
- `record_type` - the type of record being matched (e.g. person, corporation, etc.). Defaults to "entity."
- `labeling_context` - any additional instructions to give the label generation LLM to help it identify whether record pairs are a match. This could be relevant context, data origins, matching instructions, etc. The prompt passed to the LLM to identify matches will be structured like:

f"""Decide if the following two names refer to the same {record_type}. {labeling_context}
Think carefully. Respond with 1 if they refer to the same {record_type} or with 0 if they do not.
Name A: {str1}. Name B: {str2}."""

- `starting_labels` - if you have pairs which you know are or are not a match (or labels saved from a previous run), you can provide these as starting labels here. These labels are in the format dict[tuple(str, str): 0/1] and will not count towards `max_labels`.
- `starting_embeddings` - if you have any starting embeddings, you can provide those here in the form dict[str: list[float]]. The length of each embedding must match `embedding_dimension`.
- `additional_operations` - any additional, user-defined functions to be used as input parameters to train the probabilistic model. By default, embedding cosine similarity (`param0`) and Jaro-Winkler string similarity (`param1`) are always computed. User-defined operations are appended as `param2`, `param3`, etc. These functions can operate on any columns from the left data frame (denoted `OriginalColumnName_dfA` if a shared column) and any columns from the right data frame (denoted `OriginalColumnName_dfB` if a shared column) and should output a single value. Here is an example of an additional similarity function:
def example_function(df):
    return df.apply(
        lambda row: max(row["name_dfA"], row["name_dfB"]), axis=1
    )
- `classification_model` - the probabilistic model that will be trained and used to predict matches. Defaults to a sklearn LogisticRegression model with `max_iter=1000`, `C=np.inf`, and `tol=1e-8`.
- `max_labels` - the total number of new labels that will be generated by the LLM. This can be useful for working within budget constraints (though the cost of embedding retrieval should be taken into account as well). Defaults to 5,000.
- `initial_train_size` - the number of record pairs to be labeled before beginning the training loop. Defaults to 500.
- `learning_batch_size` - the maximum number of record pairs to be labeled in each iteration of the training loop. Defaults to 100.
- `convergence_threshold` - the training loop will exit once the mean of the maximum match probability change over 5 training iterations no longer changes by more than this value. However, labeling will continue outside of the training loop until `max_labels` is reached. Defaults to 0.01.
- `sampling_distribution` - determines how new pairs are selected to add to the training data, specified as a continuous scipy.stats distribution. Defaults to a Gaussian kernel centered at 0.5 with a standard deviation of 0.2.
- `train_size` - the proportion of data to be used for training the probabilistic model. Defaults to 1.0 (i.e. using all the data for training and withholding none for testing). Set this lower if you wish to use a subset of the data for training.
- `random_seed` - the seed used for all points of randomness throughout this process; defaults to 42. While a seed is used for reproducibility whenever possible, there can still be some inherent randomness when working with models.
- `embedding_batch_size` - the number of strings that will be embedded in one call to the embedding model. Increasing this number will speed up embedding retrieval, but could cause crashes depending on model limits; check with your specific model/provider. Defaults to 100.
- `max_workers` - the number of concurrent requests that will be made to the labeling model. Increasing this number will speed up the labeling process, but could cause crashes depending on model usage limits; check with your specific model/provider. Defaults to 15 if using a local model, 10 if using an OpenAI model, and 5 otherwise.
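To see what an `additional_operations`-style function receives and returns, here is a hypothetical extra feature (absolute difference in name length, a name chosen for this sketch) applied to a toy pair table using the `_dfA`/`_dfB` suffix convention described above:

```python
import pandas as pd

# hypothetical additional_operations-style feature: absolute difference
# in string length between the two fuzzy-match columns
def name_length_diff(df):
    return df.apply(
        lambda row: abs(len(row["name_dfA"]) - len(row["name_dfB"])), axis=1
    )

pairs = pd.DataFrame({
    "name_dfA": ["Jon Smith", "A. Lee"],
    "name_dfB": ["Jonathan Smith", "Amy Lee"],
})
print(name_length_diff(pairs).tolist())  # [5, 1]
```

Each such function takes the candidate-pair data frame and returns one value per row, which becomes an extra training feature alongside the default similarity scores.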
The function which trains the probabilistic model while simultaneously merging two data frames.
- `dfA` - the "left" data frame.
- `dfB` - the "right" data frame.
- `how` - specifies how the two data frames will be merged. Options are: "inner" (just keep matched pairs), "outer" (matched pairs plus all records from `dfA` and `dfB`), "left" (matched pairs and all records from `dfA`), and "right" (matched pairs and all records from `dfB`). Defaults to an inner join.
- `embedding_api_key` - the API key to access the embedding model (if not set as an environment variable).
- `embedding_port` - if using a locally running embedding model, the port number on which the model is running. Defaults to 8080.
- `labeling_api_key` - the API key to access the labeling model (if not set as an environment variable).
- `labeling_port` - if using a locally running labeling LLM, the port number on which the model is running. Defaults to 8081.
- `include_details` - a flag to determine whether additional columns are kept in the final merged data frame. These include diagnostic columns like `_fl_exact_match`, `_fl_label`, `_fl_match_prob`, and similarity features. Defaults to `False`.
Uses the trained probabilistic model to merge two data frames.
- `dfA` - the "left" data frame.
- `dfB` - the "right" data frame.
- `how` - specifies how the two data frames will be merged. Options are: "inner" (just keep matched pairs), "outer" (matched pairs plus all records from `dfA` and `dfB`), "left" (matched pairs and all records from `dfA`), and "right" (matched pairs and all records from `dfB`). Defaults to an inner join.
- `embedding_api_key` - the API key to access the embedding model (if not set as an environment variable).
- `embedding_port` - if using a locally running embedding model, the port number on which the model is running. Defaults to 8080.
- `block_on` - a list of column names to block on. If none are provided but blocking columns were used when initializing the object/training, the original blocking columns will be used.
- `include_details` - a flag to determine whether additional columns are kept in the final merged data frame. These include diagnostic columns like `_fl_exact_match`, `_fl_label`, `_fl_match_prob`, and similarity features. Defaults to `False`.
This is an internally used function which creates the blocks for dfA and dfB. However, if you wish to explore potential blocking variables before running fuzzylink-py, you can call this function directly. Blocks are determined by the intersection of the unique blocking column values of dfA and dfB.
- `dfA` - the "left" data frame.
- `dfB` - the "right" data frame.
- `block_on` - a list of column names to block on.
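As a rough illustration of that intersection logic in plain pandas (the package's actual implementation may differ), blocking on "age" would produce one block per shared value:

```python
import pandas as pd

dfA = pd.DataFrame({"name": ["Jon", "Amy", "Bea"], "age": [30, 41, 55]})
dfB = pd.DataFrame({"name": ["Jonathan", "Amy L."], "age": [30, 41]})

# a block exists only for values present in BOTH frames' blocking column;
# age 55 appears only in dfA, so it forms no block
shared = sorted(set(dfA["age"]).intersection(dfB["age"]))
blocks = {v: (dfA[dfA["age"] == v], dfB[dfB["age"] == v]) for v in shared}
print(shared)  # [30, 41]
```

Only record pairs within the same block are ever compared, which is why good blocking columns cut both run time and LLM labeling costs.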
Creates an untrained copy of a FuzzyLink object with the same configuration. Any constructor parameter can be overridden by passing it as a keyword argument. The cloned object also inherits the original's cached embeddings and labels, so prior work is preserved while the trained classifier is not.
- `**overrides` - any `FuzzyLink.__init__()` parameter to change in the clone. Parameters not specified are copied from the original object.
# train a linker on one dataset
linker = FuzzyLink(merge_on="name", block_on=["state"], ...)
linker.train(dfA=df1, dfB=df2)
# create a copy with a different blocking column, reusing cached embeddings and labels
linker2 = linker.clone(block_on=["city"])
linker2.train(dfA=df3, dfB=df4)