thefuzz

Fuzzy String Matching in Python


Top Related Projects

jellyfish

2,143

🪼 a python library for doing approximate and phonetic matching of strings.

Quick Overview

TheFuzz is a Python library for fuzzy string matching: it scores how similar two strings are on a 0-100 scale using Levenshtein-distance-based measures. Typical uses include spell-checking, data cleaning, and record linkage.

Pros

Flexible and Customizable: The Fuzz library provides a wide range of functions and options for customizing the fuzzy matching process, allowing users to fine-tune the algorithm to their specific needs.
High Performance: The library delegates its scoring to rapidfuzz, a natively compiled string-matching library, which keeps it fast and efficient even when working with large datasets.
Extensive Documentation: The project has detailed documentation that covers the various functions and use cases, making it easy for developers to get started.
Active Development: The project is actively maintained, with regular updates and bug fixes, ensuring that it remains up-to-date and reliable.

Cons

Limited Language Support: The Fuzz library is primarily focused on English-language text, and may not perform as well with other languages or character sets.
Potential for Inaccurate Matches: Fuzzy string matching can sometimes produce unexpected or inaccurate results, especially for complex or ambiguous text.
Dependency on an External Library: TheFuzz depends on the rapidfuzz package for its scoring, which adds one third-party dependency to the installation. NumPy and SciPy are not required.
Lack of Parallelization: The library does not currently support parallel processing, which can limit its performance when working with very large datasets.
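
One common way to limit the inaccurate matches mentioned above is to reject any candidate whose score falls below a cutoff. The sketch below illustrates the idea with the standard library's difflib as a stand-in scorer; it is not TheFuzz's implementation, and the names `ratio` and `best_match` are hypothetical. TheFuzz's own `process.extractOne` accepts a `score_cutoff` argument for the same purpose.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in scorer (difflib, not TheFuzz): similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(query, choices, score_cutoff=80):
    """Return (choice, score) for the best candidate, or None if it scores below the cutoff."""
    choices = list(choices)
    if not choices:
        return None
    scored = ((choice, ratio(query, choice)) for choice in choices)
    choice, score = max(scored, key=lambda pair: pair[1])
    return (choice, score) if score >= score_cutoff else None


options = ["apple", "banana", "cherry"]
print(best_match("appl", options))  # ('apple', 89)
print(best_match("kiwi", options))  # None: nothing clears the cutoff
```

The right cutoff depends on the data; 80 here is an arbitrary illustrative value.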

Code Examples

Here are a few examples of how to use the Fuzz library:

from thefuzz import fuzz

# Compute the similarity between two strings (0-100 scale)
similarity = fuzz.ratio("hello", "world")
print(similarity)  # Output: 20

from thefuzz import process

# Find the best match for a given string in a list of options
options = ["apple", "banana", "cherry"]
best_match = process.extract("appl", options, limit=1)[0][0]
print(best_match)  # Output: 'apple'

from thefuzz import fuzz

# Use a custom scoring function: a weighted blend of two built-in scorers
def custom_scorer(s1, s2):
    return fuzz.ratio(s1, s2) * 0.8 + fuzz.partial_ratio(s1, s2) * 0.2

similarity = custom_scorer("hello", "world")
print(similarity)  # A low score (as a float), since the strings have little in common

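from thefuzz import fuzz, process

# Perform fuzzy string matching on a list of dictionaries.
# The scorers compare strings, so match against the "name" field
# and then look up the full record.
data = [
    {"name": "John Doe", "email": "john.doe@example.com"},
    {"name": "Jane Smith", "email": "jane.smith@example.com"},
    {"name": "Bob Johnson", "email": "bob.johnson@example.com"}
]

names = [record["name"] for record in data]
name, score = process.extractOne("John Doe", names, scorer=fuzz.token_sort_ratio)
record = data[names.index(name)]
print(name, score)  # Output: John Doe 100
print(record["email"])  # Output: john.doe@example.com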

Getting Started

To get started with the Fuzz library, you can install it using pip:

pip install thefuzz

Once installed, you can import the necessary modules and start using the library. Here's a simple example:

from thefuzz import fuzz

# Compute the similarity between two strings (0-100 scale)
similarity = fuzz.ratio("hello", "world")
print(similarity)  # Output: 20

For more advanced usage and customization, you can refer to the project's documentation.

Competitor Comparisons

fuzzywuzzy

9,254

Fuzzy String Matching in Python

Pros of FuzzyWuzzy

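FuzzyWuzzy is the original project from which TheFuzz was renamed, so existing code and tutorials written against its Levenshtein-based scorers (ratio, partial ratio, token sort, token set) apply to it directly. It does not itself provide Jaro-Winkler or Soundex.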
FuzzyWuzzy has better support for handling Unicode characters and non-ASCII text.
FuzzyWuzzy offers more customization options, such as the ability to set thresholds and weights for different matching algorithms.

Cons of FuzzyWuzzy

TheFuzz has a smaller codebase and may be more lightweight and efficient for simple use cases.
TheFuzz has a simpler API and may be easier to get started with for beginners.
FuzzyWuzzy may have a slightly higher learning curve due to its more extensive feature set.

Code Comparison

TheFuzz:

from thefuzz import fuzz

fuzz.ratio("hello", "world")  # 20
fuzz.partial_ratio("hello", "world")  # also a low score

FuzzyWuzzy:

from fuzzywuzzy import fuzz

fuzz.ratio("hello", "world")  # 20
fuzz.partial_ratio("hello", "world")  # also a low score

As you can see, the API for both libraries is very similar, with the main difference being the library name (TheFuzz vs FuzzyWuzzy).

jellyfish

2,143

🪼 a python library for doing approximate and phonetic matching of strings.

Pros of Jellyfish

Jellyfish supports a wider range of string similarity algorithms, including Levenshtein, Jaro-Winkler, and Soundex, among others.
Jellyfish has better performance and scalability compared to TheFuzz, especially for larger datasets.
Jellyfish provides more detailed and informative error messages, which can be helpful for debugging.

Cons of Jellyfish

TheFuzz has a more user-friendly and intuitive API, making it easier to get started with for some users.
TheFuzz has a larger and more active community, with more third-party integrations and resources available.
Jellyfish may have a steeper learning curve for users who are not familiar with the various string similarity algorithms it supports.

Code Comparison

TheFuzz:

from thefuzz import fuzz
fuzz.ratio("hello", "world")  # Output: 20

Jellyfish:

import jellyfish
jellyfish.levenshtein_distance("hello", "world")  # Output: 4


README

.. image:: https://github.com/seatgeek/thefuzz/actions/workflows/ci.yml/badge.svg
   :target: https://github.com/seatgeek/thefuzz

TheFuzz
=======

Fuzzy string matching like a boss. It uses `Levenshtein Distance <https://en.wikipedia.org/wiki/Levenshtein_distance>`_ to calculate the differences between sequences in a simple-to-use package.
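
For readers unfamiliar with the metric, here is a minimal pure-Python sketch of the classic Levenshtein distance (unit cost for insertion, deletion, and substitution). It is only an illustration of the concept: TheFuzz delegates its scoring to rapidfuzz, and the function name ``levenshtein`` is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Keeping only two rows of the dynamic-programming table holds memory proportional to the shorter string.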

Requirements
============

-  Python 3.8 or higher
-  `rapidfuzz <https://github.com/maxbachmann/RapidFuzz/>`_

For testing
~~~~~~~~~~~

-  pycodestyle
-  hypothesis
-  pytest

Installation
============

Using pip via PyPI
~~~~~~~~~~~~~~~~~~

.. code:: bash

    pip install thefuzz


Using pip via GitHub
~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    pip install git+https://github.com/seatgeek/thefuzz.git@0.19.0#egg=thefuzz

Adding to your ``requirements.txt`` file (run ``pip install -r requirements.txt`` afterwards)

.. code:: bash

    git+ssh://git@github.com/seatgeek/thefuzz.git@0.19.0#egg=thefuzz

Manually via Git
~~~~~~~~~~~~~~~~

.. code:: bash

    git clone https://github.com/seatgeek/thefuzz.git thefuzz
    cd thefuzz
    pip install .


Usage
=====

.. code:: python

    >>> from thefuzz import fuzz
    >>> from thefuzz import process

Simple Ratio
~~~~~~~~~~~~

.. code:: python

    >>> fuzz.ratio("this is a test", "this is a test!")
    97
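
The 97 above is consistent with a normalized similarity of the form 2·M / (len(a) + len(b)), where M is the length of the longest common subsequence: here 2·14 / 29 ≈ 0.966. The following is a sketch under that assumption, not the library's code (which delegates to rapidfuzz); ``lcs_length`` and ``simple_ratio`` are hypothetical names.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def simple_ratio(a: str, b: str) -> int:
    """Similarity on a 0-100 scale: 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100  # sketch convention: two empty strings count as identical
    return round(100 * 2 * lcs_length(a, b) / total)


print(simple_ratio("this is a test", "this is a test!"))  # 97
print(simple_ratio("hello", "world"))                     # 20
```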

Partial Ratio
~~~~~~~~~~~~~

.. code:: python

    >>> fuzz.partial_ratio("this is a test", "this is a test!")
    100
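
A partial ratio can be pictured as sliding the shorter string along the longer one and keeping the best window score. The sketch below is a deliberately simplified version of that idea, using difflib as a stand-in scorer; real implementations choose candidate windows more cleverly, and ``ratio`` and ``partial_ratio`` here are hypothetical helpers rather than the library's functions.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in scorer (difflib, not TheFuzz): similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against any equally long window of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0  # sketch convention for an empty input
    width = len(shorter)
    return max(
        ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


print(partial_ratio("this is a test", "this is a test!"))  # 100
```

The trailing ``!`` no longer costs anything because the best window excludes it.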

Token Sort Ratio
~~~~~~~~~~~~~~~~

.. code:: python

    >>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    91
    >>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100
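
The idea behind token sorting is to split each string into words, sort the words, and compare the re-joined strings, so word order stops mattering. A minimal sketch, assuming simple lowercase-and-split tokenization (the library's own preprocessing is more thorough) and difflib as a stand-in scorer; the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in scorer (difflib, not TheFuzz): similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """Compare the strings after sorting their lowercase words alphabetically."""
    def normalize(text: str) -> str:
        return " ".join(sorted(text.lower().split()))
    return ratio(normalize(a), normalize(b))


s1 = "fuzzy wuzzy was a bear"
s2 = "wuzzy fuzzy was a bear"
print(token_sort_ratio(s1, s2))  # 100: both normalize to "a bear fuzzy was wuzzy"
```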

Token Set Ratio
~~~~~~~~~~~~~~~

.. code:: python

    >>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    84
    >>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    100
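
Token set comparison treats each string as a set of words, so repeated words collapse. The sketch below follows the general approach of the original fuzzywuzzy algorithm as I understand it: compare the shared words against each string's shared-plus-unique words and keep the best score. Treat the details as assumptions; the tokenization is simplified, difflib stands in for the scorer, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in scorer (difflib, not TheFuzz): similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio(a: str, b: str) -> int:
    """Best score among comparisons built from shared and unshared word sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = f"{common} {only_a}".strip()
    combined_b = f"{common} {only_b}".strip()
    return max(
        ratio(common, combined_a),
        ratio(common, combined_b),
        ratio(combined_a, combined_b),
    )


print(token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100
```

The duplicate "fuzzy" disappears once the words are placed in a set, which is why the two inputs score as identical.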

Partial Token Sort Ratio
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    >>> fuzz.token_sort_ratio("fuzzy was a bear", "wuzzy fuzzy was a bear")
    84
    >>> fuzz.partial_token_sort_ratio("fuzzy was a bear", "wuzzy fuzzy was a bear")
    100

Process
~~~~~~~

.. code:: python

    >>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
    >>> process.extract("new york jets", choices, limit=2)
    [('New York Jets', 100), ('New York Giants', 78)]
    >>> process.extractOne("cowboys", choices)
    ('Dallas Cowboys', 90)
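
Conceptually, extraction scores every choice against the query, sorts by score, and returns the top results. The sketch below shows that loop with difflib as a stand-in scorer, so its scores will not match the library's default scorer exactly (only the ranking idea carries over); ``ratio`` and ``extract`` are hypothetical names, not the ``process`` module's functions.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in scorer (difflib, not TheFuzz): case-insensitive similarity, 0-100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def extract(query, choices, scorer=ratio, limit=5):
    """Return up to `limit` (choice, score) pairs, best score first."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
top_two = extract("new york jets", choices, limit=2)
print(top_two[0])  # ('New York Jets', 100)
```

Passing a different ``scorer`` callable changes the ranking criterion without touching the loop.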

You can also pass additional parameters to the ``extractOne`` method to make it use a specific scorer. A typical use case is to match file paths (here ``songs`` is assumed to be a list of file paths defined elsewhere):

.. code:: python

    >>> process.extractOne("System of a down - Hypnotize - Heroin", songs)
    ('/music/library/good/System of a Down/2005 - Hypnotize/01 - Attack.mp3', 86)
    >>> process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio)
    ("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61)

