jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.


Quick Overview

Jellyfish is a Python library for approximate and phonetic matching of strings. It provides string comparison functions such as Levenshtein distance and Jaro-Winkler similarity, along with phonetic encodings such as Soundex and Metaphone. It is designed to be fast and easy to use, which makes it a useful tool for tasks like spell-checking, data cleaning, and record linkage.

Pros

Fast and Efficient: Jellyfish implements its algorithms in a compiled language (C in older releases, Rust in recent ones) behind a Python interface, making it significantly faster than pure Python implementations of the same algorithms.
Comprehensive Functionality: The library includes a wide range of distance metrics and string comparison functions, covering a variety of use cases.
Easy to Use: Jellyfish has a simple and intuitive API, making it easy to integrate into existing projects.
Well-Documented: The project has detailed documentation, including examples and usage guides, making it easy to get started.

Cons

Limited to String Comparisons: Jellyfish is focused solely on string distance metrics and does not provide any other functionality beyond that.
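Dependency on Compiled Code: The compiled implementation provides the performance benefits, but on a platform without a prebuilt package, installation requires a build toolchain, which may be a barrier for some users.
Potential Compatibility Issues: Because it ships a compiled extension, Jellyfish may not be immediately available for new Python versions or platforms, and some functions have been renamed between releases (for example, jaro_distance became jaro_similarity).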
Limited Customization: The library provides a fixed set of distance metrics and does not allow for easy customization or extension of the available functions.

Code Examples

Here are a few examples of how to use Jellyfish:

import jellyfish

# Calculate the Levenshtein distance between two strings
distance = jellyfish.levenshtein_distance("hello", "world")
print(distance)  # Output: 4

# Calculate the Jaro-Winkler similarity between two strings
# (older releases exposed this function as jaro_winkler)
similarity = jellyfish.jaro_winkler_similarity("John Smith", "Jon Smythe")
print(similarity)  # Output: approximately 0.893

# Soundex encoding of a string
soundex = jellyfish.soundex("Jellyfish")
print(soundex)  # Output: "J412"

Getting Started

To get started with Jellyfish, you can install it using pip:

pip install jellyfish

Once installed, you can import the library and start using its functions. Here's an example of how to use the levenshtein_distance function:

import jellyfish

word1 = "hello"
word2 = "world"
distance = jellyfish.levenshtein_distance(word1, word2)
print(f"The Levenshtein distance between '{word1}' and '{word2}' is {distance}")

This will output:

The Levenshtein distance between 'hello' and 'world' is 4

You can find more examples and documentation in the Jellyfish GitHub repository.
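
For the data-cleaning and spell-checking uses mentioned in the overview, a distance function is typically wrapped in a small "closest match" helper. The following is a minimal sketch under stated assumptions: edit_distance is a pure-Python stand-in written for illustration (not Jellyfish's implementation), closest_match and its max_distance parameter are hypothetical names, and the city list is made-up sample data. With Jellyfish installed, jellyfish.levenshtein_distance could be passed as the distance argument instead.

```python
from typing import Callable, Iterable, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program.

    Illustrative stand-in only; not the library's implementation.
    """
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def closest_match(
    query: str,
    candidates: Iterable[str],
    distance: Callable[[str, str], int] = edit_distance,
    max_distance: int = 2,  # hypothetical cutoff; tune for your data
) -> Optional[Tuple[str, int]]:
    """Return (candidate, distance) for the nearest candidate, or None."""
    best: Optional[Tuple[str, int]] = None
    for candidate in candidates:
        d = distance(query.lower(), candidate.lower())
        if d <= max_distance and (best is None or d < best[1]):
            best = (candidate, d)
    return best


cities = ["Boston", "Houston", "Austin", "Dallas"]  # made-up sample data
print(closest_match("Bostan", cities))   # ('Boston', 1)
print(closest_match("Chicago", cities))  # None
```

The cutoff keeps unrelated strings from being "corrected" to whatever happens to be nearest; a suitable value depends on the length and noise level of the data.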

Competitor Comparisons

fastText

26,183

Library for fast text representation and classification.

Pros of fastText

fastText is a highly efficient and scalable library for text representation and classification, capable of handling large-scale datasets.
It provides pre-trained word vectors for a variety of languages, which can be used for various NLP tasks without the need for extensive training.
fastText supports a wide range of applications, including text classification, word analogies, and sentence representation.

Cons of fastText

fastText addresses a different problem than Jellyfish: it learns word representations and text classifiers, and does not provide edit-distance metrics or phonetic encodings for comparing individual strings.
The documentation and community support for fastText may not be as extensive as for some other popular NLP libraries.

Code Comparison

Jellyfish (jamesturk/jellyfish):

from jellyfish import jaro_winkler_similarity
jaro_winkler_similarity('jellyfish', 'smellyfish')
# Output: approximately 0.896 (no shared prefix, so this equals the plain Jaro similarity)

fastText (facebookresearch/fastText):

import fasttext
model = fasttext.load_model('cc.en.300.bin')
model.get_word_vector('dog')
# Output: array([-0.0235,  0.0493, -0.0266, ...,  0.0249, -0.0408,  0.0481], dtype=float32)

fuzzywuzzy

9,254

Fuzzy String Matching in Python

Pros of FuzzyWuzzy

Flexible Matching Algorithms: FuzzyWuzzy provides several scorers built on Levenshtein-style sequence matching, including simple ratio, partial ratio, token sort ratio, and token set ratio, which handle reordered or partially matching strings.
Extensive Documentation: The FuzzyWuzzy project has detailed documentation, including usage examples and explanations of the different matching techniques.
Large User Base: FuzzyWuzzy has been widely used, so examples and answers are easy to find. The project has since been continued under the name thefuzz.

Cons of FuzzyWuzzy

Optional Speedup Dependency: FuzzyWuzzy falls back to difflib from the Python standard library, which is always available but slow; good performance requires installing an additional Levenshtein package.
Potentially Slower Performance: FuzzyWuzzy's token-based scorers do more work per comparison than a single distance calculation in Jellyfish, which matters for large datasets.
Limited Functionality: FuzzyWuzzy focuses on similarity scores and does not provide phonetic encodings such as Soundex, Metaphone, or NYSIIS, which Jellyfish includes.

Code Comparison

Jellyfish:

from jellyfish import jaro_similarity  # named jaro_distance in older releases

print(jaro_similarity("jellyfish", "seallyfish"))  # Output: approximately 0.896

FuzzyWuzzy:

from fuzzywuzzy import fuzz

print(fuzz.ratio("jellyfish", "seallyfish"))  # Output: 84

Both libraries compare strings, but the algorithms and output formats differ: Jellyfish returns raw distances or similarities between 0 and 1 and also offers phonetic encodings, while FuzzyWuzzy returns integer scores from 0 to 100 and offers token-based scorers.
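
To make the 0 to 100 score concrete, here is a minimal sketch of a ratio-style scorer built on difflib from the Python standard library. It assumes the score is twice the number of matched characters divided by the combined length, scaled and rounded; ratio_score is a hypothetical name, and this is an illustration of the idea rather than FuzzyWuzzy's exact code.

```python
from difflib import SequenceMatcher


def ratio_score(a: str, b: str) -> int:
    """Similarity as an integer from 0 to 100 (higher means more similar)."""
    # SequenceMatcher.ratio() returns 2 * M / T, where M is the number of
    # matched characters and T is the total length of both strings.
    matcher = SequenceMatcher(None, a, b)
    return int(round(100 * matcher.ratio()))


# "llyfish" (7 characters) plus "e" (1) match: 2 * 8 / 19 is about 0.842.
print(ratio_score("jellyfish", "seallyfish"))  # 84
```

Unlike an edit distance, this score is normalized by length, so values are comparable across string pairs of different sizes.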

thefuzz

3,299

Fuzzy String Matching in Python

Pros of Thefuzz

Thefuzz provides several Levenshtein-based scorers (simple, partial, token sort, and token set ratios) that tolerate reordered or partially matching strings.
The library has a larger user base, with more contributors and a higher number of stars on GitHub.
Thefuzz offers a small API: scorers return similarity scores from 0 to 100, and a helper module selects the best matches from a list of choices.

Cons of Thefuzz

Thefuzz has no phonetic encodings (Soundex, Metaphone, NYSIIS) and no Jaro or Jaro-Winkler metrics, all of which Jellyfish includes.
Thefuzz is the continuation of the fuzzywuzzy project and relies on a separate library for its edit-distance calculations, which adds a dependency; Jellyfish is a single self-contained package.
Thefuzz's token-based scorers may be slower than a single distance calculation in Jellyfish for certain string comparison tasks.

Code Comparison

Jellyfish:

from jellyfish import levenshtein_distance
levenshtein_distance("hello", "world")  # Output: 4

Thefuzz:

from thefuzz import fuzz
fuzz.ratio("hello", "world")  # Returns a similarity score from 0 to 100, not an edit distance

The two calls are not equivalent: Jellyfish returns the raw edit distance, where lower means more similar, while Thefuzz returns a normalized similarity score, where higher means more similar.

README

Overview

jellyfish is a library for approximate & phonetic matching of strings.

Source: https://github.com/jamesturk/jellyfish

Documentation: https://jamesturk.github.io/jellyfish/

Issues: https://github.com/jamesturk/jellyfish/issues

Included Algorithms

String comparison:

Levenshtein Distance
Damerau-Levenshtein Distance
Jaccard Index
Jaro Distance
Jaro-Winkler Distance
Match Rating Approach Comparison
Hamming Distance
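
As an illustration of how one of these metrics works, the following is a minimal pure-Python sketch of the standard Jaro similarity definition. It is written for explanation only and is not the library's implementation; the matching window of half the longer length minus one is the textbook choice and is assumed here.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if equal and no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        low = max(0, i - window)
        high = min(len2, i + window + 1)
        for j in range(low, high):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order,
    # counted in pairs.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


print(round(jaro_similarity("jellyfish", "smellyfish"), 4))  # 0.8963
print(round(jaro_similarity("MARTHA", "MARHTA"), 4))         # 0.9444
```

Jaro-Winkler builds on this value by adding a bonus for a shared prefix, which is why the two scores coincide when the strings start with different letters.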

Phonetic encoding:

American Soundex
Metaphone
NYSIIS (New York State Identification and Intelligence System)
Match Rating Codex
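
To show what a phonetic encoding does, here is a minimal sketch of the American Soundex rules in pure Python. It is an illustration under stated assumptions (ASCII names, the common rule that H and W do not separate equal codes) and is not the library's implementation; the soundex function and CODES table below are written for this example.

```python
# Consonant groups that share a Soundex digit.
CODES = {}
for digit, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1
):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name: str) -> str:
    """Encode a name as one letter plus three digits, e.g. 'R163'."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    previous = CODES.get(letters[0], "")  # the first letter's code counts too
    for char in letters[1:]:
        code = CODES.get(char, "")
        if code:
            if code != previous:  # adjacent equal codes collapse into one
                result += code
            previous = code
        elif char not in "hw":
            previous = ""  # a vowel separates repeats; h and w do not
        if len(result) == 4:
            break
    return result.ljust(4, "0")  # pad short codes with zeros


print(soundex("Jellyfish"))  # J412
print(soundex("Robert"))     # R163
print(soundex("Rupert"))     # R163
```

Names that sound alike, such as Robert and Rupert, receive the same code, which is what makes phonetic encodings useful as blocking keys in record linkage.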

Example Usage

>>> import jellyfish
>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
2
>>> jellyfish.jaro_similarity('jellyfish', 'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance('jellyfish', 'jellyfihs')
1

>>> jellyfish.metaphone('Jellyfish')
'JLFX'
>>> jellyfish.soundex('Jellyfish')
'J412'
>>> jellyfish.nysiis('Jellyfish')
'JALYF'
>>> jellyfish.match_rating_codex('Jellyfish')
'JLLFSH'
