Top Related Projects
dupeGuru: Find duplicate files
Quick Overview
Imagededup is a Python package that provides functionality to find duplicate and near-duplicate images. It offers a variety of algorithms for image hashing and similarity detection, making it useful for tasks such as deduplication, organization, and analysis of large image datasets.
Pros
- Supports multiple image hashing algorithms (perceptual, difference, average, wavelet)
- Includes both exact and near-duplicate detection capabilities
- Provides convenient methods for plotting and visualizing results
- Offers flexibility in handling various image formats and directory structures
Cons
- May require significant computational resources for large datasets
- Limited to image-based deduplication (not suitable for other file types)
- Accuracy can vary depending on the chosen algorithm and similarity threshold
- Requires some understanding of image hashing concepts for optimal use
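To make the last point concrete: difference hashing (DHash), for example, reduces an image to a small grid of brightness values and encodes whether each pixel is brighter than its right-hand neighbour; near-duplicate images then differ in only a few bits. A minimal pure-Python sketch of the idea, operating on a toy grayscale grid rather than a real image (the library handles resizing and decoding for you):

```python
def dhash_bits(gray):
    """Difference hash: 1 if a pixel is brighter than its right neighbour.

    `gray` is a small grid of brightness values (a real implementation
    first resizes the grayscale image to this shape). Returns a row-major
    list of 0/1 bits.
    """
    return [
        1 if row[c] > row[c + 1] else 0
        for row in gray
        for c in range(len(row) - 1)
    ]

def hamming(a, b):
    """Number of differing bits between two equal-length bit lists."""
    return sum(x != y for x, y in zip(a, b))

# A tiny 2x3 grid stands in for a resized grayscale image.
img = [[10, 20, 5],
       [30, 30, 40]]
print(dhash_bits(img))  # [0, 1, 0, 0]
```

Two images whose bit sequences are within a chosen Hamming distance are treated as duplicates; picking that distance is exactly the threshold-tuning mentioned above.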
Code Examples
- Basic usage for finding duplicate images in a directory:
from imagededup.methods import PHash
from imagededup.utils import plot_duplicates
phasher = PHash()
duplicates = phasher.find_duplicates(image_dir='path/to/images/')
plot_duplicates(image_dir='path/to/images/', duplicate_map=duplicates)
- Generating image hashes and finding duplicates:
from imagededup.methods import DHash
dhasher = DHash()
encodings = dhasher.encode_images(image_dir='path/to/images/')
duplicates = dhasher.find_duplicates(encoding_map=encodings)
- Finding similar images using a custom threshold:
from imagededup.methods import AHash
ahasher = AHash()
duplicates = ahasher.find_duplicates(image_dir='path/to/images/',
                                     max_distance_threshold=10)
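The `max_distance_threshold` is a Hamming distance: the maximum number of bits by which two 64-bit hashes may differ while still being reported as duplicates. A quick standalone sketch of how such a distance is computed between two hex-encoded hashes (example hash strings are made up for illustration):

```python
def hamming_distance(hash1: str, hash2: str) -> int:
    """Count differing bits between two equal-length hex hash strings."""
    assert len(hash1) == len(hash2)
    # XOR the integers; each set bit marks a position where the hashes differ.
    return bin(int(hash1, 16) ^ int(hash2, 16)).count("1")

print(hamming_distance("e064ece078d7c96a", "e064ece078d7c96a"))  # 0: identical
print(hamming_distance("e064ece078d7c96a", "e064ece078d7c96b"))  # 1 bit apart
```

Raising the threshold catches more heavily transformed duplicates at the cost of more false positives.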
Getting Started
To get started with imagededup, follow these steps:
- Install the package:
pip install imagededup
- Import the desired hashing method:
from imagededup.methods import PHash
- Create an instance of the hashing method:
phasher = PHash()
- Find duplicates in a directory:
duplicates = phasher.find_duplicates(image_dir='path/to/images/')
- Process the results as needed:
for filename, duplicate_list in duplicates.items():
    print(f"Original: {filename}")
    print(f"Duplicates: {duplicate_list}")
Competitor Comparisons
dupeGuru: Find duplicate files
Pros of dupeGuru
- Supports multiple file types (images, music, documents)
- User-friendly GUI for easier navigation and file management
- Offers more advanced filtering and sorting options
Cons of dupeGuru
- Less focused on image-specific deduplication techniques
- May be slower for large-scale image deduplication tasks
- Limited programmatic integration options compared to imagededup
Code Comparison
dupeGuru (Python):
def get_file_hash(path):
with open(path, 'rb') as f:
return hashlib.md5(f.read()).hexdigest()
imagededup (Python):
def hash_func(image):
return imagehash.average_hash(Image.open(image))
Summary
dupeGuru is a versatile tool with a user-friendly interface, suitable for various file types and casual users. imagededup, on the other hand, is more specialized for image deduplication, offering better performance and integration options for developers and data scientists working specifically with image datasets. The code comparison shows that dupeGuru uses a general file hashing approach, while imagededup employs image-specific hashing techniques for more accurate image comparison.
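The practical difference matters: a byte-level hash such as MD5 changes completely if even one byte differs, as happens whenever an image is re-encoded, resized, or has its metadata touched, while a perceptual hash computed from the decoded pixels stays close. A small sketch of the byte-hash side using only hashlib (the byte strings are hypothetical stand-ins for two encodings of the same picture):

```python
import hashlib

# Two hypothetical files with identical visual content but one differing
# byte, as a re-encode or metadata change would produce.
original = b"\x89PNG...image-bytes..."
reencoded = b"\x89PNG...image-bytes.,."

md5_a = hashlib.md5(original).hexdigest()
md5_b = hashlib.md5(reencoded).hexdigest()
print(md5_a == md5_b)  # False: byte hashing misses this duplicate
```

A perceptual hash of the decoded pixels, as imagededup computes, would be identical or nearly identical for the two files, which is why it is the better fit for image deduplication.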
README
Image Deduplicator (imagededup)
imagededup is a Python package that simplifies the task of finding exact and near duplicates in an image collection.
This package provides functionality to make use of hashing algorithms that are particularly good at finding exact duplicates as well as convolutional neural networks which are also adept at finding near duplicates. An evaluation framework is also provided to judge the quality of deduplication for a given dataset.
The package provides the following functionality:
- Finding duplicates in a directory using one of the following algorithms:
- Convolutional Neural Network (CNN) - Select from several prepackaged models or provide your own custom model.
- Perceptual hashing (PHash)
- Difference hashing (DHash)
- Wavelet hashing (WHash)
- Average hashing (AHash)
- Generation of encodings for images using one of the above stated algorithms.
- Framework to evaluate effectiveness of deduplication given a ground truth mapping.
- Plotting duplicates found for a given image file.
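The evaluation framework compares the duplicates a method retrieves against a ground-truth mapping. The core bookkeeping can be sketched in plain Python with hypothetical filenames (the library's own evaluation utilities compute these and further metrics for you):

```python
def per_image_precision_recall(retrieved, ground_truth):
    """Micro-averaged precision/recall of retrieved duplicates vs ground truth.

    Both arguments map each filename to a list of its duplicate filenames.
    """
    tp = fp = fn = 0
    for name, truth in ground_truth.items():
        got = set(retrieved.get(name, []))
        truth = set(truth)
        tp += len(got & truth)   # correctly retrieved duplicates
        fp += len(got - truth)   # retrieved but not true duplicates
        fn += len(truth - got)   # true duplicates that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical ground truth and retrieval results
truth = {"a.jpg": ["b.jpg", "c.jpg"], "d.jpg": []}
found = {"a.jpg": ["b.jpg"], "d.jpg": ["e.jpg"]}
print(per_image_precision_recall(found, truth))  # (0.5, 0.5)
```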
Detailed documentation for the package can be found at: https://idealo.github.io/imagededup/
imagededup is compatible with Python 3.8+ and runs on Linux, MacOS X and Windows. It is distributed under the Apache 2.0 license.
Contents
Installation
There are two ways to install imagededup:
- Install imagededup from PyPI (recommended):
pip install imagededup
- Install imagededup from the GitHub source:
git clone https://github.com/idealo/imagededup.git
cd imagededup
pip install "cython>=0.29"
python setup.py install
Quick Start
To find duplicates in an image directory using perceptual hashing, the following workflow can be used:
- Import perceptual hashing method
from imagededup.methods import PHash
phasher = PHash()
- Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')
- Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)
- Plot duplicates obtained for a given file (e.g. 'ukbench00120.jpg') using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
The output is a plot showing the query image alongside its retrieved duplicates.
The complete code for the workflow is:
from imagededup.methods import PHash
phasher = PHash()
# Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')
# Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)
# plot duplicates obtained for a given file using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
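The returned `duplicates` dictionary maps every filename to the list of its duplicates, so each group of duplicates appears once per member. One way to turn it into a minimal removal list is to keep the first file seen in each group (a sketch over a hypothetical mapping; the library's hashing classes also provide `find_duplicates_to_remove` for this purpose):

```python
def files_to_remove(duplicate_map):
    """Keep one representative per duplicate group, flag the rest for removal."""
    keep, remove = set(), set()
    for filename, dupes in duplicate_map.items():
        if filename in remove:
            continue  # already flagged as a duplicate of an earlier file
        keep.add(filename)
        for d in dupes:
            if d not in keep:
                remove.add(d)
    return sorted(remove)

# Hypothetical output of find_duplicates()
duplicates = {
    "1.jpg": ["2.jpg", "3.jpg"],
    "2.jpg": ["1.jpg", "3.jpg"],
    "3.jpg": ["1.jpg", "2.jpg"],
    "4.jpg": [],
}
print(files_to_remove(duplicates))  # ['2.jpg', '3.jpg']
```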
It is also possible to use your own custom models for finding duplicates using the CNN method.
For examples, refer to this part of the repository.
For more detailed usage of the package functionality, refer to: https://idealo.github.io/imagededup/
Benchmarks
Update: The provided benchmarks are only valid up to imagededup v0.2.2. Later releases introduce significant changes to all methods, so the current benchmarks may not hold.
Detailed benchmarks on speed and classification metrics for the different methods are provided in the documentation. Generally speaking, the following conclusions can be drawn:
- CNN works best for near duplicates and datasets containing transformations.
- All deduplication methods fare well on datasets containing exact duplicates, but Difference hashing is the fastest.
Contribute
We welcome all kinds of contributions. See the Contribution guide for more details.
Citation
Please cite Imagededup in your publications if this is useful for your research. Here is an example BibTeX entry:
@misc{idealods2019imagededup,
title={Imagededup},
author={Tanuj Jain and Christopher Lennan and Zubin John and Dat Tran},
year={2019},
howpublished={\url{https://github.com/idealo/imagededup}},
}
Maintainers
© Copyright
See LICENSE for details.