fastdup Quick Start


fastdup is a tool for curating and improving the quality of large image datasets, which in turn can improve the performance of machine learning models trained on them. It quickly analyzes a dataset and highlights issues such as duplicate, mislabeled, blurry, and outlier images.

Getting started with fastdup takes only a few steps:

Install fastdup

Installing fastdup from PyPI is straightforward. The commands below cover macOS, Ubuntu, CentOS/RHEL, and Windows (via WSL):

macOS:

brew install python@3.x
python3.x -m pip install --upgrade pip
python3.x -m pip install fastdup

Ubuntu:

sudo apt update
sudo apt -y install software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
sudo apt -y install python3.8
sudo apt -y install python3-pip
sudo apt -y install libgl1-mesa-glx
python3.8 -m pip install --upgrade pip
python3.8 -m pip install fastdup

CentOS/RHEL:

sudo yum -y install epel-release
sudo yum -y update
sudo yum -y groupinstall "Development Tools"
sudo yum -y install openssl-devel bzip2-devel libffi-devel xz-devel
sudo yum -y install wget
sudo yum -y install redhat-lsb-core # for lsb_release
sudo yum -y install ffmpeg ffmpeg-devel # for video support

python3.10 -m pip install <path of the downloaded whl> # see details in installing fastdup

Windows (WSL):

# Install WSL (in PowerShell)

# In your Linux terminal
sudo apt update
sudo apt -y install software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
sudo apt -y install python3.10
sudo apt -y install python3-pip
sudo apt -y install libgl1-mesa-glx
python3.10 -m pip install --upgrade pip
python3.10 -m pip install fastdup

Import fastdup

Once fastdup is installed, import it in your Python environment:

import fastdup
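
If the import fails, it can help to confirm that the package landed in the interpreter you are actually running. The snippet below is a minimal sketch using only the Python standard library (importlib.metadata, Python 3.8+); the helper name installed_version is hypothetical and not part of fastdup.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Check the interpreter this script runs under, not some other Python on PATH.
    version = installed_version("fastdup")
    if version is None:
        print("fastdup is not installed in this environment")
    else:
        print(f"fastdup {version} is installed")
```

Running this with the same interpreter used for installation (for example python3.10) avoids the common mismatch between the pip that installed the package and the Python that imports it.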

Create a Dataset object

Use fastdup.create() to create a dataset object. The function takes an input directory (input_dir) that contains your data and a work directory (work_dir) where fastdup writes its files:

fd = fastdup.create(work_dir="fastdup_work_dir/", input_dir="images/")
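
Before creating the dataset object, a quick sanity check on the input directory can catch a wrong path early. The following is a sketch that uses only pathlib; the helper count_images and the extension list are assumptions for illustration and do not reflect the full set of formats fastdup accepts.

```python
from pathlib import Path

# Assumed subset of image extensions for illustration; adjust to your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images(input_dir: str) -> int:
    """Count files under input_dir (recursively) whose extension looks like an image."""
    root = Path(input_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"input directory not found: {root}")
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


if __name__ == "__main__":
    try:
        print(f"found {count_images('images/')} candidate images")
    except FileNotFoundError as err:
        print(err)
```

A count of zero usually means the path is wrong or the images use an extension not in the assumed list.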

Analyze your data

Use the run() function of the dataset object to analyze the data in the input directory:

fd.run()
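
Since fastdup writes its files into the work directory, one simple way to confirm that the run produced output is to list that directory afterwards. This is a sketch using only the standard library; list_outputs is a hypothetical helper, and the names of the files fastdup writes are not assumed here.

```python
from pathlib import Path
from typing import List, Tuple


def list_outputs(work_dir: str) -> List[Tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under work_dir, sorted by path."""
    root = Path(work_dir)
    if not root.is_dir():
        return []
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


if __name__ == "__main__":
    # Uses the work directory name from the example above.
    for name, size in list_outputs("fastdup_work_dir/"):
        print(f"{size:>12}  {name}")
```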

Visualize results

Use the explore() function of the dataset object to visualize analysis results and see similarity clusters and detected issues:

fd.explore()
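
The steps above can be collected into a single script. The sketch below uses only the three fastdup calls shown in this guide (create, run, explore); the wrapper function quick_start and the up-front path check are illustrative additions, not part of the fastdup API.

```python
from pathlib import Path


def quick_start(input_dir: str = "images/", work_dir: str = "fastdup_work_dir/"):
    """Run the create -> run -> explore steps and return the dataset object."""
    # Fail early with a clear message if the input path is wrong.
    if not Path(input_dir).is_dir():
        raise FileNotFoundError(f"input directory not found: {input_dir}")

    import fastdup  # imported lazily so the path check above runs first

    fd = fastdup.create(work_dir=work_dir, input_dir=input_dir)
    fd.run()      # analyze the data in input_dir
    fd.explore()  # visualize similarity clusters and detected issues
    return fd
```

Calling quick_start() with no arguments uses the same directory names as the examples in this guide; pass your own paths to analyze a different dataset.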