
Install fastdup

First, install fastdup and verify the installation.
pip install fastdup
Now, test the installation by printing the version. If there’s no error message, we are ready to go.
import fastdup
fastdup.__version__
'1.38'

Roboflow Universe

Roboflow Universe hosts over 200,000 computer vision datasets. To download datasets from Roboflow Universe, sign up for free. Then head over to https://universe.roboflow.com/ and search for a dataset of interest. Once you find one, click the ‘Download Dataset’ button on the dataset page. A pop-up will appear with a code snippet to download the dataset to your local machine. Copy the code snippet.
🚧 Warning The code snippet contains an API key that is tied to your account. Keep it private.

Install Roboflow Python

The Roboflow Python Package is a Python wrapper around the core Roboflow web application and REST API. To install, run:
pip install roboflow
Now you can use the Roboflow Python package to download the dataset programmatically to your local machine.
from roboflow import Roboflow
rf = Roboflow(api_key="YOUR_API_KEY")
❗️ API Key Replace YOUR_API_KEY with your own API key from Roboflow. Do not share this key beyond your team; it is a private key tied to your Roboflow account.

Download Dataset

For this tutorial, let’s download the Dash Diet 101 Dataset in COCO annotations format into our local folder.
project = rf.workspace("dash101").project("dash-diet-101")
dataset = project.version(4).download("coco")
Once completed, you should have a folder in your current directory with the name DASH-DIET-101-4. The DASH-DIET-101 dataset was created by Bhavya Bansal, Nikunj Bansal, Dhruv Sehgal, Yogita Gehani, and Ayush Rai with the goal of creating a model to detect food items that reduce Hypertension. It contains 16,900 images of 101 popular food items with annotated bounding boxes.

Analyze Bounding Boxes with fastdup

To run fastdup, you only need to point input_dir to the folder containing images from the dataset.
fd = fastdup.create(input_dir='./DASH-DIET-101-4/train')
fastdup works on both labeled and unlabeled datasets. Since this dataset is labeled, let’s make use of the labels by passing them into the run method.
fd.run(annotations='DASH-DIET-101-4/train/_annotations.coco.json')
Now sit back and relax as fastdup analyzes the dataset.

Invalid Bounding Boxes

Since this dataset is annotated with bounding boxes, let’s check if all the bounding boxes are valid. Bounding boxes that are either too small or go beyond image boundaries are flagged as bad bounding boxes in fastdup. Let’s get the invalid bounding boxes.
fd.invalid_instances()
import pandas as pd
bad_bb = pd.read_csv('work_dir/full_image_run/atrain_features.bad.csv')
bad_bb
Let’s count the number of images with bad bounding boxes.
bad_bb['error_code'].value_counts()
error_code
ERROR_BAD_BOUNDING_BOX    22
Name: count, dtype: int64
The output shows that a total of 22 images contain bounding box issues. It is now up to you how to deal with these bounding boxes: you can relabel them or simply discard the entire image from the training set.
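As a rough intuition for what makes a bounding box invalid, here is a minimal validity check. This is an illustrative sketch, not fastdup's internal rule, and the `min_size` threshold is an assumption:

```python
def is_valid_bbox(x, y, w, h, img_w, img_h, min_size=10):
    """Return True if the box lies inside the image and is not degenerate.

    Illustrative rule only (not fastdup's exact check):
    - the box must be at least `min_size` pixels in each dimension
    - the box must fit entirely within the image boundaries
    """
    if w < min_size or h < min_size:
        return False  # too small to be useful
    if x < 0 or y < 0 or x + w > img_w or y + h > img_h:
        return False  # extends beyond the image
    return True

# A 50x50 box well inside a 640x480 image passes
print(is_valid_bbox(100, 100, 50, 50, 640, 480))  # True
# A box that spills past the right edge is flagged
print(is_valid_bbox(620, 100, 50, 50, 640, 480))  # False
```

Boxes failing either rule would be the kind of instance that ends up in the bad bounding box CSV above.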

Label Distribution

Let’s view the label distribution in a bar chart.
📘 Info This code snippet uses plotly for plotting; install it with:
pip install plotly
import plotly.express as px
fig = px.histogram(fd.annotations(), x="label")
fig.show()
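If you just want the counts without a chart, a plain-Python tally works too. This sketch uses a toy label list; in practice the `label` column of `fd.annotations()` would supply the real one:

```python
from collections import Counter

# Toy stand-in for the 'label' column of fd.annotations()
labels = ["apple", "banana", "apple", "oatmeal", "apple", "banana"]

counts = Counter(labels)
for label, n in counts.most_common():
    print(f"{label}: {n}")
# apple: 3
# banana: 2
# oatmeal: 1
```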

Bounding Box Size and Shape Issues

Objects come in various shapes and sizes, and sometimes objects might be incorrectly labeled or too small to be useful. We will now find the smallest, narrowest, and widest objects, and assess their usefulness. Let’s get the annotations and calculate the area and aspect ratio.
df = fd.annotations()
df['area'] = df['width'] * df['height']
df['aspect'] = df['width'] / df['height']
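On a single toy box (made-up numbers), the two derived measures behave as expected: area captures how small the object is, and aspect ratio captures how elongated it is.

```python
# Toy bounding box: 200 px wide, 50 px tall
width, height = 200, 50

area = width * height    # 10000 px^2
aspect = width / height  # 4.0 -> four times wider than tall

print(area, aspect)  # 10000 4.0
```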
Next, we filter for the smallest 5% bounding boxes and 5% of extreme aspect ratios.
# Smallest 5% of objects:
smallest_objects = df[df['area'] < df['area'].quantile(0.05)].sort_values(by=['area'])

# 5% of extreme aspect ratios
aspect_ratio_objects = df[(df['aspect'] < df['aspect'].quantile(0.05))
                         |(df['aspect'] > df['aspect'].quantile(0.95))].sort_values(by=['aspect'])
Now let’s create a simple function to visualize the images.
from PIL import Image
import matplotlib.pyplot as plt

def plot_image_gallery(df, num_images=5):
    # Create 1x5 subplots
    fig, axes = plt.subplots(1, num_images, figsize=(15, 3))
    
    # Plot each image in a subplot
    for ax, (_, row) in zip(axes, df.iterrows()):
        image_path = row['crop_filename']
        label = row['label']
        
        # Open the image using PIL
        img = Image.open(image_path)
        
        # Display image
        ax.imshow(img)
        ax.axis('off')
        
        # Set title
        ax.set_title(label)
    
    # Show plot
    plt.show()
    
View the smallest objects in a DataFrame and plot them in a gallery.
smallest_objects.head()
plot_image_gallery(smallest_objects)
View the objects with the most extreme aspect ratios, from both ends.
aspect_ratio_objects.head()
plot_image_gallery(aspect_ratio_objects.head())
plot_image_gallery(aspect_ratio_objects.tail())

Visualize Issues with fastdup Gallery

There are several other methods we can use to inspect and visualize the issues found.
fd.vis.duplicates_gallery()    # create a visual gallery of duplicates
fd.vis.outliers_gallery()      # create a visual gallery of anomalies
fd.vis.component_gallery()     # create a visualization of connected components
fd.vis.stats_gallery()         # create a visualization of images statistics (e.g. blur)
fd.vis.similarity_gallery()    # create a gallery of similar images

Duplicates & Near-duplicates

First, let’s visualize the duplicate images at the bounding box level.
📘 Note The duplicates visualized here are at the bounding box level, NOT at the image level. In other words, each bounding box is cropped from its original image and compared to other bounding box crops for duplicates.
fd.vis.duplicates_gallery()
Now let’s get the number of exact/near duplicates.
similarity_df = fd.similarity()
near_duplicates = similarity_df[similarity_df["distance"] >= 0.98]

near_duplicates = near_duplicates[['distance', 'crop_filename_from', 'crop_filename_to']]
near_duplicates.head()
Let’s see how many near duplicates we find by checking the number of rows in the DataFrame.
len(near_duplicates)
20394
A distance value of 1.0 indicates an exact duplicate. As a sanity check, let’s show the images that are flagged as duplicates here.
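The distance reported by fastdup is a similarity score between the feature embeddings of the two crops, where identical embeddings score 1.0. As a rough intuition (a sketch, not fastdup's exact implementation), cosine similarity over feature vectors behaves the same way:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical toy embeddings -> exact duplicate
print(cosine_similarity([0.2, 0.5, 0.1], [0.2, 0.5, 0.1]))   # ≈ 1.0
# Slightly perturbed embedding -> near duplicate, just under 1.0
print(cosine_similarity([0.2, 0.5, 0.1], [0.21, 0.49, 0.1]))
```

With a score like this, a threshold such as the 0.98 used above separates near duplicates from merely similar crops.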
from IPython.display import Image
Image(filename="work_dir/crops/DASH-DIET-101-4trainR-1_jpg.rf.f5a1151cb95e5bbc7d04efb70c805ab0.jpg_193_107_208_215.jpg")
Image(filename="work_dir/crops/DASH-DIET-101-4trainR-1_jpg.rf.00f7c3e0c3b443aa1b4b5a04dc6f26eb.jpg_180_101_216_209.jpg")

Image Clusters

We can also view similar-looking images forming clusters.
fd.vis.component_gallery()

Bright/Dark/Blurry Images

We can show the brightest images from the dataset in a gallery. Change metric to blur or dark to view blurry and dark images.
fd.vis.stats_gallery(metric='bright')
To get a DataFrame with statistical details for each bounding box image, run:
fd.img_stats()
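Under the hood, these statistics are simple per-image measures. As an illustration (a simplified stand-in, not fastdup's exact formula), mean pixel intensity is a common brightness metric:

```python
def mean_brightness(pixels):
    """Mean grayscale intensity of a flat list of 0-255 pixel values.

    Simplified brightness measure: values near 255 are bright, near 0 dark.
    """
    return sum(pixels) / len(pixels)

# Toy 2x2 grayscale images, flattened to lists
bright_img = [240, 250, 245, 255]
dark_img = [10, 5, 20, 15]

print(mean_brightness(bright_img))  # 247.5
print(mean_brightness(dark_img))    # 12.5
```

Ranking images by a measure like this is what lets the stats gallery surface the brightest (or darkest) crops first.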

Mislabels

Potential mislabels can be surfaced with the similarity gallery, sliced to show bounding box images whose most similar neighbors carry a different label.
fd.vis.similarity_gallery(slice='diff')

Wrap Up

That’s it! We’ve conveniently surfaced many issues with this dataset just by running fastdup. Taking care of these dataset quality issues should help you train better models. Questions about this tutorial? Reach out to us on our Slack channel!

VL Profiler - A faster and easier way to diagnose and visualize dataset issues

The team behind fastdup also recently launched VL Profiler, a no-code cloud-based platform that lets you leverage fastdup in the browser. VL Profiler lets you find:
  • Duplicates/near-duplicates.
  • Outliers.
  • Mislabels.
  • Non-useful images.
Here’s a preview of VL Profiler.
👍 Free Usage Use VL Profiler for free to analyze issues on your dataset with up to 1,000,000 images. Get started for free.
Not convinced yet? Interact with a collection of datasets like ImageNet-21K, COCO, and DeepFashion here. No sign-up needed.