Loading Data from External Sources

Roboflow Universe

Roboflow Universe hosts over 200,000 computer vision datasets.

To download datasets from Roboflow Universe, sign up for a free account.

Now, head over to https://universe.roboflow.com/ to search for the dataset of interest.

Once you find a dataset, click on the 'Download Dataset' button on the dataset page.

A pop-up will appear with a code snippet that downloads the dataset to your local machine. Copy the code snippet.



The code snippet contains an API key that is tied to your account. Keep it private.

Install Roboflow Python

The Roboflow Python Package is a Python wrapper around the core Roboflow web application and REST API. To install, run:

pip install roboflow

Now you can use the Roboflow Python package to download the dataset programmatically to your local machine.

from roboflow import Roboflow
rf = Roboflow(api_key="YOUR_API_KEY")



Replace YOUR_API_KEY with your own API key from Roboflow. Do not share this key beyond your team; it is private and tied to your Roboflow account.
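One way to keep the key out of notebooks and version control is to read it from an environment variable. The sketch below assumes a variable named ROBOFLOW_API_KEY; that name is chosen for this example and is not something the Roboflow package requires.

```python
import os


def load_api_key(env_var: str = "ROBOFLOW_API_KEY") -> str:
    """Return the API key stored in `env_var` (a name chosen for this sketch)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage, assuming the variable has been exported in your shell:
#   from roboflow import Roboflow
#   rf = Roboflow(api_key=load_api_key())
```

With this in place, the key never appears in the code you share with others.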

Download Dataset

For this tutorial, let's download the Dash Diet 101 Dataset in COCO annotation format to a local folder.

project = rf.workspace("dash101").project("dash-diet-101")
dataset = project.version(4).download("coco")

Once completed, you should have a folder in your current directory with the name DASH-DIET-101-4.
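To confirm the download, you can count the image files in each subfolder. This is a minimal sketch that assumes the export contains one subfolder per split (the tutorial only relies on train) and that images use the extensions listed in IMAGE_SUFFIXES; adjust both to match what you see on disk.

```python
from pathlib import Path

# Assumption: image extensions used by the export; adjust as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_split(dataset_dir: str) -> dict:
    """Map each immediate subfolder of `dataset_dir` to its number of image files."""
    counts = {}
    for split_dir in sorted(Path(dataset_dir).iterdir()):
        if split_dir.is_dir():
            counts[split_dir.name] = sum(
                1
                for f in split_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


# Usage after the download above:
#   print(count_images_per_split("DASH-DIET-101-4"))
```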

The DASH-DIET-101 dataset was created by Bhavya Bansal, Nikunj Bansal, Dhruv Sehgal, Yogita Gehani, and Ayush Rai with the goal of creating a model to detect food items that reduce hypertension.

It contains 16,900 images of 101 popular food items with annotated bounding boxes.

Analyze Bounding Boxes with fastdup

To run fastdup, you only need to point input_dir to the folder containing images from the dataset.

import fastdup
fd = fastdup.create(input_dir='./DASH-DIET-101-4/train')

fastdup works on both labeled and unlabeled datasets. Since this dataset is labeled, let's make use of the labels by passing them into the run method.
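As a rough sketch of how the labels could be prepared, the helper below flattens a COCO annotation dictionary into one row per bounding box. Several things here are assumptions rather than facts from this tutorial: the column names (filename, col_x, row_y, width, height, label) follow what we understand fastdup to expect for bounding boxes, and the annotation file name _annotations.coco.json reflects a typical Roboflow COCO export. Check both against your installed fastdup version and your download.

```python
import os


def coco_to_rows(coco: dict, image_dir: str) -> list:
    """Flatten a COCO dict into one row (a plain dict) per bounding box."""
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}
    class_names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    rows = []
    for ann in coco["annotations"]:
        # COCO boxes are stored as [x_min, y_min, width, height].
        x, y, w, h = ann["bbox"]
        rows.append({
            "filename": os.path.join(image_dir, file_names[ann["image_id"]]),
            # int() truncates fractional pixel coordinates.
            "col_x": int(x),
            "row_y": int(y),
            "width": int(w),
            "height": int(h),
            "label": class_names[ann["category_id"]],
        })
    return rows


# Usage (the paths and the annotation file name are assumptions):
#   import json
#   import pandas as pd
#   train_dir = "./DASH-DIET-101-4/train"
#   with open(os.path.join(train_dir, "_annotations.coco.json")) as f:
#       df = pd.DataFrame(coco_to_rows(json.load(f), train_dir))
#   fd.run(annotations=df)
```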


Now sit back and relax as fastdup analyzes the dataset.


Kaggle

To load any dataset from Kaggle, you first need to sign up for an account. It's free.

On Kaggle, you can browse for a dataset of interest and manually download it to your machine.

Kaggle API

Alternatively, you can use the Kaggle API to programmatically download any dataset using Python.

To install the Kaggle API, run:

pip install -Uq kaggle

After signing up for a Kaggle account, head over to the 'Account' tab and select 'Create API Token'. This will trigger the download of kaggle.json, a file containing your API credentials.

Place this file at ~/.kaggle/kaggle.json (on Windows, at C:\Users\<Windows-username>\.kaggle\kaggle.json). Read more here.
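If the Kaggle commands later complain about credentials, a quick check of the file can help. The sketch below is a hypothetical helper, not part of the Kaggle package; it assumes the token file holds username and key fields and, on POSIX systems, flags permissions that let other users access the file.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_credentials(path=None) -> list:
    """Return a list of problems found with the credentials file (empty if none)."""
    cred = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not cred.is_file():
        return [f"{cred} does not exist"]
    try:
        data = json.loads(cred.read_text())
    except json.JSONDecodeError:
        return [f"{cred} is not valid JSON"]
    if not isinstance(data, dict):
        return [f"{cred} does not contain a JSON object"]
    problems = []
    for field in ("username", "key"):
        if not data.get(field):
            problems.append(f"missing '{field}' field")
    # On POSIX systems, flag files that group or other users can access.
    if os.name == "posix" and cred.stat().st_mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append("file is accessible to other users; consider chmod 600")
    return problems


# Usage:
#   print(check_kaggle_credentials() or "credentials look fine")
```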

If the setup is done correctly, you should be able to run the Kaggle commands in your terminal. For instance, to list Kaggle datasets that match the term "computer vision", run:

kaggle datasets list -s "computer vision"
ref                                                           title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
------------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
jeffheaton/iris-computer-vision                               Iris Computer Vision                                  5MB  2020-11-24 21:23:29           1415         20  0.875            
bhavikardeshna/visual-question-answering-computer-vision-nlp  Visual Question Answering- Computer Vision & NLP    411MB  2022-06-14 04:32:28            421         37  0.8235294        
sanikamal/horses-or-humans-dataset                            Horses Or Humans Dataset                            307MB  2019-04-24 20:09:38           8405        120  0.875            
phylake1337/fire-dataset                                      FIRE Dataset                                        387MB  2020-02-25 16:45:29          12098        180  0.875            
fedesoriano/cifar100                                          CIFAR-100 Python                                    161MB  2020-12-26 08:37:10           4881        116  1.0              
fedesoriano/chinese-mnist-digit-recognizer                    Chinese MNIST in CSV - Digit Recognizer               8MB  2021-06-08 12:15:47            966         45  1.0              
bulentsiyah/opencv-samples-images                             OpenCV samples (Images)                              13MB  2020-05-19 14:36:01           2374         72  0.75             
jeffheaton/traveling-salesman-computer-vision                 Traveling Salesman Computer Vision                    3GB  2022-04-20 01:13:17            183         22  0.875            
sanikamal/rock-paper-scissors-dataset                         Rock Paper Scissors Dataset                         452MB  2019-04-24 19:53:04           4556         78  0.875            
muratkokludataset/dry-bean-dataset                            Dry Bean Dataset                                      5MB  2022-04-02 23:19:30           2303       1464  0.9375           
juniorbueno/opencv-facial-recognition-lbph                    OpenCV - Facial Recognition - LBPH                    6MB  2021-12-01 10:47:12            487         45  0.875            
rickyjli/chinese-fine-art                                     Chinese Fine Art                                    323MB  2020-05-02 03:00:40            821         38  0.8235294        
mpwolke/cusersmarildownloadsmondrianpng                       Computer Vision. C'est  Audacieux, Luxueux,  Chic!  417KB  2022-04-10 21:41:35             10         20  1.0              
paultimothymooney/cvpr-2019-papers                            CVPR 2019 Papers                                      5GB  2019-06-16 18:28:50            934         50  0.875            
emirhanai/human-action-detection-artificial-intelligence      Human Action Detection - Artificial Intelligence    147MB  2022-04-22 21:07:24           1468         40  1.0              
vencerlanz09/plastic-paper-garbage-bag-synthetic-images       Plastic - Paper - Garbage Bag Synthetic Images      451MB  2022-08-26 09:57:18           1127         76  0.875            
shaunthesheep/microsoft-catsvsdogs-dataset                    Cats-vs-Dogs                                        788MB  2020-03-12 05:34:30          27897        345  0.875            
birdy654/cifake-real-and-ai-generated-synthetic-images        CIFAKE: Real and AI-Generated Synthetic Images      105MB  2023-03-28 16:00:29           1702         44  0.875            
ryanholbrook/computer-vision-resources                        Computer Vision Resources                            13MB  2020-07-23 10:40:17           2491         11  0.1764706        
fedesoriano/qmnist-the-extended-mnist-dataset-120k-images     QMNIST - The Extended MNIST Dataset (120k images)    19MB  2021-07-24 15:31:01            844         29  1.0     

See more commands here.

Optionally, you can also browse the Kaggle website to find the dataset you're interested in downloading.

Download Dataset

Let's say we're interested in analyzing the RVL-CDIP Test Dataset.

Head to the dataset page, click the 'Copy API command' button, and paste the command into your terminal.

kaggle datasets download -d pdavpoojan/the-rvlcdip-dataset-test

Once done, we should have the file the-rvlcdip-dataset-test.zip in the current directory.

Let's unzip the file for further analysis with fastdup in the next section.

unzip -q the-rvlcdip-dataset-test.zip

Once completed, we should have a folder with the name test/ which contains all the images from the dataset.
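The download and unzip steps can also be driven from Python. The sketch below shows two hedged options: extract_zip uses only the standard library on an archive you already downloaded, while download_dataset relies on the Kaggle package's KaggleApi class and assumes your credentials are already configured. Both function names are our own.

```python
import zipfile


def extract_zip(zip_path: str, dest_dir: str = ".") -> list:
    """Extract `zip_path` into `dest_dir` and return the sorted top-level names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        names = archive.namelist()
    # Archive member names always use forward slashes.
    return sorted({name.split("/")[0] for name in names})


def download_dataset(ref: str, dest_dir: str = ".") -> None:
    """Download and unzip a Kaggle dataset given its 'owner/name' reference."""
    if ref.count("/") != 1:
        raise ValueError("expected a reference of the form 'owner/name'")
    # Imported lazily so this module loads even without credentials configured.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files(ref, path=dest_dir, unzip=True)


# Usage:
#   download_dataset("pdavpoojan/the-rvlcdip-dataset-test")
#   # or, if the zip file is already on disk:
#   print(extract_zip("the-rvlcdip-dataset-test.zip"))
```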

Load Annotations



This step is optional. fastdup works with both labeled and unlabeled datasets.

If you decide not to load the annotations, you can run fastdup with just the following code.

import fastdup  
fd = fastdup.create(input_dir="IMAGE_FOLDER/")  

Although you can run fastdup without the annotations, specifying the labels enables more analysis, such as inspecting mislabeled images.

Since the dataset is labeled, let's make use of the labels and feed them into fastdup.

fastdup expects the labels to be formatted into a Pandas DataFrame with the columns filename and label.

Let's search the directory recursively for the filenames and labels, and format them into a DataFrame.

import glob
import os
import pandas as pd

# Define the path
path = "test/"

# Define the pattern for the .tif images found in the dataset
patterns = ['*.tif']

# Use glob to recursively collect all matching image filenames
filenames = [f for pattern in patterns for f in glob.glob(path + '**/' + pattern, recursive=True)]

# Extract the parent folder name for each filename
label = [os.path.basename(os.path.dirname(filename)) for filename in filenames]

# Convert to a pandas DataFrame with filename and label columns
df = pd.DataFrame({
    'filename': filenames,
    'label': label
})
df.head()

filename label
0 test/advertisement/12636110.tif advertisement
1 test/advertisement/926916.tif advertisement
2 test/advertisement/502599726+-9726.tif advertisement
3 test/advertisement/509132392+-2392.tif advertisement
4 test/advertisement/12888045.tif advertisement
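Before handing the labels to fastdup, it can be useful to check how many files each class has. Here is a small sketch that applies the same parent-folder rule as above; the sample paths mix two from the table with one made-up path for illustration.

```python
import os
from collections import Counter


def count_labels(filenames: list) -> Counter:
    """Count files per label, where the label is the parent folder name."""
    return Counter(os.path.basename(os.path.dirname(f)) for f in filenames)


sample = [
    "test/advertisement/12636110.tif",
    "test/advertisement/926916.tif",
    "test/letter/example.tif",  # hypothetical path, for illustration only
]
print(count_labels(sample).most_common())
```

With the real data, pass the filenames list built earlier instead of sample.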

Run fastdup

To run fastdup with the annotations DataFrame, let's point input_dir to the image folder and annotations to the df DataFrame.

fd = fastdup.create(input_dir='test')
fd.run(annotations=df)

Now sit back and relax as fastdup analyzes the dataset.

Hugging Face

The Hugging Face datasets package provides an easy interface to load any dataset from the Hugging Face platform. On top of the package, fastdup provides a wrapper class, FastdupHFDataset, as a connector to ensure the datasets package works seamlessly within fastdup.

The FastdupHFDataset class works the same way as the load_dataset method. You can import the wrapper class and specify the name of the Hugging Face Datasets repository as the first argument.

In this example, we load the Tiny ImageNet dataset, which contains 100,000 images of 200 classes (500 for each class) downsized to 64×64 colored images. Each class has 500 training images, 50 validation images, and 50 test images.

In the following code, we load the train split of the Tiny ImageNet dataset.

from fastdup.datasets import FastdupHFDataset
dataset = FastdupHFDataset("zh-plus/tiny-imagenet")



Optional parameters for the FastdupHFDataset class:

  • split - Which split to download. Default: 'train'.
  • img_key - The key value for the dataset column containing images. Default: 'image'.
  • label_key - The key value for the dataset column containing labels. Default: 'label'.
  • cache_dir - Where to cache the downloaded dataset. Default: '/root/.cache/huggingface/datasets/'.
  • jpg_save_dir - Which folder to store the jpg images. Default: 'jpg_images'.
  • reconvert_jpg - Flag to force reconversion of images from .parquet to .jpg. Default: False.

See implementation for details.
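As an illustration of the optional parameters, the sketch below wraps the loader in a function of our own (load_tiny_imagenet is not part of fastdup) and passes every documented option by keyword. The values mirror the defaults listed above except for cache_dir, which you supply.

```python
def load_tiny_imagenet(cache_dir: str, jpg_save_dir: str = "jpg_images"):
    """Load the train split, overriding where files are cached and converted."""
    # Imported here so the function can be defined without fastdup installed.
    from fastdup.datasets import FastdupHFDataset

    return FastdupHFDataset(
        "zh-plus/tiny-imagenet",
        split="train",
        img_key="image",
        label_key="label",
        cache_dir=cache_dir,
        jpg_save_dir=jpg_save_dir,
        reconvert_jpg=False,
    )


# Usage (the cache path is an arbitrary example):
#   dataset = load_tiny_imagenet(cache_dir="./hf_cache")
```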

Now, let's inspect the dataset object.

    features: ['image', 'label'],
    num_rows: 100000

Get the first element of the dataset.

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=64x64>,
 'label': 0}

Get the PIL image of the first element.


Get the label of the first element.
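The lookups described above can be sketched as follows. The snippet uses a small stand-in list so it runs without downloading anything; we assume the real dataset object supports the same indexing, as the first-element output shown earlier suggests.

```python
def first_sample(dataset):
    """Return the (image, label) pair stored in the first element."""
    sample = dataset[0]  # a dict with 'image' and 'label' keys
    return sample["image"], sample["label"]


# Stand-in data for illustration; replace it with the dataset loaded above.
stand_in = [{"image": "<PIL image placeholder>", "label": 0}]
image, label = first_sample(stand_in)
print(label)
```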




You can also confirm the image and label of the first element by heading to the dataset page.

Run fastdup

Once loaded, we can analyze the dataset in fastdup by passing in two properties of the dataset object:

  • dataset.img_dir - Returns the folder directory where the jpg images are saved.
  • dataset.annotations - Returns a DataFrame of image and class labels.

filename label
0 /root/.cache/huggingface/datasets/tiny-imagenet/jpg_images/38/19443.jpg 38
1 /root/.cache/huggingface/datasets/tiny-imagenet/jpg_images/38/19127.jpg 38
2 /root/.cache/huggingface/datasets/tiny-imagenet/jpg_images/38/19199.jpg 38
3 /root/.cache/huggingface/datasets/tiny-imagenet/jpg_images/38/19271.jpg 38
4 /root/.cache/huggingface/datasets/tiny-imagenet/jpg_images/38/19213.jpg 38

Let's run fastdup and pass in dataset.img_dir and dataset.annotations as arguments.

fd = fastdup.create(input_dir=dataset.img_dir)
fd.run(annotations=dataset.annotations)