Creating a Dataset with fastdup

After you install the fastdup package, you need to import it and then create a dataset object.

This can be done using the fastdup.create() function as follows:

import fastdup

fd = fastdup.create(work_dir="fastdup_work_dir/", input_dir="images/")

📘

Parameters for fastdup.create

  • work_dir - Path to store the artifacts generated from the data analysis (mandatory)
  • input_dir - Path to the images/videos to analyze (optional, may be provided later)

Possible values for the input_dir parameter:

  • A folder
  • A remote folder (s3 or minio, starting with s3:// or minio://). When using minio, append the minio server name, for example minio://google/visual_db/sku110k
  • A file containing absolute filenames each on its own row
  • A file containing s3 full paths or minio paths each on its own row
  • A python list with absolute filenames
  • A python list with absolute folders; all images and videos in those folders are added recursively
  • A yolo-v5 yaml input file containing train and test folders (single folder supported for now)
  • Supported formats: jpg, jpeg, tiff, tif, giff, heif, heic, bmp, png, webp, mp4, avi. Archives (tar, tar.gz, tgz, and zip) containing images are also supported, as are 16-bit RGBA, RGB, and grayscale images.

If you have images with other extensions that are readable by OpenCV's imread(), you can list them in a file (one image per row). In that case fastdup does not check for the known extensions and uses OpenCV to read those formats.
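As a minimal sketch of the file-based input described above, the helper below writes the absolute path of every matching image under a folder to a text file, one per row. The function name write_file_list, the default extension set, and the file name image_list.txt are illustrative assumptions, not part of fastdup.

```python
from pathlib import Path

# Hypothetical default; adjust to the extensions your OpenCV build can read.
DEFAULT_EXTENSIONS = (".jpg", ".jpeg", ".png")


def write_file_list(image_root, list_path, extensions=DEFAULT_EXTENSIONS):
    """Write absolute paths of images under image_root to list_path, one per row.

    Returns the number of paths written.
    """
    root = Path(image_root).resolve()
    wanted = {ext.lower() for ext in extensions}
    # Walk the folder recursively and keep only files with a wanted extension.
    paths = sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )
    Path(list_path).write_text("".join(path + "\n" for path in paths))
    return len(paths)


# Usage (assumes fastdup is installed and an images/ folder exists):
# write_file_list("images/", "image_list.txt")
# import fastdup
# fd = fastdup.create(work_dir="fastdup_work_dir/", input_dir="image_list.txt")
```

Passing the resulting file as input_dir follows the "file containing absolute filenames each on its own row" option listed above.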

Note: It is not possible to mix compressed inputs (videos or tar/zip archives) and regular images. Use the flag tar_only=True if you want to ignore images and run only on compressed files.
Note2: Images are expected to be at least 10x10 pixels. Smaller images (in either width or height) are ignored, and a warning is shown.
Note3: You can also skip small images by defining a minimum allowed file size, for example min_file_size=1000 (in bytes).
Note4: For performance reasons, it is always preferable to copy images from s3 to local disk and then run fastdup on the local copy, since fetching images from s3 one at a time in a loop is very slow. Alternatively, you can use the flag sync_s3_to_local=True to copy all images from the remote s3 bucket to disk ahead of time.

Note5: fastdup can also read images with other file extensions, as long as they are supported by OpenCV's imread(). If the file names do not end with a common image extension, you can prepare a csv file with the full image path, one per row, with no commas.
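Before setting a minimum file size as described in Note3, it can help to see which files would fall below the threshold. The sketch below uses only the Python standard library and does not call fastdup; the function name files_below_size is hypothetical, and treating the threshold as "strictly smaller than min_file_size" is an assumption about how the flag behaves.

```python
from pathlib import Path
from typing import List


def files_below_size(image_root: str, min_file_size: int = 1000) -> List[str]:
    """Return files under image_root whose size on disk is below min_file_size bytes."""
    return sorted(
        str(p) for p in Path(image_root).rglob("*")
        # st_size is the file size in bytes, the same unit as min_file_size.
        if p.is_file() and p.stat().st_size < min_file_size
    )


# Usage (assumes an images/ folder exists):
# for path in files_below_size("images/", min_file_size=1000):
#     print("below threshold:", path)
```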

A successful execution of the create() function yields the following output:

fastdup By Visual Layer, Inc. 2024. All rights reserved.

A fastdup dataset object was created!

Input directory is set to "/path/to/input/directory"
Work directory is set to "/path/to/work/directory"

The next steps are:
   1. Analyze your dataset with the .run() function of the dataset object
   2. Interactively explore your data on your local machine with the .explore() function of the dataset object

For more information, use help(fastdup) or check our documentation.

Since the input directory can be set later on in the run() function, it is not a mandatory parameter. If you don't provide it, you will see the following warning in the output:

fastdup By Visual Layer, Inc. 2024. All rights reserved.

A fastdup dataset object was created!

Input directory is still not set. To proceed, you must provide an input directory to the .run() function or call .create() again.
Work directory is set to "/path/to/work/directory"

The next steps are:
   1. Analyze your dataset with the .run() function of the dataset object
   2. Interactively explore your data on your local machine with the .explore() function of the dataset object

For more information, use help(fastdup) or check our documentation.

After the dataset object is created, you can access its functions, such as run() and explore().
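The deferred-input flow can be sketched as follows: create the dataset object with only work_dir, then supply the input directory when calling run(). The wrapper name create_then_run is hypothetical, and passing the directory as the keyword argument input_dir to run() is an assumption based on the warning text shown earlier; check help(fastdup) for the exact signature in your installed version.

```python
def create_then_run(work_dir: str, input_dir: str):
    """Create a dataset object without input_dir, then provide it to run()."""
    # Imported inside the function so the sketch can be loaded
    # even where fastdup is not installed.
    import fastdup

    # input_dir is omitted on purpose; create() then prints the
    # "Input directory is still not set" warning.
    fd = fastdup.create(work_dir=work_dir)

    # Assumption: run() accepts the input directory as input_dir.
    fd.run(input_dir=input_dir)
    return fd


# Usage (assumes fastdup is installed and an images/ folder exists):
# fd = create_then_run("fastdup_work_dir/", "images/")
# fd.explore()
```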