Duplicate Images/Objects

What Are Duplicate Images/Objects?

Duplicate Images are multiple copies or versions of the same Image. These copies can exist within a single Dataset or across different Datasets. Duplicate Images have identical or nearly identical visual content, even when they differ in Image format, resolution, or file size.
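To illustrate how two Images can count as duplicates despite differing in resolution, here is a minimal sketch of an average hash, one common perceptual-hashing technique. It is not a description of how Visual Layer measures similarity. The function names and the grid-of-intensities input are assumptions made for illustration; a real pipeline would decode Image files with an imaging library first.

```python
from typing import List

# Hypothetical input format: a grayscale image as rows of pixel intensities.
Grid = List[List[float]]


def downsample(image: Grid, size: int = 8) -> Grid:
    """Shrink an image to size x size by averaging the pixels in each block.

    Assumes the image is at least size pixels wide and tall.
    """
    height, width = len(image), len(image[0])
    small = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        row = []
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [image[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(image: Grid, size: int = 8) -> int:
    """Return a size*size-bit fingerprint: one bit per cell, set if above the mean."""
    cells = [v for row in downsample(image, size) for v in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for v in cells:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


# Usage: compare an image with a copy upscaled to twice the resolution.
original = [[(x + y) * 8 for x in range(16)] for y in range(16)]
upscaled = [[original[y // 2][x // 2] for x in range(32)] for y in range(32)]
distance = hamming_distance(average_hash(original), average_hash(upscaled))
print("differing bits:", distance)
```

A small Hamming distance suggests the two Images share the same visual content. The cutoff that separates "duplicate" from "merely similar" is a tuning choice that depends on the Dataset.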

Here are a few scenarios in which duplicate Images can occur:

  1. Replication: Duplicate Images can be created intentionally or unintentionally by making copies of an original Image. This can occur when users or systems duplicate Images to store them in different locations, back up data, or share them across platforms.

  2. Data collection: Encountering duplicate Images is common when collecting large Datasets from various sources. Different sources may provide similar or identical Images, especially in cases where the Dataset is compiled from publicly available sources or stock image libraries. For example, web crawling or image scraping processes can inadvertently retrieve multiple copies of the same Image from different URLs or sources, leading to duplicate Images in a Dataset.

  3. Data augmentation: In machine learning, data augmentation techniques are often used to artificially expand the training Dataset by applying transformations or modifications to existing Images. Sometimes, these transformations can result in Images that are nearly identical to the original, thus creating duplicates.
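For the simplest of these scenarios, byte-for-byte copies produced by replication or repeated crawling, a content hash is enough to find the copies. The sketch below assumes the files are already loaded into memory as bytes; the function name and the example file names are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def group_exact_duplicates(files: Dict[str, bytes]) -> List[List[str]]:
    """Group file names whose byte content is identical."""
    by_digest = defaultdict(list)
    for name, data in files.items():
        # Identical bytes always produce the same SHA-256 digest.
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    # Keep only digests shared by more than one file.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Usage with made-up file contents standing in for image bytes.
crawled = {
    "site_a/cat.jpg": b"\xff\xd8 example bytes",
    "site_b/cat_copy.jpg": b"\xff\xd8 example bytes",
    "site_c/dog.jpg": b"\xff\xd8 other bytes",
}
print(group_exact_duplicates(crawled))
```

This only catches exact copies. A re-encoded, resized, or lightly augmented Image has different bytes and needs a similarity-based comparison instead.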

Why Is This a Pain?

Detecting and handling duplicate Images is important for several reasons, such as:

  1. Data utilization and processing efficiency: Duplicate Images consume unnecessary storage space, especially in large Datasets. During training or processing they also cost compute time without contributing any additional information. Removing duplicates helps optimize both storage and processing resources.

  2. Data quality: Duplicate Images can skew statistical analysis and machine learning algorithms by inflating the representation of certain Images or classes. Removing duplicates helps keep the Dataset balanced and representative.

Duplicate Images vs. Objects

Duplicate Images: Image groups that are identical or nearly identical to one another (e.g., "Found 30 Images with above 99% similarity").

Duplicate Objects: Object groups that are identical or nearly identical to one another (e.g., "Found 30 Objects with above 99% similarity").

Object duplications may stem from multiple overlapping bounding boxes in the same Image, such as when annotators annotate the same Object more than once with minor sizing and position differences.
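One common way to flag such overlapping annotations is to compute the intersection over union (IoU) of each pair of bounding boxes in an Image and report pairs above a threshold. The sketch below is illustrative only: the (x_min, y_min, x_max, y_max) box format, the function names, and the 0.9 threshold are assumptions, and a real check would usually also require the two boxes to carry the same class label.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, in the range 0.0 to 1.0."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def find_duplicate_boxes(boxes: List[Box], threshold: float = 0.9) -> List[Tuple[int, int]]:
    """Return index pairs of boxes that overlap at or above the threshold."""
    return [
        (i, j)
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
        if iou(boxes[i], boxes[j]) >= threshold
    ]


# Usage: the first two boxes annotate the same Object with slight offsets.
annotations = [
    (10, 10, 110, 110),
    (12, 11, 111, 112),
    (300, 300, 400, 400),
]
print(find_duplicate_boxes(annotations))
```

The pairwise loop is quadratic in the number of boxes, which is usually acceptable for the handful of annotations in a single Image.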

Possible Mitigation

The Visual Layer platform provides users with powerful tools for data deduplication.

Before using or analyzing a Dataset, we advise performing preprocessing and data cleaning steps to identify and remove duplicates. Using a graph-based similarity engine, Visual Layer can detect duplicates based on Image similarity. This process helps keep the Dataset free of redundant or repeated Images.
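As a rough illustration of the general idea behind graph-based deduplication, the sketch below treats each Image as a node, connects two nodes when their similarity meets a threshold, and reports each connected group as one set of duplicates. This is not Visual Layer's implementation. The function name, the (id, id, similarity) input format, and the 0.99 threshold are assumptions, and the similarity scores are taken as given.

```python
from typing import Dict, Iterable, List, Tuple


def group_duplicates(
    pairs: Iterable[Tuple[str, str, float]], threshold: float = 0.99
) -> List[List[str]]:
    """Group image ids linked by similarity >= threshold (connected components)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, similarity in pairs:
        if similarity >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    clusters: Dict[str, List[str]] = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)
    return sorted(sorted(group) for group in clusters.values() if len(group) > 1)


# Usage with made-up similarity scores.
scores = [
    ("img_a", "img_b", 0.995),
    ("img_b", "img_c", 0.992),
    ("img_d", "img_e", 0.50),
    ("img_f", "img_g", 0.999),
]
for group in group_duplicates(scores):
    print("keep", group[0], "and review", group[1:])
```

Grouping by connected components means img_a and img_c land in the same group through img_b even if they were never compared directly. With a looser threshold this chaining can merge Images that are not true duplicates, so each group is best reviewed before anything is deleted.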