Visualizing Data with fastdup

Overview

There are two ways to visualize the fastdup results:

  • Using the interactive data exploration
  • Using the HTML galleries

Using the interactive data exploration

To explore the dataset and issues interactively in a browser, run:

fd.explore()

This function launches the Visual Layer application locally on your machine and prints the following output:

The Visual Layer application was launched on your machine, you can find it on http://localhost:9999/dataset/6ea3a31c-c02c-4748-8f05-44b179332305/data?page=1 in your web browser.
Use Ctrl + C to stop the application server.

For more information, use help(fastdup) or check our documentation [link].

Note: Viewing the interactive exploration currently requires you to sign up (for free). Alternatively, you can visualize fastdup results non-interactively using fastdup's built-in HTML galleries.

You'll be presented with the following view that lets you conveniently view, filter and curate your dataset in a web interface.

You can learn more about using the data exploration to explore your data.
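The fd object used in this section is not defined on this page. As a minimal sketch, assuming it comes from fastdup.create() followed by fd.run(), the full sequence might look like the following. The function name launch_explorer and the directory paths are hypothetical placeholders.

```python
def launch_explorer(input_dir: str, work_dir: str) -> None:
    """Analyze the images in input_dir, then open the interactive exploration."""
    # Imported inside the function so this sketch loads even without fastdup installed.
    import fastdup

    # Assumption: fd is created with fastdup.create(); adapt this to your own setup.
    fd = fastdup.create(work_dir=work_dir, input_dir=input_dir)
    fd.run()      # analyze the dataset before exploring it
    fd.explore()  # starts the local server; stop it with Ctrl + C


# Example call (placeholder paths):
# launch_explorer("images/", "fastdup_work_dir/")
```

Because fd.explore() keeps a server running until it is interrupted, it is usually best placed as the last step of a script or notebook cell.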

Using the HTML galleries

To visualize the results, or individual parts of the analysis, fastdup generates galleries as HTML files. These files are saved to the galleries subdirectory of the work directory and are displayed interactively when running in a Jupyter notebook.
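Since the galleries are plain HTML files on disk, they can also be located outside a notebook. The sketch below lists them with the standard library only. The helper name list_gallery_files is hypothetical, and it assumes the gallery files use the .html extension.

```python
from pathlib import Path


def list_gallery_files(work_dir: str) -> list[Path]:
    """Return the HTML gallery files under <work_dir>/galleries, sorted by name."""
    galleries_dir = Path(work_dir) / "galleries"
    if not galleries_dir.is_dir():
        return []  # no galleries have been generated yet
    return sorted(galleries_dir.glob("*.html"))


# Example call (placeholder path):
# for path in list_gallery_files("fastdup_work_dir/"):
#     print(path)
```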

Starting from V1.0, galleries have a new layer of abstraction that automatically adds bounding boxes and labels to images where available.

There are 5 types of galleries:

fd.vis.duplicates_gallery()    # create a visual gallery of duplicates
fd.vis.outliers_gallery()      # create a visual gallery of anomalies
fd.vis.component_gallery()     # create a visualization of connected components
fd.vis.stats_gallery()         # create a visualization of image statistics (e.g. blur)
fd.vis.similarity_gallery()    # create a gallery of similar images

Components: Fastdup.vis.component_gallery

Duplicates: Fastdup.vis.duplicates_gallery

Outliers: Fastdup.vis.outliers_gallery

Image statistics: Fastdup.vis.stats_gallery

Similarity: Fastdup.vis.similarity_gallery
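To generate all five galleries in one pass, the method names listed above can be looped over. This is a rough sketch rather than part of fastdup: the helper build_all_galleries is hypothetical, and it assumes every gallery method accepts the shared num_images argument.

```python
GALLERY_METHODS = (
    "duplicates_gallery",
    "outliers_gallery",
    "component_gallery",
    "stats_gallery",
    "similarity_gallery",
)


def build_all_galleries(fd, num_images=20):
    """Call each gallery method on fd.vis and collect the return values by name."""
    results = {}
    for name in GALLERY_METHODS:
        gallery_fn = getattr(fd.vis, name)  # e.g. fd.vis.duplicates_gallery
        results[name] = gallery_fn(num_images=num_images)
    return results


# Example call, with fd being a fastdup object on which run() has completed:
# build_all_galleries(fd, num_images=10)
```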


Gallery configuration

Galleries share a few methods and arguments used for visualizing labels and bounding boxes, and for setting general attributes:

  • slice: Visualize a subset of the data with the given label, e.g., slice='dog'

  • sort_by: Sort images by a property. Supported values are:

    • comp_size (default) - The number of images in the component
    • distance - The average distance between cluster members; clusters whose images are most similar are presented first
    • area - The average image or bounding box size, from largest to smallest
  • label_col: Column to use as labels, common options are label, split and img_filename.

  • num_images: (default=20) The number of images to visualize.

  • max_width: (default=None) Pixel width of displayed gallery. Useful values are often in the 800-1200 range.

  • lazy_load: (default=False) When False, images are embedded into the gallery HTML files; otherwise, the browser loads them using their relative paths. Lazy loading makes galleries lighter and faster to generate, but less portable and shareable. Without lazy loading, gallery files can become very large.
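The arguments above can be combined in a single call. As an illustrative sketch, the hypothetical helper gallery_options below collects them into keyword arguments and rejects sort_by values that are not in the list above. It is not part of fastdup, and its defaults simply mirror the ones documented here.

```python
SORT_BY_VALUES = ("comp_size", "distance", "area")


def gallery_options(slice=None, sort_by="comp_size", label_col="label",
                    num_images=20, max_width=None, lazy_load=False):
    """Validate the shared gallery arguments and return them as keyword arguments."""
    if sort_by not in SORT_BY_VALUES:
        raise ValueError(f"sort_by must be one of {SORT_BY_VALUES}, got {sort_by!r}")
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    options = {
        "sort_by": sort_by,
        "label_col": label_col,
        "num_images": num_images,
        "lazy_load": lazy_load,
    }
    # Optional arguments are only passed along when they were set explicitly.
    if slice is not None:
        options["slice"] = slice
    if max_width is not None:
        options["max_width"] = max_width
    return options


# Example call: components labeled 'dog', most similar clusters first.
# fd.vis.component_gallery(**gallery_options(slice="dog", sort_by="distance",
#                                            max_width=1000, lazy_load=True))
```

Setting lazy_load=True here trades portability for smaller files, which tends to suit local inspection better than sharing.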

Adding Labels

The label_col argument controls which labels are attached to each visualized image. By default, labels are taken from the label column of the annotations DataFrame provided in the fastdup.run() call. When labels are not provided, or when another column should be used instead, set label_col to the name of the desired column.
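As a hedged sketch of this workflow, the hypothetical helper below passes an annotations table to run() and then labels the outliers gallery with a column other than label. The choice of the outliers gallery and of the split column is only for illustration, and the filename column in the commented example is an assumption about the annotations layout.

```python
def outliers_gallery_by_column(fd, annotations, label_col="split"):
    """Run fastdup with annotations, then label outlier images using label_col."""
    # Fail early if the requested column is missing from the annotations.
    if label_col not in annotations.columns:
        raise KeyError(f"annotations has no column named {label_col!r}")
    fd.run(annotations=annotations)
    return fd.vis.outliers_gallery(label_col=label_col)


# Example call (assumes pandas is installed; file names are placeholders):
# import pandas as pd
# annotations = pd.DataFrame({
#     "filename": ["images/cat_01.jpg", "images/dog_01.jpg"],
#     "label": ["cat", "dog"],
#     "split": ["train", "val"],
# })
# outliers_gallery_by_column(fd, annotations, label_col="split")
```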