Guides

Analyzing Data with fastdup

After the dataset object is created, you can access its functions, such as run() and explore().
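
For example, a minimal sketch of creating the dataset object (the work_dir and input_dir paths here are illustrative):

import fastdup

# Create the dataset object: work_dir stores fastdup's outputs,
# input_dir points at the folder containing the images to analyze.
fd = fastdup.create(work_dir="fastdup_work_dir", input_dir="images/")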

The run() function analyzes the data in the input directory, computes similarities between images, and identifies data quality issues.

fd.run()

The output of this function is as follows:

fastdup By Visual Layer, Inc. 2024. All rights reserved.
Done: 100%|██████████████████████████████████████████████| 3/3 [01:20<00:00, 26.86s/it]
Analysis complete. Use the .explore() function to interactively explore your data on your local machine.

Alternatively, you can generate HTML-based galleries.
For more information, use help(fastdup) or check our documentation [link].
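
As the output suggests, once the run completes you can open the interactive exploration tool on your local machine:

fd.explore()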

View analysis summary

Once the run completes, you can optionally view a summary of the results with:

fd.summary()

The output of this function is as follows:

 ########################################################################################

Dataset Analysis Summary: 

    Dataset contains 7390 images
    Valid images are 99.92% (7,384) of the data, invalid are 0.08% (6) of the data
    For a detailed analysis, use `.invalid_instances()`.

    Components:  failed to find images clustered into components, try to run with lower cc_threshold.
    Outliers: 6.14% (454) of images are possible outliers, and fall in the bottom 5.00% of similarity values.
    For a detailed list of outliers, use `.outliers()`.

The function also returns these lines as a list of strings:

['Dataset contains 7390 images',
 'Valid images are 99.92% (7,384) of the data, invalid are 0.08% (6) of the data',
 'For a detailed analysis, use `.invalid_instances()`.\n',
 'Components:  failed to find images clustered into components, try to run with lower cc_threshold.',
 'Outliers: 6.14% (454) of images are possible outliers, and fall in the bottom 5.00% of similarity values.',
 'For a detailed list of outliers, use `.outliers()`.\n']

run() arguments

We recommend calling run() without any parameters unless you have a specific use case that requires advanced tweaking. For those cases, it accepts the following parameters:

def run(input_dir: Union[str, Path, list] = None,
        annotations: pd.DataFrame = None,
        embeddings=None,
        subset: list = None,
        data_type: str = 'infer',
        overwrite: bool = True,
        model_path=None,
        distance='cosine',
        nearest_neighbors_k: int = 2,
        threshold: float = 0.9,
        outlier_percentile: float = 0.05,
        num_threads: int = None,
        num_images: int = None,
        verbose: bool = False,
        license: str = None,
        high_accuracy: bool = False,
        cc_threshold: float = 0.96,
        sync_s3_to_local: bool = False,
        run_stats: bool = True, 
        run_advanced_stats: bool = False,
        nnf_mode: str = "HNSW32",
        find_regex: str = "",
        bounding_box: str = None, 
        augmentation_additive_margin: int = 0, 
        augmentation_vert: float = 0, 
        augmentation_horiz: float = 0,
        **kwargs)
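
For example, a quick sketch of a tuned run that tightens the similarity thresholds and enables advanced statistics (the values here are illustrative, not recommendations):

# Re-run the analysis with stricter similarity thresholds and
# advanced image statistics; overwrite any previous results.
fd.run(threshold=0.95,
       cc_threshold=0.98,
       run_advanced_stats=True,
       overwrite=True)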

Arguments:

input_dir: Location of the images/videos to analyze

  • A folder
  • A remote folder (s3 or minio, starting with s3:// or minio://). When using minio, include the minio server name, for example minio://google/visual_db/sku110k
  • A file containing absolute filenames each on its own row
  • A file containing s3 full paths or minio paths each on its own row
  • A python list with absolute filenames
  • A python list with absolute folders; all images and videos in those folders are added recursively
  • A yolo-v5 yaml input file containing train and test folders (a single folder is supported for now)
  • Supported formats: jpg, jpeg, tiff, tif, gif, heif, heic, bmp, png, webp, mp4, avi. In addition, we support tar, tar.gz, tgz, and zip files containing images, as well as 16-bit RGBA, RGB, and grayscale images.

If you have other image extensions that are readable by opencv's imread(), you can list them in a file (each image on its own row); in that case we skip the known-extension check and use opencv to read those formats.

Note: It is not possible to mix compressed inputs (videos or tars/zips) and regular images.
Use the flag tar_only=True if you want to ignore images and run only on the compressed files.
Note2: Image sizes must be at least 10x10 pixels.
Smaller images (in either width or height) are ignored, with a warning shown.
Note3: It is also possible to skip small images by defining a minimum allowed file size, using
min_file_size=1000 (in bytes).
Note4: For performance reasons it is always preferable to copy s3 images to local disk and then run fastdup on the local disk, since copying images from s3 one at a time is very slow. Alternatively, you can use the flag sync_s3_to_local=True to copy all images on the remote s3 bucket to disk ahead of the run.

Note5: fastdup can read images with other format extensions as well, as long as they are supported by opencv.imread(). If the files do not end with a common image extension, you can prepare a csv file with one full image path per row (no commas, please!).
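
A minimal sketch of this approach (the file names and paths are illustrative):

# Write absolute image paths, one per row, into a csv file with
# no header and no commas.
paths = ["/data/scans/img_0001.jp2", "/data/scans/img_0002.jp2"]
with open("my_files.csv", "w") as f:
    f.write("\n".join(paths))

# Pass the file as input_dir; fastdup uses opencv to read the listed files.
fd.run(input_dir="my_files.csv")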

  • annotations: Optional dataframe with annotations. Images are given in the column filename. Optional class labels are given in the column label. (See the annotations sketch at the end of this section.)
    • An optional bounding box structure contains the fields col_x, row_y, width, height.
    • An optional rotated bounding box contains the fields x1,y1,x2,y2,x3,y3,x4,y4.
    • Alternatively, annotations can point to a json file containing COCO annotations.
    • Alternatively, annotations can be a dictionary containing COCO annotations.
  • subset: List of images to run on. If None, run on all the images/bboxes.
  • data_type: Type of data to run on. Supported types: 'image', 'bbox'. Default is 'infer'.
  • model_path: Path to an alternative onnx/ort model for feature vector extraction. Supported formats are onnx and ort files. (Make sure the model output has a single channel; please reach out to us for adding support for additional models.)
    Make sure to update the d parameter (feature vector width) accordingly when changing the model file. Reserved values (models that are automatically downloaded) are:
    • None: the default fastdup model
    • dinov2s: Meta's dinov2 model, small
    • dinov2b: Meta's dinov2 model, big
    • clip: OpenAI's ViT-B/32 clip model
    • clip336: OpenAI's ViT-L-14@336px clip model
    • resnet50: the resnet50-v1-12.onnx model from GitHub onnx
    • efficientnet: the efficientnet-lite4-11 model from GitHub onnx
    • Note: check the model provider's license; we do not provide the model, it is downloaded directly from the provider, and usage should conform to the model license.
  • distance: Distance metric for the Nearest Neighbors algorithm.
    The default is 'cosine', which works well in most cases. For nn_provider='nnf' the following distance metrics
    are supported: when using nnf_mode='Flat', 'cosine', 'euclidean', 'l1', 'linf', 'canberra',
    'braycurtis', and 'jensenshannon' are supported; otherwise 'cosine' and 'euclidean' are supported.
  • num_images: Number of images to run on. By default, run on all the images in the image_dir folder. When running from an s3 bucket with a large number of images, this speeds up the run, as it limits the number of images consumed.
  • nearest_neighbors_k: Number of similarities to compute per image or video frame.
  • high_accuracy: Compute a more accurate model. Runtime increases by about 15%, and feature vector storage
    size/memory increases by about 60%. The upside is that the model can better distinguish minute details in
    images with many objects.
  • outlier_percentile: Percentile of the outlier score to use as threshold. Default is 0.05 (5%).
  • threshold: Threshold to use for the graph generation. Default is 0.9.
  • cc_threshold: Threshold to use for the graph connected component. Default is 0.96.
  • bounding_box: Optional bounding box to crop images, given as
    bounding_box='row_y=xx,col_x=xx,height=xx,width=xx'. This defines a global bounding box to be used
    for all images.
    • bounding_box='face' runs a face detection model and crops the face from the image (in case a face is present).
    • bounding_box='ocr' runs an OCR model on the image and crops any text regions.
    • bounding_box='yolov5s' runs the yolov5s model on the image and crops any detected objects.
    • In the 'face' / 'ocr' / 'yolov5s' modes, an output file named atrain_crops.csv is created in the work_dir, listing all crop dimensions and source images.
    • For the face crop, the margin around the face is defined by augmentation_horiz=0.2, augmentation_vert=0.2, where 0.2 means 20% additional margin around the face relative to the width and height respectively. The lowest allowed value is 0 (no margin) and the highest is 1. Default is 0.2. Another parameter is augmentation_additive_margin, which adds X pixels around the object frame. The margin arguments cannot be used together; the margin is either multiplicative or additive. (See the face-crop sketch at the end of this section.)
  • num_threads: Number of threads. By default, autoconfigured by the number of cores.
  • license: Optional license key. If not provided, only free features are available.
  • overwrite: Optional flag to overwrite existing fastdup results.
  • verbose: Verbosity level. Set to True when debugging issues.
  • kwargs: Additional parameters for fastdup, including:
  • d: Model output dimension. Default is 576.
  • min_offset: Optional minimum offset at which to start iterating over the full file list. When given a folder, lists the folder and then starts from position min_offset in the list. This allows for parallel feature extraction.
  • max_offset: Optional maximum offset at which to stop iterating over the full file list. When given a folder, lists the folder and then stops at position max_offset in the list. This allows for parallel feature extraction.
  • nnf_mode: Selects the nnf model mode. Default is HNSW32. Flat is exact and not an approximation.
  • nnf_param: Optional parameters for the selected nnf mode.
  • resume: Optional flag to resume tar extraction from a previous run.
  • run_cc: Run connected components on the resulting similarity graph. Default is True.
  • delete_tar: Delete tars after download from s3/minio.
  • delete_img: Delete images after download from s3/minio.
  • run_stats: Compute image statistics (default is True).
  • run_advanced_stats: Compute enhanced image statistics such as hue, saturation, and contrast.
  • sync_s3_to_local: When using aws s3 bucket, sync s3 to local folder to improve performance (recommended). Assumes there is enough local disk space to contain the data. Default is False.
  • find_regex: Optional regex to control which images are selected for the run when running from a local folder.
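
As referenced in the annotations item above, here is a minimal sketch of passing an annotations dataframe with class labels and bounding boxes (the file names, labels, and coordinates are illustrative):

import pandas as pd

# filename is required; label is optional; col_x/row_y/width/height
# describe an optional bounding box per row.
annotations = pd.DataFrame({
    "filename": ["images/cat_001.jpg", "images/dog_042.jpg"],
    "label":    ["cat", "dog"],
    "col_x":    [10, 35],
    "row_y":    [20, 15],
    "width":    [128, 96],
    "height":   [128, 96],
})

fd.run(annotations=annotations, data_type="bbox")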
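
And, as referenced in the bounding_box item, a sketch of cropping faces with a multiplicative margin before analysis (the margin values are illustrative):

# Detect faces and crop them with a 20% margin around each face,
# relative to its width and height.
fd.run(bounding_box='face',
       augmentation_horiz=0.2,
       augmentation_vert=0.2)

# After the run, crop dimensions and source images are listed in
# atrain_crops.csv inside the work_dir.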