Table of Contents
fastdup.engine
Fastdup Objects
This class provides all fastdup capabilities through a single interface.
Usage example
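A minimal sketch of the typical flow (paths are illustrative; assumes the fastdup.create() entry point, which returns a controller):

```python
import fastdup

# Point fastdup at an output work_dir and an input folder of images
fd = fastdup.create(work_dir="fastdup_work_dir", input_dir="images/")

# Extract features, find nearest neighbors, and build connected components
fd.run()

# Explore the results in the notebook
fd.vis.duplicates_gallery()   # pairs of near-duplicate images
fd.vis.outliers_gallery()     # images far away from the rest of the dataset
```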
run
input_dir: Location of the images/videos to analyze
- A folder
- A remote folder (s3 or minio starting with s3:// or minio://). When using minio, append the minio server name, for example minio://google/visual_db/sku110k
- A file containing absolute filenames each on its own row
- A file containing s3 full paths or minio paths each on its own row
- A python list with absolute filenames
- A python list with absolute folders; all images and videos in those folders are added recursively
- yolo-v5 yaml input file containing train and test folders (single folder supported for now)
- We support jpg, jpeg, tiff, tif, gif, heif, heic, bmp, png, webp, mp4, and avi files. In addition, we support tar, tar.gz, tgz, and zip files containing images. 16-bit RGBA, RGB, and grayscale images are also supported.
Note1: Use the flag tar_only=True if you want to ignore images and run only from compressed files.
Note2: Image sizes are assumed to be at least 10x10 pixels.
Smaller images (in either width or height) will be ignored, with a warning shown.
Note3: It is also possible to skip small images by defining a minimum allowed file size, e.g.
min_file_size=1000 (in bytes).
Note4: For performance reasons it is always preferable to copy images from s3 to local disk and then run fastdup on the local copy, since copying images from s3 one at a time is very slow. Use the flag
sync_s3_to_local=True to copy all images in the remote s3 bucket to disk ahead of the run.
Note5: fastdup can read images with other file extensions as well, as long as they are supported by opencv's imread(). If the files do not end with a common image extension, prepare a csv file with one full image path per row, without commas.
annotations: Optional dataframe with annotations.
- Images are given in the column filename.
- Optional class labels are given in the column label.
- An optional bounding box structure contains the fields col_x, row_y, width, height.
- An optional rotated bounding box contains the fields x1, y1, x2, y2, x3, y3, x4, y4.
- Alternatively, annotations can point to a json file containing COCO annotations.
- Alternatively, annotations can be a dictionary containing COCO annotations.
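For instance, a minimal annotations dataframe with class labels and plain bounding boxes could be built like this (file names and coordinates are illustrative):

```python
import pandas as pd

annotations = pd.DataFrame({
    "filename": ["images/img1.jpg", "images/img2.jpg"],
    "label":    ["cat", "dog"],
    # optional bounding-box columns
    "col_x":  [10, 40],
    "row_y":  [20, 35],
    "width":  [100, 80],
    "height": [120, 90],
})
```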
subset: List of images to run on. If None, run on all the images/bboxes.
data_type: Type of data to run on. Supported types: 'image', 'bbox'. Default is 'infer'.
model_path: Path to an alternative onnx/ort model for feature vector extraction. Supported formats are onnx and ort files. (The model output must have a single channel; please reach out to us for adding support for additional models.)
Make sure to update the d parameter (feature vector width) accordingly when changing the model file. Reserved values (models that are automatically downloaded) are:
- None: the default fastdup model
- dinov2s: Meta's dinov2 small model
- dinov2b: Meta's dinov2 big model
- clip: OpenAI's ViT-B/32 clip model
- clip336: OpenAI's ViT-L-14@336px clip model
- resnet50: the resnet50-v1-12.onnx model from GitHub onnx
- efficientnet: the efficientnet-lite4-11 model from GitHub onnx
- Note: check the model provider's license; we do not provide the model, it is downloaded directly from the provider, and usage should conform to the model license.
distance: Distance metric for the nearest neighbors algorithm. The default is 'cosine', which works well in most cases. For nn_provider='nnf' the following distance metrics are supported: when using nnf_mode='Flat', 'cosine', 'euclidean', 'l1', 'linf', 'canberra', 'braycurtis', and 'jensenshannon' are supported; otherwise 'cosine' and 'euclidean' are supported.
num_images: Number of images to run on. By default, runs on all the images in the image_dir folder. When running from an s3 bucket with a large number of images, this speeds up the run since it limits the number of images consumed.
nearest_neighbors_k: Number of similarities to compute per image or video frame.
high_accuracy: Compute a more accurate model. Runtime is increased by about 15% and feature vector storage size/memory by about 60%. The upside is that the model can better distinguish minute details in images with many objects.
outlier_percentile: Percentile of the outlier score to use as threshold. Default is 0.5 (50%).
threshold: Threshold to use for the graph generation. Default is 0.9.
cc_threshold: Threshold to use for the graph connected components. Default is 0.96.
bounding_box: Optional bounding box to crop images, given as bounding_box='row_y=xx,col_x=xx,height=xx,width=xx'. This defines a global bounding box to be used for all images. bounding_box='face' runs a face detection model and crops the face from the image (in case a face is present). bounding_box='ocr' runs an OCR model on the image and crops any text regions. bounding_box='yolov5s' runs the yolov5s model on the image and crops any detected objects.
- In the 'face' / 'ocr' / 'yolov5s' modes, an output file named atrain_crops.csv is created in the work_dir, listing all crop dimensions and source images.
- For the face crop, the margin around the face is defined by augmentation_horiz=0.2, augmentation_vert=0.2, where 0.2 means 20% additional margin around the face relative to the width and height respectively. The lowest allowed value is 0 (no margin) and the highest is 1. Default is 0.2. Another parameter is augmentation_additive_margin, which adds X pixels around the object frame. The margin arguments cannot be used together; the margin is either multiplicative or additive.
num_threads: Number of threads. By default, autoconfigured by the number of cores.
license: Optional license key. If not provided, only free features are available.
overwrite: Optional flag to overwrite existing fastdup results.
verbose: Verbosity level. Set to True when debugging issues.
kwargs: Additional parameters for fastdup.
d: Model output dimension. Default is 576.
min_offset: Optional minimum offset at which to start iterating on the full file list. When using a folder, fastdup lists the folder and then starts from position min_offset in the list. This allows for parallel feature extraction.
max_offset: Optional maximum offset at which to stop iterating on the full file list. When using a folder, fastdup lists the folder and then stops at position max_offset in the list. This allows for parallel feature extraction.
nnf_mode: Selects the nnf model mode. Default is HNSW32. Flat is exact and not an approximation.
nnf_param: Selects and assigns optional parameters.
resume: Optional flag to resume tar extraction from a previous run.
run_cc: Run connected components on the resulting similarity graph. Default is True.
delete_tar: Delete tar files after download from s3/minio.
delete_img: Delete images after download from s3/minio.
run_stats: Compute image statistics (default is True).
run_advanced_stats: Compute enhanced image statistics like hue, saturation, contrast, etc.
sync_s3_to_local: When using an aws s3 bucket, sync s3 to a local folder to improve performance (recommended). Assumes there is enough local disk space to contain the data. Default is False.
find_regex: Optional regex to control the images selected for the run when running from a local folder.
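As a sketch, several of these parameters can be combined in one call; the bucket path and values below are illustrative, and fastdup.create() is assumed as the entry point:

```python
import fastdup

fd = fastdup.create(work_dir="work_dir", input_dir="s3://my-bucket/images/")
fd.run(
    sync_s3_to_local=True,   # copy the s3 images to local disk ahead of the run (Note4)
    nearest_neighbors_k=5,   # similarities computed per image
    threshold=0.9,           # similarity-graph edge threshold
    cc_threshold=0.96,       # connected-components threshold
    num_threads=8,           # otherwise autoconfigured from the core count
)
```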
fastdup.fastdup_controller
FastdupController Objects
__init__
This class wraps the fastdup run and provides quick access to fastdup output files such as the similarity csv, outlier csv, etc. Moreover, the class provides several extra features:
- Ability to run connected component analysis on splits without calling fastdup run again
- Ability to add annotation file and quickly merge it to any of fastdup inputs
Currently the class supports running fastdup on images and object bounding boxes.
work_dir: target output dir or existing output dir
input_dir: (Optional) path to data dir
num_instances
valid_only: if True, return only valid annotations
annotations
valid_only: if True, return only valid annotations
similarity
data: add annotations
split: filter by split
include_unannotated: include instances that are not represented in the annotations
outliers
data: add annotations
split: filter by split
include_unannotated: include instances that are not represented in the annotations
embeddings
d: feature vector width (default 576).
Returns an np.ndarray containing a matrix with the embeddings; each row is one image and the matrix width is d.
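A short sketch of reading the embedding matrix back from a finished run (assumes the method is invoked on the controller as listed here):

```python
import fastdup

fd = fastdup.create(work_dir="work_dir")  # work_dir of a completed run
emb = fd.embeddings(d=576)                # n x 576 matrix, one row per image
print(emb.shape)
```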
feature_vector
img_path (str): a path pointing to an image; can be local, s3, or a minio path, see the run() documentation
model_path (str): optional path pointing to an onnx/ort model, see the run() documentation
d (int): feature vector width (default 576).
embeddings: 1 x d numpy matrix containing the feature embedding (row vector)
files: the image filename used to generate the embedding (should be equal to img_path)
feature_vectors
img_path (str): a path pointing to a folder (local, s3, or minio) or a list of images, see the run() documentation
model_path (str): optional path pointing to an onnx/ort model, see the run() documentation
d (int): feature vector width (default 576).
embeddings: n x d numpy matrix containing the feature embeddings (one row vector per image)
files: the image filenames used to generate the embeddings (this is important since no embeddings are created for broken images, and in addition the order of the images may change depending on your file system).
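A minimal sketch, assuming the method is called on the controller as listed here and that 'images/' is a local folder:

```python
import fastdup

fd = fastdup.create(work_dir="work_dir")

# Embed every image in the folder; broken images yield no embedding, so
# 'files' records which images were embedded and in which order.
embeddings, files = fd.feature_vectors("images/", d=576)
print(embeddings.shape)   # n x 576
```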
init_search
This function should be called only once before running searches. The search function is search(). Arguments:
k (int): number of nearest neighbors to search for
work_dir (str): working directory where fastdup.run was run.
d (int): (Optional) dimension of the feature vector. Default is 576.
model_path (str): (Optional) path to the onnx model file.
verbose (bool): (Optional) True for verbose mode
license (str): license key for using search.
store_int (int): 0 to return filenames, 1 to return offsets
turi_param (str): optional additional directives
threshold (float): optional threshold to find images with similarity >= threshold
Example:
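A minimal sketch of the init/search flow, assuming the methods are invoked on the controller as listed in this section and that a fastdup run already exists in work_dir:

```python
import fastdup

fd = fastdup.create(work_dir="work_dir")

# Build the search index once over the finished run
ret = fd.init_search(k=5, work_dir="work_dir", d=576)
assert ret == 0   # 0 means success

# Query with a new image; returns a pd.DataFrame with from, to, distance
df = fd.search("query_image.jpg")
print(df)
```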
If you use other models, you need to check their requirements. Returns:
ret (int): 0 in case of success, otherwise 1.
search
filename (str): full path pointing to an image.
img (PIL.Image): (Optional) loaded and resized PIL.Image; if given, the image is not read from filename
verbose (bool): (Optional) run in verbose mode, default is False
Returns:
ret (pd.DataFrame): None in case of error, otherwise a pd.DataFrame with from, to, distance columns
vector_search
filename: vector name (used for debugging)
vec (numpy): mandatory numpy matrix of size 1 x d, or a vector of size d
verbose (bool): (Optional) run in verbose mode, default is False
ret (pd.DataFrame): None in case of error, otherwise a pd.DataFrame with from, to, distance columns
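As a sketch (same assumptions as the search example above, with an illustrative random query vector):

```python
import numpy as np
import fastdup

fd = fastdup.create(work_dir="work_dir")
fd.init_search(k=5, work_dir="work_dir", d=576)

# Query with a raw 1 x d feature vector instead of an image
vec = np.random.rand(1, 576)
df = fd.vector_search(filename="debug_query", vec=vec)
```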
invalid_instances
img_stats
data: add annotations
split: filter by split
include_unannotated: include instances that are not represented in the annotations
config
connected_components
data: add annotations
split: filter by split
include_unannotated: include instances that are not represented in the annotations
connected_components_grouped
sort_by: column to sort by; by default the largest components are given first
ascending: sort ascending or descending
metric: optional stats metric for the component, for example blur, min (color), max (color), mean (color)
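A minimal sketch (assuming the controller entry point used above):

```python
import fastdup

fd = fastdup.create(work_dir="work_dir")

# Largest clusters first (the default ordering), with a per-component blur statistic
grouped = fd.connected_components_grouped(metric="blur")
print(grouped.head())
```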
run
- Calculates a subset of images to analyze
- Runs fastdup
- Maps images/bounding boxes to fastdup index
- Expands the annotation csv to include files that are not in the annotations but are in the subset
- Creates a version of the annotations that is grouped by image
input_dir: input directory containing images
annotations: (Optional) annotations file; the expected column convention is:
- img_filename: input_dir-relative filenames
- img_h, img_w (Optional): image height and width
- col_x, row_y, width, height (Optional): bounding box arguments
- split (Optional): data split, e.g. train, test, etc.
subset: (Optional) subset of images to analyze
embeddings: (Optional) pre-calculated embeddings
data_type: (Optional) data type, one of 'infer', 'image', 'bbox'
overwrite: (Optional) overwrite existing files
print_summary: print a summary report of the fastdup run results
fastdup_kwargs: (Optional) fastdup run arguments, see fastdup.run() documentation
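A minimal sketch using the column convention above (file names and split values are illustrative):

```python
import pandas as pd
import fastdup

annotations = pd.DataFrame({
    "img_filename": ["img1.jpg", "img2.jpg"],  # relative to input_dir
    "split": ["train", "test"],                # optional split column
})

fd = fastdup.create(work_dir="work_dir", input_dir="images/")
fd.run(annotations=annotations, data_type="image", print_summary=True)
```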
summary
Prints a summary report of the run, including counts and percentages of blurry and dark/bright images. Arguments:
verbose: verbose output
blur_threshold: threshold for counting an image as blurry
brightness_threshold: threshold for counting an image as bright
darkness_threshold: threshold for counting an image as dark
img_grouped_annot
set_fastdup_kwargs
input_kwargs: input kwargs to the init function
fastdup_convert_to_relpath
work_dir: location of files
input_dir: base dir for images
s3_folder_exists_and_not_empty
Folder should not be empty.
fastdup.fastdup_galleries
FastdupVisualizer Objects
__init__
controller: FastdupController instance
default_config: dict of default config for cv2, e.g. {'cv2_imread_flag': cv2.IMREAD_COLOR}
outliers_gallery
save_path: html file name to save the gallery, or a directory if lazy_load is True;
if None, save to the fastdup work_dir
num_images: number of images to display
lazy_load: if True, load images on demand, otherwise load all images into the html
label_col: column name of the label in the annotation dataframe
how: (Optional) outlier selection method.
- one = take the image that is far away from any one image (but may have other images close to it).
- all = take the image that is far away from all other images.
Default is one.
slice: (Optional) label or list of labels to select a slice of the outliers file
max_width: max width of the gallery
draw_bbox: if True, draw bounding boxes on the images
sort_by: (Optional) column name to sort the outliers by
ascending: (Optional) sort ascending or descending
save_artifacts: save artifacts to disk
show: show the gallery in the notebook
kwargs: additional parameters to pass to create_outliers_gallery
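A minimal sketch (assuming the visualizer is reached through the controller's vis attribute):

```python
import fastdup

fd = fastdup.create(work_dir="work_dir")
fd.vis.outliers_gallery(
    num_images=20,    # show the 20 strongest outliers
    how="one",        # far away from any one image (the default)
    lazy_load=False,  # embed all images into the html
)
```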
duplicates_gallery
save_path: html file name to save the gallery, or a directory if lazy_load is True;
if None, save to the fastdup work_dir
num_images: number of images to display
descending: display images with the highest similarity first
lazy_load: load images on demand, otherwise load all images into the html
label_col: column name of the label in the annotation dataframe
slice: (Optional) label or list of labels to select a slice of the duplicates file
max_width: max width of the gallery
draw_bbox: draw bounding boxes on the images
ascending: (Optional) sort ascending or descending
threshold: (Optional) threshold to filter out images with a similarity score below the threshold
save_artifacts: save artifacts to disk
show: show the gallery in the notebook
kwargs: additional parameters to pass to create_duplicates_gallery
similarity_gallery
save_path: html file name to save the gallery, or a directory if lazy_load is True;
if None, save to the fastdup work_dir
num_images: number of images to display
descending: display images with the highest similarity first
lazy_load: load images on demand, otherwise load all images into the html
label_col: column name of the label in the annotation dataframe
slice: (Optional) label or list of labels to select a slice of the similarity dataframe
max_width: max width of the gallery
draw_bbox: draw bounding boxes on the images
get_extra_col_func: (callable) optional parameter to allow adding an additional column to the report
threshold: (Optional) threshold to filter out images with a similarity score below the threshold
ascending: (Optional) sort ascending or descending
show: show the gallery in the notebook
kwargs: additional parameters to pass to create_similarity_gallery
stats_gallery
save_path: html file name to save the gallery, or a directory if lazy_load is True;
if None, save to the fastdup work_dir
metric: metric to sort images by (dark, bright, blur)
slice: label or list of labels for filtering the stats dataframe
label_col: label column name
lazy_load: load images on demand, otherwise load all images into the html
show: show the gallery in the notebook
component_gallery
save_path: html file name to save the gallery
num_images: number of images to display
lazy_load: load images on demand, otherwise load all images into the html
label_col: column name of the label in the annotation dataframe
group_by: [visual|label]. Group the report using the visual properties of the image or using the labels of the images. Default is visual.
slice: (Optional) label or list of labels to select a slice of the connected components
max_width: max width of the gallery
min_items: threshold to filter out components with fewer than min_items
max_items: max number of items to display for each component
draw_bbox: draw bounding boxes on the images
get_extra_col_func: (callable) optional parameter to allow adding an additional column to the report
threshold: (Optional) threshold to filter out images with a similarity score below the threshold
metric: (Optional) metric to use (like blur) for choosing components. Default is None.
sort_by: (Optional) 'area' | 'comp_size' | any column in the annotations; column name to sort the connected components by
sort_by_reduction: (Optional) 'mean' | 'sum'; reduction method to use when grouping connected components
ascending: (Optional) sort ascending or descending
show: show the gallery in the notebook
save_artifacts: save artifacts to disk
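A minimal sketch (same vis-attribute assumption as above):

```python
import fastdup

fd = fastdup.create(work_dir="work_dir")
fd.vis.component_gallery(
    group_by="label",  # group by annotation labels instead of visual clusters
    min_items=3,       # hide components with fewer than 3 items
    metric="blur",     # report the blur statistic per component
)
```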
Data Enrichments
fastdup.caption
model_name: (Optional) select the model used for captioning or VQA (ViT-GPT2 by default)
device: (Optional) select the processor used to compute captions (CPU by default)
batch_size: (Optional) set the number of images to process in a single batch (8 by default)
subset: (Optional) specify a subset of images to caption
vqa_prompt: (Optional) provide a prompt for visual question answering
kwargs: additional parameters to pass to caption
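A minimal sketch, assuming captioning is exposed on the controller as a caption() method and run over a finished analysis:

```python
import fastdup

fd = fastdup.create(work_dir="work_dir", input_dir="images/")
fd.run()

# Caption all images with the defaults: ViT-GPT2 on CPU, 8 images per batch
captions = fd.caption(device="cpu", batch_size=8)
```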