
How to parse extracted data using Visual Layer's API.

This guide shows how to parse image data, duplicates, and mislabels from metadata exported from Visual Layer.

Load the exported metadata

First, let's load the exported metadata into pandas DataFrames: one for the dataset-level information and one for the image-level information.

import pandas as pd
import json

with open("food101/metadata.json") as f:
    data = json.load(f)


# This is the dataset level information
info_df = pd.DataFrame([data["info"]])

# This is the image level information
media_items_df = pd.json_normalize(data["media_items"])

View the dataset level information

info_df
| schema_version | dataset | description | dataset_url | export_time | dataset_creation_time | exported_by | total_media_items |
|---|---|---|---|---|---|---|---|
| 1.1 | food101 | Exported from food101 at Visual Layer | Link | 2025-02-10T14:29:53.740569 | 2024-12-05T05:55:27.725598 | Dickson Neoh | 118 |
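
As a quick sanity check after loading, the `total_media_items` value in the dataset-level information can be compared with the number of rows parsed from `media_items`. The sketch below uses a small hypothetical `data` dictionary in place of the real `metadata.json`, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the parsed metadata.json, trimmed to two items.
data = {
    "info": {"dataset": "food101", "total_media_items": 2},
    "media_items": [
        {"media_id": "a", "file_name": "1.jpg"},
        {"media_id": "b", "file_name": "2.jpg"},
    ],
}

info_df = pd.DataFrame([data["info"]])
media_items_df = pd.json_normalize(data["media_items"])

# Compare the count declared in the export with the rows actually parsed.
expected_count = int(info_df.loc[0, "total_media_items"])
actual_count = len(media_items_df)
counts_match = expected_count == actual_count

print(f"declared={expected_count}, parsed={actual_count}, match={counts_match}")
```

A mismatch would suggest a truncated or partially written export, which is worth catching before filtering.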

Get Image Level Details

Each row in the dataframe corresponds to an image.

media_items_df

| media_id | media_type | file_name | file_path | file_size | uniqueness_score | height | width | url | cluster_id | metadata_items |
|---|---|---|---|---|---|---|---|---|---|---|
| d5227901-22c9-4744-a264-407d9671aa4a | image | 548938.jpg | 548938.jpg | 32.00KB | 0.004178 | 512 | 512 | Link | fbcad8ef-d863-46c9-83b7-1a3bd85e2e2b | [{'type': 'issue', 'properties': {'issue_type'... |
| 2546c70a-e0a4-4bfb-ac59-e2895bb96456 | image | 548231.jpg | 548231.jpg | 32.00KB | 0.006069 | 512 | 512 | Link | fbcad8ef-d863-46c9-83b7-1a3bd85e2e2b | [{'type': 'issue', 'properties': {'issue_type'... |
| 45c226e0-daba-4ca8-8eac-fec9d490ea36 | image | 835953.jpg | 835953.jpg | 28.18KB | 0.003515 | 512 | 384 | Link | 5bc63415-76d8-49ec-975a-9a021bf98770 | [{'type': 'issue', 'properties': {'issue_type'... |
| fb4f2430-fb42-4b29-9c12-6b89afee4c5a | image | 881518.jpg | 881518.jpg | 28.18KB | 0.008515 | 512 | 384 | Link | 5bc63415-76d8-49ec-975a-9a021bf98770 | [{'type': 'issue', 'properties': {'issue_type'... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

118 rows total

Filter By Uniqueness Score

Keeping only the images whose uniqueness score exceeds a threshold gives a representative sample of the images in the dataset.

UNIQUENESS_SCORE_THRESHOLD = 0.85

coreset_df = media_items_df[
    media_items_df["uniqueness_score"] > UNIQUENESS_SCORE_THRESHOLD
]
coreset_df

Filtered Media Items

The table below shows media items filtered by uniqueness score.

| media_id | media_type | file_name | file_size | uniqueness_score | height | width | url |
|---|---|---|---|---|---|---|---|
| 9cabe897-ca35-4164-9f6d-f1d34efa80ff | image | 620711.jpg | 25.36KB | 0.903683 | 307 | 512 | Link |
| e21cb975-8a4d-4274-80b7-c2f9d827a4d2 | image | 50022.jpg | 28.39KB | 0.955931 | 384 | 512 | Link |
| 0d02845e-bd57-4ea2-bb2a-95c640254e4d | image | 50036.jpg | 28.39KB | 0.959406 | 384 | 512 | Link |
| 9bb465d3-c5fa-4c2c-8a29-1a6c81d613b1 | image | 3217591.jpg | 32.33KB | 0.933178 | 384 | 512 | Link |
| 0e9fd046-9135-4ce9-bb23-c1adb71f73e4 | image | 2399575.jpg | 37.75KB | 0.954673 | 384 | 512 | Link |
| 018344e8-e8ac-4f67-ad4a-593f7bbae70a | image | 1590716.jpg | 22.09KB | 0.922416 | 288 | 512 | Link |
| 73fbc779-5b18-45cf-808e-25311bde48e4 | image | 2619753.jpg | 47.37KB | 0.911901 | 512 | 512 | Link |
| 6da7402b-3f0d-4d23-aca2-08504fc01a2c | image | 501296.jpg | 48.96KB | 0.961545 | 512 | 512 | Link |
| c1691c6f-6b39-4c26-9c10-ac4cda98c401 | image | 1552336.jpg | 26.43KB | 0.928079 | 512 | 384 | Link |
| be357442-ee51-4bc6-9166-4c752d9a7294 | image | 1638436.jpg | 30.97KB | 0.909931 | 384 | 512 | Link |
| 0a6fc26d-c05a-447e-b312-50705ff6ad8c | image | 4787.jpg | 41.83KB | 0.960010 | 512 | 384 | Link |
| c857b409-47c7-4823-ad65-98bc891836f0 | image | 51112.jpg | 41.83KB | 0.960634 | 512 | 384 | Link |
| 069a4de6-ca29-489c-a4d7-2bfe001d269d | image | 1566367.jpg | 25.08KB | 0.991139 | 512 | 384 | Link |
| 78bc4624-7ba6-4c22-8d37-f48664194aa7 | image | 2199941.jpg | 53.11KB | 0.960000 | 512 | 512 | Link |
| 7598ec52-3f78-4ce7-9ebd-9658c7601f75 | image | 85516.jpg | 45.26KB | 0.975198 | 512 | 384 | Link |
| e5ab0cb0-4c83-4062-a6d9-12abe3ad12b0 | image | 1486281.jpg | 17.19KB | 0.949752 | 512 | 307 | Link |
| 65f6e117-a067-4c56-8920-533980d516dd | image | 1527126.jpg | 17.19KB | 0.948040 | 512 | 307 | Link |

This table includes media items with a uniqueness score greater than 0.85.
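
The right threshold depends on how aggressively you want to thin the dataset, so it can help to see how many images survive at several cut-offs before settling on one. The following is a minimal sketch: `select_coreset` is a hypothetical helper, not part of any Visual Layer API, and `sample_df` holds only four rows copied from the tables in this guide.

```python
import pandas as pd

# Four rows taken from the tables in this guide, for illustration only.
sample_df = pd.DataFrame(
    {
        "file_name": ["548938.jpg", "620711.jpg", "50022.jpg", "1103647.jpg"],
        "uniqueness_score": [0.004178, 0.903683, 0.955931, 0.012366],
    }
)


def select_coreset(df, threshold):
    """Return rows with uniqueness_score above threshold, most unique first."""
    kept = df[df["uniqueness_score"] > threshold]
    return kept.sort_values("uniqueness_score", ascending=False)


# Report how many images each candidate threshold keeps.
for threshold in (0.5, 0.85, 0.95):
    coreset = select_coreset(sample_df, threshold)
    print(f"threshold={threshold}: kept {len(coreset)} of {len(sample_df)} images")
```

Sorting by score is optional; it simply puts the most distinctive images at the top when you inspect the result.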

Get Duplicate Images

The metadata_items column contains a list of issues for each image. We can filter for images with duplicate issues above a certain confidence threshold.

def has_duplicate_issue(metadata_items, confidence_threshold=0.8):
    if not isinstance(metadata_items, list):
        return False

    for item in metadata_items:
        if (
            item.get("type") == "issue"
            and item.get("properties", {}).get("issue_type") == "duplicates"
            and item.get("properties", {}).get("confidence", 0) > confidence_threshold
        ):
            return True
    return False


# Replace with your confidence threshold
CONFIDENCE_THRESHOLD = 0.8

duplicate_df = media_items_df[
    media_items_df["metadata_items"].apply(
        lambda x: has_duplicate_issue(x, confidence_threshold=CONFIDENCE_THRESHOLD)
    )
]

duplicate_df


Filtered Media Items

| media_id | media_type | file_name | file_path | file_size | uniqueness_score | height | width | url | cluster_id | metadata_items |
|---|---|---|---|---|---|---|---|---|---|---|
| d5227901-22c9-4744-a264-407d9671aa4a | image | 548938.jpg | 548938.jpg | 32.00KB | 0.004178 | 512 | 512 | Link | fbcad8ef-d863-46c9-83b7-1a3bd85e2e2b | [{'type': 'issue', 'properties': {'issue_type'... |
| 2546c70a-e0a4-4bfb-ac59-e2895bb96456 | image | 548231.jpg | 548231.jpg | 32.00KB | 0.006069 | 512 | 512 | Link | fbcad8ef-d863-46c9-83b7-1a3bd85e2e2b | [{'type': 'issue', 'properties': {'issue_type'... |
| 45c226e0-daba-4ca8-8eac-fec9d490ea36 | image | 835953.jpg | 835953.jpg | 28.18KB | 0.003515 | 512 | 384 | Link | 5bc63415-76d8-49ec-975a-9a021bf98770 | [{'type': 'issue', 'properties': {'issue_type'... |
| fb4f2430-fb42-4b29-9c12-6b89afee4c5a | image | 881518.jpg | 881518.jpg | 28.18KB | 0.008515 | 512 | 384 | Link | 5bc63415-76d8-49ec-975a-9a021bf98770 | [{'type': 'issue', 'properties': {'issue_type'... |
| 9cabe897-ca35-4164-9f6d-f1d34efa80ff | image | 620711.jpg | 620711.jpg | 25.36KB | 0.903683 | 307 | 512 | Link | 7815102e-c206-46be-9a77-b7c0ae6a729c | [{'type': 'issue', 'properties': {'issue_type'... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5eb2c044-3f0d-46eb-a164-d98440cf3d1f | image | 2671994.jpg | 2671994.jpg | 37.47KB | 0.003525 | 512 | 512 | Link | 9779daed-a923-4b52-8564-a54c5602c934 | [{'type': 'issue', 'properties': {'issue_type'... |
| e5ab0cb0-4c83-4062-a6d9-12abe3ad12b0 | image | 1486281.jpg | 1486281.jpg | 17.19KB | 0.949752 | 512 | 307 | Link | f219d30e-bc8b-45a2-8971-6225f3160741 | [{'type': 'issue', 'properties': {'issue_type'... |
| 65f6e117-a067-4c56-8920-533980d516dd | image | 1527126.jpg | 1527126.jpg | 17.19KB | 0.948040 | 512 | 307 | Link | f219d30e-bc8b-45a2-8971-6225f3160741 | [{'type': 'issue', 'properties': {'issue_type'... |
| 2317ca31-2856-4318-aeb6-550ff2bfbe8b | image | 1103647.jpg | 1103647.jpg | 21.74KB | 0.012366 | 306 | 512 | Link | 94465018-63b2-439a-ab91-dd0dd1bec49f | [{'type': 'issue', 'properties': {'issue_type'... |
| 881847be-3e3f-48ce-952d-83d40d356ec0 | image | 1103636.jpg | 1103636.jpg | 21.74KB | 0.015446 | 306 | 512 | Link | 94465018-63b2-439a-ab91-dd0dd1bec49f | [{'type': 'issue', 'properties': {'issue_type'... |

This table includes images with duplicate issues above a confidence threshold of 0.8.
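
Because `metadata_items` is a nested list, it is truncated in the DataFrame view and awkward to inspect directly. One way to look inside is to flatten it into one row per issue. The sketch below assumes only the fields used by the filter in this guide (`type`, `properties.issue_type`, `properties.confidence`); the `media_items` list is hypothetical sample data standing in for `data["media_items"]`.

```python
import pandas as pd

# Hypothetical media items that mimic the structure of data["media_items"].
media_items = [
    {"media_id": "a", "metadata_items": [
        {"type": "issue", "properties": {"issue_type": "duplicates", "confidence": 0.97}},
    ]},
    {"media_id": "b", "metadata_items": [
        {"type": "issue", "properties": {"issue_type": "duplicates", "confidence": 0.97}},
        {"type": "issue", "properties": {"issue_type": "mislabels", "confidence": 0.91}},
    ]},
    {"media_id": "c", "metadata_items": []},
]

# Build one row per (image, issue) pair, skipping entries that are not issues.
rows = []
for media_item in media_items:
    for entry in media_item.get("metadata_items") or []:
        if entry.get("type") != "issue":
            continue
        properties = entry.get("properties", {})
        rows.append(
            {
                "media_id": media_item["media_id"],
                "issue_type": properties.get("issue_type"),
                "confidence": properties.get("confidence"),
            }
        )

issues_df = pd.DataFrame(rows, columns=["media_id", "issue_type", "confidence"])

# Count how many issues of each type were found.
issue_counts = issues_df["issue_type"].value_counts().to_dict()

print(issues_df)
print(issue_counts)
```

With the issues in long format, ordinary DataFrame filters on `issue_type` and `confidence` can replace per-row helper functions if you prefer.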

Get Mislabels

We can filter for images with mislabel issues above a certain confidence threshold.

def has_mislabel_issue(metadata_items, confidence_threshold=0.8):
    if not isinstance(metadata_items, list):
        return False

    for item in metadata_items:
        if (
            item.get("type") == "issue"
            and item.get("properties", {}).get("issue_type") == "mislabels"
            and item.get("properties", {}).get("confidence", 0) > confidence_threshold
        ):
            return True
    return False


# Replace with your confidence threshold
CONFIDENCE_THRESHOLD = 0.8

mislabel_df = media_items_df[
    media_items_df["metadata_items"].apply(
        lambda x: has_mislabel_issue(x, confidence_threshold=CONFIDENCE_THRESHOLD)
    )
]


mislabel_df

| media_id | media_type | file_name | file_path | file_size | uniqueness_score | height | width | url | cluster_id | metadata_items |
|---|---|---|---|---|---|---|---|---|---|---|
| 41 | image | 2619752.jpg | 2619752.jpg | 47.37KB | 0.008683 | 512 | 512 | Link | f06ea82f-6df9-4224-8fda-4c87ce7ae202 | [{'type': 'issue', 'properties': {'issue_type'... |
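
The duplicate and mislabel filters differ only in the issue type they match, so they can be folded into a single parameterized helper. The sketch below shows one way to do that. `has_issue` and `filter_by_issue` are hypothetical names, and `sample_df` is made-up data with the same `metadata_items` structure used above.

```python
import pandas as pd


def has_issue(metadata_items, issue_type, confidence_threshold=0.8):
    """Return True if any issue of issue_type exceeds the confidence threshold."""
    if not isinstance(metadata_items, list):
        return False
    return any(
        item.get("type") == "issue"
        and item.get("properties", {}).get("issue_type") == issue_type
        and item.get("properties", {}).get("confidence", 0) > confidence_threshold
        for item in metadata_items
    )


def filter_by_issue(df, issue_type, confidence_threshold=0.8):
    """Return the rows of df whose metadata_items contain a matching issue."""
    mask = df["metadata_items"].apply(
        lambda items: has_issue(items, issue_type, confidence_threshold)
    )
    return df[mask]


# Made-up rows: one duplicate, one mislabel, one image without metadata.
sample_df = pd.DataFrame(
    {
        "media_id": ["a", "b", "c"],
        "metadata_items": [
            [{"type": "issue", "properties": {"issue_type": "duplicates", "confidence": 0.97}}],
            [{"type": "issue", "properties": {"issue_type": "mislabels", "confidence": 0.91}}],
            None,
        ],
    }
)

duplicate_df = filter_by_issue(sample_df, "duplicates")
mislabel_df = filter_by_issue(sample_df, "mislabels")

print(duplicate_df["media_id"].tolist())
print(mislabel_df["media_id"].tolist())
```

Other issue types present in your export can be filtered the same way by passing their `issue_type` string.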