How to parse extracted data using Visual Layer's API.
This example shows how to parse image data, duplicates, and mislabels from exported metadata.
Load the exported metadata
First, let's load the exported metadata into a pandas DataFrame.
import pandas as pd
import json

with open("food101/metadata.json") as f:
    data = json.load(f)

# This is the dataset level information
info_df = pd.DataFrame([data["info"]])

# This is the image level information
media_items_df = pd.json_normalize(data["media_items"])
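If the same loading step is needed for several exports, it can be wrapped in a small function that fails early when a top-level key is absent. The sketch below is one possible arrangement, not part of Visual Layer's tooling: `load_export` is a hypothetical helper name, and the tiny export written to a temporary file is made-up data used only so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_export(path):
    """Load an exported metadata file into (info_df, media_items_df).

    Hypothetical helper: assumes the file has the top-level keys
    "info" and "media_items" used in the loading code above.
    """
    with open(path) as f:
        data = json.load(f)
    # Fail early with a clear message if an expected key is missing.
    missing = [key for key in ("info", "media_items") if key not in data]
    if missing:
        raise KeyError(f"export is missing top-level keys: {missing}")
    info_df = pd.DataFrame([data["info"]])
    media_items_df = pd.json_normalize(data["media_items"])
    return info_df, media_items_df


# Made-up export with two images, written to a temporary file for the demo.
sample = {
    "info": {"dataset": "demo", "total_media_items": 2},
    "media_items": [
        {"media_id": "a", "file_name": "a.jpg", "uniqueness_score": 0.9},
        {"media_id": "b", "file_name": "b.jpg", "uniqueness_score": 0.1},
    ],
}
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "metadata.json"
    sample_path.write_text(json.dumps(sample))
    info_df, media_items_df = load_export(sample_path)

print(info_df)
print(media_items_df)
```

With a real export, the call would simply be `load_export("food101/metadata.json")`.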
View the dataset level information
info_df
schema_version | dataset | description | dataset_url | export_time | dataset_creation_time | exported_by | total_media_items |
---|---|---|---|---|---|---|---|
1.1 | food101 | Exported from food101 at Visual Layer | Link | 2025-02-10T14:29:53.740569 | 2024-12-05T05:55:27.725598 | Dickson Neoh | 118 |
Get Image Level Details
Each row in the dataframe corresponds to an image.
media_items_df
media_id | media_type | file_name | file_path | file_size | uniqueness_score | height | width | url | cluster_id | metadata_items |
---|---|---|---|---|---|---|---|---|---|---|
d5227901-22c9-4744-a264-407d9671aa4a | image | 548938.jpg | 548938.jpg | 32.00KB | 0.004178 | 512 | 512 | Link | fbcad8ef-d863-46c9-83b7-1a3bd85e2e2b | [{'type': 'issue', 'properties': {'issue_type'... |
2546c70a-e0a4-4bfb-ac59-e2895bb96456 | image | 548231.jpg | 548231.jpg | 32.00KB | 0.006069 | 512 | 512 | Link | fbcad8ef-d863-46c9-83b7-1a3bd85e2e2b | [{'type': 'issue', 'properties': {'issue_type'... |
45c226e0-daba-4ca8-8eac-fec9d490ea36 | image | 835953.jpg | 835953.jpg | 28.18KB | 0.003515 | 512 | 384 | Link | 5bc63415-76d8-49ec-975a-9a021bf98770 | [{'type': 'issue', 'properties': {'issue_type'... |
fb4f2430-fb42-4b29-9c12-6b89afee4c5a | image | 881518.jpg | 881518.jpg | 28.18KB | 0.008515 | 512 | 384 | Link | 5bc63415-76d8-49ec-975a-9a021bf98770 | [{'type': 'issue', 'properties': {'issue_type'... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
118 rows total |
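Since each row corresponds to one image, a quick consistency check can confirm that the export was read completely. The sketch below assumes that `total_media_items` in the dataset level information is meant to equal the number of rows in `media_items_df`; that interpretation is an assumption, and the two small frames are made-up stand-ins for the real ones.

```python
import pandas as pd

# Made-up stand-ins for info_df and media_items_df from the loading step.
info_df = pd.DataFrame([{"dataset": "demo", "total_media_items": 3}])
media_items_df = pd.DataFrame({"media_id": ["a", "b", "c"]})

# Compare the declared total against the number of rows actually loaded.
expected = int(info_df.loc[0, "total_media_items"])
actual = len(media_items_df)
counts_match = expected == actual

# Repeated media_id values would mean a row does not map to exactly one image.
ids_unique = media_items_df["media_id"].is_unique

print(f"declared={expected} loaded={actual} match={counts_match} unique_ids={ids_unique}")
```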
Filter By Uniqueness Score
Filtering on uniqueness score keeps only the images that score above a chosen threshold, which gives a representative sample of the images in the dataset.
UNIQUENESS_SCORE_THRESHOLD = 0.85

coreset_df = media_items_df[
    media_items_df["uniqueness_score"] > UNIQUENESS_SCORE_THRESHOLD
]
coreset_df
Filtered Media Items
The table below shows media items filtered by uniqueness score.
media_id | media_type | file_name | file_size | uniqueness_score | height | width | url |
---|---|---|---|---|---|---|---|
9cabe897-ca35-4164-9f6d-f1d34efa80ff | image | 620711.jpg | 25.36KB | 0.903683 | 307 | 512 | Link |
e21cb975-8a4d-4274-80b7-c2f9d827a4d2 | image | 50022.jpg | 28.39KB | 0.955931 | 384 | 512 | Link |
0d02845e-bd57-4ea2-bb2a-95c640254e4d | image | 50036.jpg | 28.39KB | 0.959406 | 384 | 512 | Link |
9bb465d3-c5fa-4c2c-8a29-1a6c81d613b1 | image | 3217591.jpg | 32.33KB | 0.933178 | 384 | 512 | Link |
0e9fd046-9135-4ce9-bb23-c1adb71f73e4 | image | 2399575.jpg | 37.75KB | 0.954673 | 384 | 512 | Link |
018344e8-e8ac-4f67-ad4a-593f7bbae70a | image | 1590716.jpg | 22.09KB | 0.922416 | 288 | 512 | Link |
73fbc779-5b18-45cf-808e-25311bde48e4 | image | 2619753.jpg | 47.37KB | 0.911901 | 512 | 512 | Link |
6da7402b-3f0d-4d23-aca2-08504fc01a2c | image | 501296.jpg | 48.96KB | 0.961545 | 512 | 512 | Link |
c1691c6f-6b39-4c26-9c10-ac4cda98c401 | image | 1552336.jpg | 26.43KB | 0.928079 | 512 | 384 | Link |
be357442-ee51-4bc6-9166-4c752d9a7294 | image | 1638436.jpg | 30.97KB | 0.909931 | 384 | 512 | Link |
0a6fc26d-c05a-447e-b312-50705ff6ad8c | image | 4787.jpg | 41.83KB | 0.960010 | 512 | 384 | Link |
c857b409-47c7-4823-ad65-98bc891836f0 | image | 51112.jpg | 41.83KB | 0.960634 | 512 | 384 | Link |
069a4de6-ca29-489c-a4d7-2bfe001d269d | image | 1566367.jpg | 25.08KB | 0.991139 | 512 | 384 | Link |
78bc4624-7ba6-4c22-8d37-f48664194aa7 | image | 2199941.jpg | 53.11KB | 0.960000 | 512 | 512 | Link |
7598ec52-3f78-4ce7-9ebd-9658c7601f75 | image | 85516.jpg | 45.26KB | 0.975198 | 512 | 384 | Link |
e5ab0cb0-4c83-4062-a6d9-12abe3ad12b0 | image | 1486281.jpg | 17.19KB | 0.949752 | 512 | 307 | Link |
65f6e117-a067-4c56-8920-533980d516dd | image | 1527126.jpg | 17.19KB | 0.948040 | 512 | 307 | Link |
This table includes media items with a uniqueness score greater than 0.85.
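The size of the filtered sample depends directly on the threshold, so it can help to see how many images survive at a few candidate values before settling on one. The sketch below uses made-up scores in place of the real `uniqueness_score` column, and the candidate thresholds are arbitrary illustrations rather than recommended values.

```python
import pandas as pd

# Made-up scores standing in for media_items_df["uniqueness_score"].
media_items_df = pd.DataFrame(
    {
        "file_name": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
        "uniqueness_score": [0.004, 0.60, 0.90, 0.96],
    }
)

# Count how many images survive each candidate threshold (strictly greater than).
thresholds = [0.5, 0.85, 0.95]
kept_counts = {
    t: int((media_items_df["uniqueness_score"] > t).sum()) for t in thresholds
}
print(kept_counts)

# File names of the sample kept at the threshold used above.
coreset_files = media_items_df.loc[
    media_items_df["uniqueness_score"] > 0.85, "file_name"
].tolist()
print(coreset_files)
```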
Get Duplicate Images
The metadata_items column contains a list of issues for each image. We can filter for images with duplicate issues above a certain confidence threshold.
def has_duplicate_issue(metadata_items, confidence_threshold=0.8):
    if not isinstance(metadata_items, list):
        return False

    for item in metadata_items:
        if (
            item.get("type") == "issue"
            and item.get("properties", {}).get("issue_type") == "duplicates"
            and item.get("properties", {}).get("confidence", 0) > confidence_threshold
        ):
            return True
    return False

# Replace with your confidence threshold
CONFIDENCE_THRESHOLD = 0.8

duplicate_df = media_items_df[
    media_items_df["metadata_items"].apply(
        lambda x: has_duplicate_issue(x, confidence_threshold=CONFIDENCE_THRESHOLD)
    )
]
duplicate_df
Duplicate Media Items
media_id | media_type | file_name | file_path | file_size | uniqueness_score | height | width | url | cluster_id | metadata_items |
---|---|---|---|---|---|---|---|---|---|---|
d5227901-22c9-4744-a264-407d9671aa4a | image | 548938.jpg | 548938.jpg | 32.00KB | 0.004178 | 512 | 512 | Link | fbcad8ef-d863-46c9-83b7-1a3bd85e2e2b | [{'type': 'issue', 'properties': {'issue_type'... |
2546c70a-e0a4-4bfb-ac59-e2895bb96456 | image | 548231.jpg | 548231.jpg | 32.00KB | 0.006069 | 512 | 512 | Link | fbcad8ef-d863-46c9-83b7-1a3bd85e2e2b | [{'type': 'issue', 'properties': {'issue_type'... |
45c226e0-daba-4ca8-8eac-fec9d490ea36 | image | 835953.jpg | 835953.jpg | 28.18KB | 0.003515 | 512 | 384 | Link | 5bc63415-76d8-49ec-975a-9a021bf98770 | [{'type': 'issue', 'properties': {'issue_type'... |
fb4f2430-fb42-4b29-9c12-6b89afee4c5a | image | 881518.jpg | 881518.jpg | 28.18KB | 0.008515 | 512 | 384 | Link | 5bc63415-76d8-49ec-975a-9a021bf98770 | [{'type': 'issue', 'properties': {'issue_type'... |
9cabe897-ca35-4164-9f6d-f1d34efa80ff | image | 620711.jpg | 620711.jpg | 25.36KB | 0.903683 | 307 | 512 | Link | 7815102e-c206-46be-9a77-b7c0ae6a729c | [{'type': 'issue', 'properties': {'issue_type'... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5eb2c044-3f0d-46eb-a164-d98440cf3d1f | image | 2671994.jpg | 2671994.jpg | 37.47KB | 0.003525 | 512 | 512 | Link | 9779daed-a923-4b52-8564-a54c5602c934 | [{'type': 'issue', 'properties': {'issue_type'... |
e5ab0cb0-4c83-4062-a6d9-12abe3ad12b0 | image | 1486281.jpg | 1486281.jpg | 17.19KB | 0.949752 | 512 | 307 | Link | f219d30e-bc8b-45a2-8971-6225f3160741 | [{'type': 'issue', 'properties': {'issue_type'... |
65f6e117-a067-4c56-8920-533980d516dd | image | 1527126.jpg | 1527126.jpg | 17.19KB | 0.948040 | 512 | 307 | Link | f219d30e-bc8b-45a2-8971-6225f3160741 | [{'type': 'issue', 'properties': {'issue_type'... |
2317ca31-2856-4318-aeb6-550ff2bfbe8b | image | 1103647.jpg | 1103647.jpg | 21.74KB | 0.012366 | 306 | 512 | Link | 94465018-63b2-439a-ab91-dd0dd1bec49f | [{'type': 'issue', 'properties': {'issue_type'... |
881847be-3e3f-48ce-952d-83d40d356ec0 | image | 1103636.jpg | 1103636.jpg | 21.74KB | 0.015446 | 306 | 512 | Link | 94465018-63b2-439a-ab91-dd0dd1bec49f | [{'type': 'issue', 'properties': {'issue_type'... |
This table includes images with duplicate issues above a confidence threshold of 0.8.
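In the table above, rows that appear to be duplicates of one another share a `cluster_id`. Assuming that pattern holds for the whole export (the source does not state it explicitly), the flagged rows can be grouped into duplicate sets. The sketch below uses made-up rows in place of `duplicate_df` and keeps only the columns it needs.

```python
import pandas as pd

# Made-up rows standing in for duplicate_df; cluster ids are placeholders.
duplicate_df = pd.DataFrame(
    {
        "file_name": ["1.jpg", "2.jpg", "3.jpg", "4.jpg", "5.jpg"],
        "cluster_id": ["c1", "c1", "c2", "c2", "c3"],
    }
)

# Collect the file names that fall into each cluster.
groups = duplicate_df.groupby("cluster_id")["file_name"].apply(list)

# Keep only clusters that contain two or more flagged images.
duplicate_groups = groups[groups.apply(len) > 1].to_dict()
print(duplicate_groups)
```

Grouping this way makes it easier to review one duplicate set at a time instead of scanning a flat list of rows.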
Get Mislabels
We can filter for images with mislabel issues above a certain confidence threshold.
def has_mislabel_issue(metadata_items, confidence_threshold=0.8):
    if not isinstance(metadata_items, list):
        return False

    for item in metadata_items:
        if (
            item.get("type") == "issue"
            and item.get("properties", {}).get("issue_type") == "mislabels"
            and item.get("properties", {}).get("confidence", 0) > confidence_threshold
        ):
            return True
    return False

# Replace with your confidence threshold
CONFIDENCE_THRESHOLD = 0.8

mislabel_df = media_items_df[
    media_items_df["metadata_items"].apply(
        lambda x: has_mislabel_issue(x, confidence_threshold=CONFIDENCE_THRESHOLD)
    )
]
mislabel_df
media_id | media_type | file_name | file_path | file_size | uniqueness_score | height | width | url | cluster_id | metadata_items |
---|---|---|---|---|---|---|---|---|---|---|
41 | image | 2619752.jpg | 2619752.jpg | 47.37KB | 0.008683 | 512 | 512 | Link | f06ea82f-6df9-4224-8fda-4c87ce7ae202 | [{'type': 'issue', 'properties': {'issue_type'... |
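The duplicate and mislabel checks differ only in the issue type they compare against, so the two functions can be folded into one that takes the issue type as a parameter. The sketch below is one way to do that; `has_issue` is a hypothetical name, and the sample items are made up to mirror the `type`, `properties`, `issue_type`, and `confidence` fields read by the functions above.

```python
def has_issue(metadata_items, issue_type, confidence_threshold=0.8):
    """Return True if any issue of `issue_type` exceeds the confidence threshold."""
    if not isinstance(metadata_items, list):
        return False
    for item in metadata_items:
        props = item.get("properties", {})
        if (
            item.get("type") == "issue"
            and props.get("issue_type") == issue_type
            and props.get("confidence", 0) > confidence_threshold
        ):
            return True
    return False


# Made-up metadata_items value for a single image.
sample_items = [
    {"type": "issue", "properties": {"issue_type": "mislabels", "confidence": 0.95}},
    {"type": "issue", "properties": {"issue_type": "duplicates", "confidence": 0.5}},
]

found_mislabel = has_issue(sample_items, "mislabels")
found_duplicate = has_issue(sample_items, "duplicates")
print(found_mislabel, found_duplicate)

# With the real dataframe, the filter would look like:
# media_items_df[media_items_df["metadata_items"].apply(lambda x: has_issue(x, "mislabels"))]
```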