> ## Documentation Index
> Fetch the complete documentation index at: https://docs.visual-layer.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Analyzing Image Classification Dataset

> In this tutorial we use fastdup to analyze a labelled image classification dataset for potential issues.

# Setting Up

You can follow along this tutorial by running [this notebook on Google Colab](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analysing-image-classification-dataset.ipynb).

First, install fastdup with:

```Text bash theme={"theme":"monokai"}
pip install fastdup
```

And verify the installation.

```Text Python theme={"theme":"monokai"}
import fastdup
fastdup.__version__
```

This tutorial runs on version `0.906`.

# Download Dataset

We will be analyzing the Imagenette dataset -  A 10-class ImageNet subset from [fast.ai](https://www.fast.ai/).

Imagenette consists of 10 classes from the original ImageNet dataset. It contains 13k images in total.

Download and extract dataset:

```Text shell theme={"theme":"monokai"}
!wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz  
!tar -xzf imagenette2-160.tgz
```

Once done you should have a folder with the following structure.

```
./imagenette2-160/
├── noisy_imagenette.csv
├── train
│   ├── n01440764 
│   ├── n02102040 
│   ├── n02979186 
│   ├── n03000684 
│   ├── n03028079 
│   ├── n03394916 
│   ├── n03417042 
│   ├── n03425413 
│   ├── n03445777 
│   └── n03888257 
└── val
    ├── n01440764 
    ├── n02102040 
    ├── n02979186 
    ├── n03000684 
    ├── n03028079 
    ├── n03394916 
    ├── n03417042 
    ├── n03425413
    ├── n03445777 
    └── n03888257 
```

> 📘 Naming
>
> * `noisy_imagenette.csv` - A `.csv` files with labels.
> * `train/` - Train images.
> * `val/` - validation images.

# Load & Format Annotations

We'll use `pandas` to load and format the annotations.

```python theme={"theme":"monokai"}
import pandas as pd

data_dir = 'imagenette2-160/'
csv_path = 'imagenette2-160/noisy_imagenette.csv'
```

As ImageNet uses codes for classes, we'll map them to their human readable values for ease of analysis ([source](https://gist.githubusercontent.com/aaronpolhamus/964a4411c0906315deb9f4a3723aac57/raw/aa66dd9dbf6b56649fa3fab83659b2acbf3cbfd1/map_clsloc.txt)):

```python theme={"theme":"monokai"}
label_map = {
    'n02979186': 'cassette_player', 
    'n03417042': 'garbage_truck', 
    'n01440764': 'tench', 
    'n02102040': 'English_springer', 
    'n03028079': 'church',
    'n03888257': 'parachute', 
    'n03394916': 'French_horn', 
    'n03000684': 'chain_saw', 
    'n03445777': 'golf_ball', 
    'n03425413': 'gas_pump'
}
```

Let's load the annotations.

```python theme={"theme":"monokai"}
df_annot = pd.read_csv(csv_path)
df_annot.head(3)
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th />

      <th>path</th>
      <th>noisy\_labels\_0</th>
      <th>noisy\_labels\_1</th>
      <th>noisy\_labels\_5</th>
      <th>noisy\_labels\_25</th>
      <th>noisy\_labels\_50</th>
      <th>is\_valid</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <th>0</th>
      <td>train/n02979186/n02979186\_9036.JPEG</td>
      <td>n02979186</td>
      <td>n02979186</td>
      <td>n02979186</td>
      <td>n02979186</td>
      <td>n02979186</td>
      <td>False</td>
    </tr>

    <tr>
      <th>1</th>
      <td>train/n02979186/n02979186\_11957.JPEG</td>
      <td>n02979186</td>
      <td>n02979186</td>
      <td>n02979186</td>
      <td>n02979186</td>
      <td>n03000684</td>
      <td>False</td>
    </tr>

    <tr>
      <th>2</th>
      <td>train/n02979186/n02979186\_9715.JPEG</td>
      <td>n02979186</td>
      <td>n02979186</td>
      <td>n02979186</td>
      <td>n03417042</td>
      <td>n03000684</td>
      <td>False</td>
    </tr>
  </tbody>
</table>

Transform the annotation into a format expected by fastdup.

```python theme={"theme":"monokai"}
# take relevant columns
df_annot = df_annot[['path', 'noisy_labels_0']]

# rename columns to fastdup's column names
df_annot = df_annot.rename({'noisy_labels_0': 'label', 'path': 'img_filename'}, axis='columns')

# create split column
df_annot['split'] = df_annot['img_filename'].apply(lambda x: x.split("/")[0])

# map label ids to regular labels
df_annot['label'] = df_annot['label'].map(label_map)

# show formated annotations
df_annot.head()
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th />

      <th>img\_filename</th>
      <th>label</th>
      <th>split</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <th>0</th>
      <td>train/n02979186/n02979186\_9036.JPEG</td>
      <td>cassette\_player</td>
      <td>train</td>
    </tr>

    <tr>
      <th>1</th>
      <td>train/n02979186/n02979186\_11957.JPEG</td>
      <td>cassette\_player</td>
      <td>train</td>
    </tr>

    <tr>
      <th>2</th>
      <td>train/n02979186/n02979186\_9715.JPEG</td>
      <td>cassette\_player</td>
      <td>train</td>
    </tr>

    <tr>
      <th>3</th>
      <td>train/n02979186/n02979186\_21736.JPEG</td>
      <td>cassette\_player</td>
      <td>train</td>
    </tr>

    <tr>
      <th>4</th>
      <td>train/n02979186/ILSVRC2012\_val\_00046953.JPEG</td>
      <td>cassette\_player</td>
      <td>train</td>
    </tr>
  </tbody>
</table>

# Run fastdup

With the annotations and folders in the right format, let's run fastdup and analyze the dataset.

```python theme={"theme":"monokai"}
import fastdup
work_dir = 'fastdup_imagenette'

fd = fastdup.create(work_dir=work_dir, input_dir=data_dir) 
fd.run(annotations=df_annot, ccthreshold=0.9, threshold=0.8)
```

> 📘 Info
>
> * `ccthreshold` - Threshold to use for the graph connected component. Default is 0.96.
> * `threshold` -  Threshold to use for the graph generation. Default is 0.9.

Get a summary of the run.

```python theme={"theme":"monokai"}
fd.summary()
```

```
Dataset Analysis Summary: 

    Dataset contains 13394 images
    Valid images are 100.00% (13,394) of the data, invalid are 0.00% (0) of the data
    Similarity:  2.73% (366) belong to 20 similarity clusters (components).
    97.27% (13,028) images do not belong to any similarity cluster.
    Largest cluster has 40 (0.30%) images.
    For a detailed analysis, use `.connected_components()`
(similarity threshold used is 0.8, connected component threshold used is 0.9).

    Outliers: 6.20% (830) of images are possible outliers, and fall in the bottom 5.00% of similarity values.
    For a detailed list of outliers, use `.outliers()`.
['Dataset contains 13394 images',
 'Valid images are 100.00% (13,394) of the data, invalid are 0.00% (0) of the data',
 'Similarity:  2.73% (366) belong to 20 similarity clusters (components).',
 '97.27% (13,028) images do not belong to any similarity cluster.',
 'Largest cluster has 40 (0.30%) images.',
 'For a detailed analysis, use `.connected_components()`\n(similarity threshold used is 0.8, connected component threshold used is 0.9).\n',
 'Outliers: 6.20% (830) of images are possible outliers, and fall in the bottom 5.00% of similarity values.',
 'For a detailed list of outliers, use `.outliers()`.']
```

> 🚧 830 possible outliers
>
> From the summary there are 830 possible outliers found. Let's inspect them further!

# Outliers

Let's visualize the outliers in the dataset.

```
fd.vis.outliers_gallery()
```

> 📘 Label information
>
> Note the label information in the outliers report.

![](https://files.readme.io/f1e949e-_home_dnth_Downloads_outliers.html.png)

> 👍 Anomalies?
>
> Can you spot any anomalies from the report?
>
> **Hint**  - `parachute` and `golf ball`.

To get more details on the outliers, run:

```python theme={"theme":"monokai"}
fd.outliers().head()
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th />

      <th>index</th>
      <th>outlier</th>
      <th>nearest</th>
      <th>distance</th>
      <th>img\_filename\_outlier</th>
      <th>label\_outlier</th>
      <th>split\_outlier</th>
      <th>error\_code\_outlier</th>
      <th>is\_valid\_outlier</th>
      <th>img\_filename\_nearest</th>
      <th>label\_nearest</th>
      <th>split\_nearest</th>
      <th>error\_code\_nearest</th>
      <th>is\_valid\_nearest</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <th>0</th>
      <td>1338</td>
      <td>12009</td>
      <td>1757</td>
      <td>0.469904</td>
      <td>val/n03417042/n03417042\_29412.JPEG</td>
      <td>garbage\_truck</td>
      <td>val</td>
      <td>VALID</td>
      <td>True</td>
      <td>train/n02102040/n02102040\_7256.JPEG</td>
      <td>English\_springer</td>
      <td>train</td>
      <td>VALID</td>
      <td>True</td>
    </tr>

    <tr>
      <th>1</th>
      <td>1336</td>
      <td>2664</td>
      <td>9763</td>
      <td>0.476124</td>
      <td>train/n02979186/n02979186\_3967.JPEG</td>
      <td>cassette\_player</td>
      <td>train</td>
      <td>VALID</td>
      <td>True</td>
      <td>val/n01440764/n01440764\_710.JPEG</td>
      <td>tench</td>
      <td>val</td>
      <td>VALID</td>
      <td>True</td>
    </tr>

    <tr>
      <th>2</th>
      <td>1335</td>
      <td>2727</td>
      <td>1571</td>
      <td>0.476313</td>
      <td>train/n02979186/n02979186\_5424.JPEG</td>
      <td>cassette\_player</td>
      <td>train</td>
      <td>VALID</td>
      <td>True</td>
      <td>train/n02102040/n02102040\_536.JPEG</td>
      <td>English\_springer</td>
      <td>train</td>
      <td>VALID</td>
      <td>True</td>
    </tr>

    <tr>
      <th>3</th>
      <td>1333</td>
      <td>12172</td>
      <td>1817</td>
      <td>0.479290</td>
      <td>val/n03417042/n03417042\_91.JPEG</td>
      <td>garbage\_truck</td>
      <td>val</td>
      <td>VALID</td>
      <td>True</td>
      <td>train/n02102040/n02102040\_7868.JPEG</td>
      <td>English\_springer</td>
      <td>train</td>
      <td>VALID</td>
      <td>True</td>
    </tr>

    <tr>
      <th>4</th>
      <td>1332</td>
      <td>1981</td>
      <td>10098</td>
      <td>0.479516</td>
      <td>train/n02979186/n02979186\_10387.JPEG</td>
      <td>cassette\_player</td>
      <td>train</td>
      <td>VALID</td>
      <td>True</td>
      <td>val/n02102040/n02102040\_5272.JPEG</td>
      <td>English\_springer</td>
      <td>val</td>
      <td>VALID</td>
      <td>True</td>
    </tr>
  </tbody>
</table>

# Similarity Gallery

Other than the outliers, we can find possible mislabels by comparing a query image to other images in the dataset.

This can be done with:

```python theme={"theme":"monokai"}
fd.vis.similarity_gallery() 
```

![](https://files.readme.io/215de29-_home_dnth_Downloads_similarity.html.png)

To get a detailed information

```
fd.similarity().head(5)
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th />

      <th>from</th>
      <th>to</th>
      <th>distance</th>
      <th>img\_filename\_from</th>
      <th>label\_from</th>
      <th>split\_from</th>
      <th>error\_code\_from</th>
      <th>is\_valid\_from</th>
      <th>img\_filename\_to</th>
      <th>label\_to</th>
      <th>split\_to</th>
      <th>error\_code\_to</th>
      <th>is\_valid\_to</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <th>0</th>
      <td>11521</td>
      <td>5390</td>
      <td>0.968786</td>
      <td>val/n03394916/n03394916\_30631.JPEG</td>
      <td>French\_horn</td>
      <td>val</td>
      <td>VALID</td>
      <td>True</td>
      <td>train/n03394916/n03394916\_44127.JPEG</td>
      <td>French\_horn</td>
      <td>train</td>
      <td>VALID</td>
      <td>True</td>
    </tr>

    <tr>
      <th>1</th>
      <td>5390</td>
      <td>11521</td>
      <td>0.968786</td>
      <td>train/n03394916/n03394916\_44127.JPEG</td>
      <td>French\_horn</td>
      <td>train</td>
      <td>VALID</td>
      <td>True</td>
      <td>val/n03394916/n03394916\_30631.JPEG</td>
      <td>French\_horn</td>
      <td>val</td>
      <td>VALID</td>
      <td>True</td>
    </tr>

    <tr>
      <th>2</th>
      <td>12914</td>
      <td>7715</td>
      <td>0.962458</td>
      <td>val/n03445777/n03445777\_6882.JPEG</td>
      <td>golf\_ball</td>
      <td>val</td>
      <td>VALID</td>
      <td>True</td>
      <td>train/n03445777/n03445777\_13918.JPEG</td>
      <td>golf\_ball</td>
      <td>train</td>
      <td>VALID</td>
      <td>True</td>
    </tr>

    <tr>
      <th>3</th>
      <td>7715</td>
      <td>12914</td>
      <td>0.962458</td>
      <td>train/n03445777/n03445777\_13918.JPEG</td>
      <td>golf\_ball</td>
      <td>train</td>
      <td>VALID</td>
      <td>True</td>
      <td>val/n03445777/n03445777\_6882.JPEG</td>
      <td>golf\_ball</td>
      <td>val</td>
      <td>VALID</td>
      <td>True</td>
    </tr>

    <tr>
      <th>4</th>
      <td>1117</td>
      <td>1404</td>
      <td>0.953837</td>
      <td>train/n02102040/n02102040\_1564.JPEG</td>
      <td>English\_springer</td>
      <td>train</td>
      <td>VALID</td>
      <td>True</td>
      <td>train/n02102040/n02102040\_3837.JPEG</td>
      <td>English\_springer</td>
      <td>train</td>
      <td>VALID</td>
      <td>True</td>
    </tr>
  </tbody>
</table>

# Duplicates

Let's also check for duplicate pairs.

```python theme={"theme":"monokai"}
fd.vis.duplicates_gallery()
```

![](https://files.readme.io/4e946ba-_home_dnth_Downloads_duplicates.html.png)

# Image Clusters

```python theme={"theme":"monokai"}
fd.vis.component_gallery(num_images=5)
```

![](https://files.readme.io/71fc93a-_home_dnth_Downloads_components.html.png)

You can also visualize clusters with specific labels using the `slice` parameter. For example let's visualize clusters with the `chain_saw` label.

```
fd.vis.component_gallery(slice='chain_saw')
```

![](https://files.readme.io/bb2491b-_home_dnth_Downloads_components.html.png)

# Summary

> 👍 We now added several important label-specific features to run on top of the raw dataset
>
> Building over the image dataset analysis conducted without labels in the [Cleaning and preparing a dataset](#abc) tutorial, we've seen how to further slice the data and visualize specific classes of interest.
