Exploring Hugging Face: Image Classification

Image Classification Task

Okan Yenigün

Published in Dev Genius · 3 min read · May 6, 2024

The image classification task involves assigning a label or class to an image based on its visual content.

Let’s try a model. I will use a mountain image as input.

from transformers import pipeline
clf = pipeline("image-classification")
clf("mountain.jpg")

"""
[{'label': 'alp', 'score': 0.8358551263809204},
 {'label': 'valley, vale', 'score': 0.14238341152668},
 {'label': 'mountain tent', 'score': 0.006402834318578243},
 {'label': 'volcano', 'score': 0.00502895750105381},
 {'label': 'lakeside, lakeshore', 'score': 0.0014874241314828396}]
"""

Here, the pipeline function is called with the argument "image-classification", which sets up a pipeline for classifying images. Because no model name is given, the library falls back to a default checkpoint for this task; a specific one can be selected with the model argument. The function returns a pre-configured pipeline object (clf) that is ready to classify images.

The output is a list of dictionaries, ordered from highest to lowest score. Each one pairs a candidate label for the image with the model’s confidence (score) in that label, and contains:

label: A string representing the class label assigned to the image.
score: A float representing the confidence score of the prediction. This value lies between 0 and 1, with higher values indicating greater confidence.

The results show that the model is most confident that the image is of an “alp” with about 83.6% confidence. Other possible classifications are provided with lower confidence scores, such as “valley, vale”, “mountain tent”, “volcano”, and “lakeside, lakeshore”.
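
Since the result is a plain Python list of dictionaries, it can be post-processed without any extra library. The short sketch below reuses the scores printed above. The variable name results and the 1% threshold are illustrative choices, not part of the original example.

```python
# Output of clf("mountain.jpg") as shown above, copied into a variable.
results = [
    {'label': 'alp', 'score': 0.8358551263809204},
    {'label': 'valley, vale', 'score': 0.14238341152668},
    {'label': 'mountain tent', 'score': 0.006402834318578243},
    {'label': 'volcano', 'score': 0.00502895750105381},
    {'label': 'lakeside, lakeshore', 'score': 0.0014874241314828396},
]

# Pick the entry with the highest score.
best = max(results, key=lambda r: r['score'])
print(f"{best['label']}: {best['score']:.1%}")  # alp: 83.6%

# Keep only labels above an arbitrary (hypothetical) threshold of 1%.
THRESHOLD = 0.01
confident = [r['label'] for r in results if r['score'] >= THRESHOLD]
print(confident)  # ['alp', 'valley, vale']
```

A threshold like this is a simple way to drop near-zero candidates such as “volcano” before showing results to a user.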

Let’s try another model with a different input:

from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

"""
Predicted class: Egyptian cat
"""

ViTImageProcessor and ViTForImageClassification are classes from the transformers library specifically designed for preprocessing images and classifying them using the Vision Transformer model.

ViTImageProcessor.from_pretrained initializes an image processor with the necessary configuration for the specified Vision Transformer model ('google/vit-base-patch16-224'). This processor handles tasks like resizing, normalizing, and formatting the image to match the input requirements of the model.
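
To make the rescaling and normalization steps concrete, here is a minimal pure-Python sketch on a tiny made-up image. It does not call the transformers library. The per-channel mean and standard deviation of 0.5 are assumed values for illustration, the helper name normalize_image is hypothetical, and resizing is left out to keep the example short.

```python
# A made-up 2x2 RGB image: rows of (R, G, B) pixels with values in 0..255.
image = [
    [(0, 128, 255), (255, 255, 255)],
    [(64, 64, 64), (0, 0, 0)],
]

# Assumed per-channel statistics, for illustration only.
MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)

def normalize_image(img, mean=MEAN, std=STD):
    """Rescale pixels to [0, 1], normalize, and return channels-first lists."""
    channels = []
    for c in range(3):
        plane = [
            [((pixel[c] / 255.0) - mean[c]) / std[c] for pixel in row]
            for row in img
        ]
        channels.append(plane)
    return channels  # layout: [channel][height][width]

pixel_values = normalize_image(image)

# Three channels, each 2 rows by 2 columns.
print(len(pixel_values), len(pixel_values[0]), len(pixel_values[0][0]))  # 3 2 2
print(pixel_values[0][0])  # red channel, first row: [-1.0, 1.0]
```

With these assumed statistics, pixel values 0 and 255 map to -1.0 and 1.0, so the whole image ends up in the range [-1, 1]. The real processor additionally returns a batched tensor rather than nested lists.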

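ViTForImageClassification.from_pretrained loads the Vision Transformer weights together with a classification head. According to its model card, this checkpoint was pre-trained on ImageNet-21k and fine-tuned on the 1,000-class ImageNet dataset, which is why the model predicts one of 1,000 labels.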

Another model, this time loaded through the aim package rather than transformers:

from PIL import Image
import torch
from aim.utils import load_pretrained
from aim.torch.data import val_transforms

# Load your image here; replace 'mountain.jpg' with the path to your image file
img = Image.open('mountain.jpg')

# Load the pretrained model
model = load_pretrained("aim-600M-2B-imgs", backend="torch")

# Get the validation (inference-time) preprocessing transforms provided by the aim package
transform = val_transforms()

# Transform the image and add batch dimension
inp = transform(img).unsqueeze(0)

# Pass the image through the model to get logits
logits, _ = model(inp)

# Apply softmax to convert logits to probabilities
probabilities = torch.softmax(logits, dim=1)

# Get the predicted class index
predicted_class = torch.argmax(probabilities, dim=1)

# Print the predicted class index
print("Predicted class index:", predicted_class.item())

"""
Predicted class index: 793
"""
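
The softmax step in the code above turns logits into probabilities, but it does not change which class wins, because softmax preserves the ordering of its inputs. The sketch below illustrates this in plain Python, without torch. The logits are hypothetical numbers for a 4-class problem, not real model output.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-class problem (not real model output).
logits = [2.0, 0.5, -1.0, 3.5]
probs = softmax(logits)

# Softmax preserves ordering, so both argmax computations agree.
argmax_logits = max(range(len(logits)), key=lambda i: logits[i])
argmax_probs = max(range(len(probs)), key=lambda i: probs[i])
print(argmax_logits, argmax_probs)  # 3 3

# Top-2 (index, probability) pairs, highest probability first.
top2 = sorted(enumerate(probs), key=lambda p: p[1], reverse=True)[:2]
print([index for index, _ in top2])  # [3, 0]
```

So softmax is only needed when the probabilities themselves are of interest, for example to report several candidates with scores the way the pipeline does. For the predicted index alone, an argmax over the logits is enough.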