Exploring Hugging Face: Image Classification
Image Classification Task
The image classification task involves assigning a label or class to an image based on its visual content.
Let’s try a model. I will use a mountain image as input.
from transformers import pipeline
clf = pipeline("image-classification")
clf("mountain.jpg")
"""
[{'label': 'alp', 'score': 0.8358551263809204},
{'label': 'valley, vale', 'score': 0.14238341152668},
{'label': 'mountain tent', 'score': 0.006402834318578243},
{'label': 'volcano', 'score': 0.00502895750105381},
{'label': 'lakeside, lakeshore', 'score': 0.0014874241314828396}]
"""
Here, the pipeline function is called with the argument "image-classification", which specifies that we want to set up a pipeline for classifying images. Because no model is named, the library falls back to a default checkpoint for this task. The function returns a pre-configured pipeline object (clf), which is ready to be used for image classification.
The output is a list of dictionaries, each representing a potential label for the image and the model’s confidence (score) in that label. Each dictionary contains:
- label: A string representing the class label assigned to the image.
- score: A float representing the confidence score of the prediction. This value lies between 0 and 1, with higher values indicating greater confidence.
The results show that the model is most confident that the image is of an “alp” with about 83.6% confidence. Other possible classifications are provided with lower confidence scores, such as “valley, vale”, “mountain tent”, “volcano”, and “lakeside, lakeshore”.
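Because the pipeline returns a plain list of dictionaries, it can be post-processed without any extra libraries. The sketch below assumes the output shown above has been stored in a variable named results, and it introduces a hypothetical helper confident_labels with an arbitrary min_score threshold; neither name comes from the transformers library.

```python
# Output copied from the pipeline call above; `results` is an illustrative name.
results = [
    {'label': 'alp', 'score': 0.8358551263809204},
    {'label': 'valley, vale', 'score': 0.14238341152668},
    {'label': 'mountain tent', 'score': 0.006402834318578243},
    {'label': 'volcano', 'score': 0.00502895750105381},
    {'label': 'lakeside, lakeshore', 'score': 0.0014874241314828396},
]

def confident_labels(predictions, min_score=0.1):
    """Return (label, score) pairs whose score is at least min_score, best first."""
    kept = [(p['label'], p['score']) for p in predictions if p['score'] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Print the surviving labels with their scores as percentages.
for label, score in confident_labels(results):
    print(f"{label}: {score:.1%}")
# alp: 83.6%
# valley, vale: 14.2%
```

The 0.1 threshold is only an example; a sensible cut-off depends on how costly a wrong label is in your application.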
Let’s try another model with a different input:
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
"""
Predicted class: Egyptian cat
"""
ViTImageProcessor and ViTForImageClassification are classes from the transformers library specifically designed for preprocessing images and classifying them using the Vision Transformer model.
ViTImageProcessor.from_pretrained initializes an image processor with the necessary configuration for the specified Vision Transformer model ('google/vit-base-patch16-224'). This processor handles tasks like resizing, normalizing, and formatting the image to match the input requirements of the model.
ViTForImageClassification.from_pretrained loads the Vision Transformer model pre-trained on ImageNet, prepared for image classification.
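The ViT example above reports only the single best class. To see how raw logits relate to a ranked list with probabilities, here is a minimal pure-Python sketch of softmax followed by a top-k selection. The helpers softmax and top_k, and the toy logits and labels, are illustrative assumptions rather than part of transformers.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Toy values for illustration only; the real ViT head emits 1000 logits.
toy_logits = [2.0, 0.5, -1.0, 0.1]
toy_id2label = {0: "cat", 1: "dog", 2: "car", 3: "tree"}

for label, prob in top_k(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, a similar ranking could likely be obtained by applying torch.softmax(logits, dim=-1) and then .topk(k) to the logits tensor, looking up each index in model.config.id2label.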
Another model is AIM from Apple, which is loaded through its own aim package rather than through transformers:
from PIL import Image
import torch
from aim.utils import load_pretrained
from aim.torch.data import val_transforms
# Load the input image; replace 'mountain.jpg' with the path to your image file
img = Image.open('mountain.jpg')
# Load the pretrained model
model = load_pretrained("aim-600M-2B-imgs", backend="torch")
# Get the validation-time preprocessing transforms expected by the model
transform = val_transforms()
# Transform the image and add batch dimension
inp = transform(img).unsqueeze(0)
# Pass the image through the model to get logits
logits, _ = model(inp)
# Apply softmax to convert logits to probabilities
probabilities = torch.softmax(logits, dim=1)
# Get the predicted class index
predicted_class = torch.argmax(probabilities, dim=1)
# Print the predicted class index
print("Predicted class index:", predicted_class.item())
"""
Predicted class index: 793
"""
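A bare index such as 793 is hard to interpret without a mapping from indices to class names. The sketch below shows one way to do that lookup with a hypothetical helper label_for_index; the mapping in it is a placeholder, not real ImageNet labels. If AIM follows the standard ImageNet-1k class ordering (an assumption worth verifying), an existing mapping such as model.config.id2label from the ViT example could be passed in instead.

```python
def label_for_index(class_idx, id2label):
    """Look up a class name, falling back to a readable placeholder."""
    return id2label.get(class_idx, f"<unknown class {class_idx}>")

# Placeholder mapping for illustration; these are NOT real ImageNet labels.
example_id2label = {0: "first class", 1: "second class"}

print(label_for_index(1, example_id2label))    # second class
print(label_for_index(793, example_id2label))  # <unknown class 793>
```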
Sources
https://huggingface.co/models?pipeline_tag=image-classification&sort=downloads
https://huggingface.co/apple/AIM