Exploring Hugging Face: Text or Image-to-Video

Exploring Text and Image-to-Video AI

Okan Yenigün
Dev Genius


Photo by Jakob Owens on Unsplash

The Text or Image-to-Video task in Hugging Face involves the generation of videos from either textual descriptions or images.

For the Text-to-Video aspect, the process involves converting textual descriptions into video content. This can include generating scenes, animations, or complete videos based on the provided text. For example, given a story or script, the model can create a video that visually represents the narrative.
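To make the text-to-video side concrete, here is a minimal sketch using diffusers. It assumes the damo-vilab/text-to-video-ms-1.7b checkpoint, a GPU runtime, and a recent diffusers release; the prompt is only an example.

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load a text-to-video diffusion pipeline in FP16 to save GPU memory
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()

prompt = "A panda playing guitar on a wooden stage"  # example prompt
frames = pipe(prompt, num_inference_steps=25).frames[0]
export_to_video(frames, "text2video.mp4")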

Text-to-Video. Source

In the Image-to-Video aspect, the task revolves around generating a video from a still reference image, optionally guided by a text prompt. The model animates the image, adding motion and effects while preserving its content, which can be used for animations or short clips derived from a single picture.
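As a quick illustration of image-only input, here is a minimal sketch with Stable Video Diffusion (SVD), which the model below is compared against. It assumes the stabilityai/stable-video-diffusion-img2vid-xt checkpoint and a recent diffusers release; the image path is a placeholder.

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the SVD image-to-video pipeline in FP16
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()

# Replace with your own image; SVD expects roughly 1024x576 inputs
image = load_image("your_image.png").resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "svd_clip.mp4", fps=7)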

I will use a model from Hugging Face to generate a video from a reference image guided by a text prompt. I recommend using Google Colab with a GPU runtime for this demonstration.

!pip install torch torchvision
!pip install diffusers
!pip install accelerate

To access files in your Google Drive, mount it:

from google.colab import drive
drive.mount('/content/drive')

To use a Hugging Face model in Colab, define a secret named HF_TOKEN in the Secrets panel and set your Hugging Face access token as its value.
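The token can then be read at runtime and used to authenticate with the Hub. A small sketch, assuming the secret is named HF_TOKEN and notebook access is enabled for it:

from google.colab import userdata
from huggingface_hub import login

# Read the HF_TOKEN secret from Colab's Secrets panel and log in to the Hub
login(token=userdata.get("HF_TOKEN"))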

I2VGen-XL is a diffusion model designed to produce videos at higher resolutions than Stable Video Diffusion (SVD). It goes beyond image-only input by also accepting text prompts. The model employs two hierarchical encoders, a detail encoder and a global encoder, to capture both fine-grained and high-level features of the input image. These features condition a dedicated video diffusion model, improving quality and detail in the generated video.

Reference Image. Source
import torch
from diffusers import I2VGenXLPipeline
from diffusers.utils import export_to_gif, load_image

pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16")
pipeline.enable_model_cpu_offload()

image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png"
image = load_image(image_url).convert("RGB")

prompt = "Papers were floating in the air on a table in the library"
negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms"
generator = torch.manual_seed(8888)

frames = pipeline(
    prompt=prompt,
    image=image,
    num_inference_steps=50,
    negative_prompt=negative_prompt,
    guidance_scale=9.0,
    generator=generator,
).frames[0]

An I2VGenXLPipeline instance is created from the pretrained "ali-vilab/i2vgen-xl" checkpoint. Setting torch_dtype=torch.float16 loads the weights in 16-bit floating point (FP16), which reduces memory usage and can speed up computation, particularly on GPUs that support FP16.

The enable_model_cpu_offload() method offloads parts of the model to the CPU when they are not in use, which helps manage GPU memory more efficiently.

export_to_gif(frames, "/content/drive/My Drive/Colab Notebooks/i2v.gif")

from IPython.display import Image
display(Image(filename="/content/drive/My Drive/Colab Notebooks/i2v.gif"))
Generated video.


Sources

https://huggingface.co/tasks/text-to-video

https://huggingface.co/docs/diffusers/main/en/using-diffusers/text-img2vid
