Exploring Hugging Face: Text or Image-to-Video

Exploring Text and Image-to-Video AI

Okan Yenigün
Dev Genius


Photo by Jakob Owens on Unsplash

The Text or Image-to-Video task in Hugging Face involves the generation of videos from either textual descriptions or images.

For the Text-to-Video aspect, the process involves converting textual descriptions into video content. This can include generating scenes, animations, or complete videos based on the provided text. For example, given a story or script, the model can create a video that visually represents the narrative.
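To make the text-to-video side concrete, here is a minimal sketch using diffusers. It assumes the damo-vilab/text-to-video-ms-1.7b checkpoint, a GPU runtime, and a recent diffusers release; the prompt is only an example.

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load a text-to-video diffusion pipeline in FP16 to save GPU memory
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()

prompt = "A panda playing guitar on a wooden stage"  # example prompt
frames = pipe(prompt, num_inference_steps=25).frames[0]
export_to_video(frames, "text2video.mp4")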

Text-to-Video. Source

In the Image-to-Video aspect, the task revolves around generating a video from a still reference image, optionally guided by a text prompt. The model animates the image, adding motion and effects while preserving its content, which can be used for animations or short clips derived from a single picture.
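As a quick illustration of image-only input, here is a minimal sketch with Stable Video Diffusion (SVD), which the model below is compared against. It assumes the stabilityai/stable-video-diffusion-img2vid-xt checkpoint and a recent diffusers release; the image path is a placeholder.

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the SVD image-to-video pipeline in FP16
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()

# Replace with your own image; SVD expects roughly 1024x576 inputs
image = load_image("your_image.png").resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "svd_clip.mp4", fps=7)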

I will use a model from Hugging Face to generate a video from a reference image guided by a text prompt. I recommend using Google Colab with a GPU runtime for this demonstration.

!pip install torch torchvision
!pip install diffusers
!pip install accelerate

To access files in your Google Drive, mount it:

from google.colab import drive
drive.mount('/content/drive')

To use a Hugging Face model in Colab, define a secret named HF_TOKEN in the Secrets panel and set your Hugging Face access token as its value.
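The token can then be read at runtime and used to authenticate with the Hub. A small sketch, assuming the secret is named HF_TOKEN and notebook access is enabled for it:

from google.colab import userdata
from huggingface_hub import login

# Read the HF_TOKEN secret from Colab's Secrets panel and log in to the Hub
login(token=userdata.get("HF_TOKEN"))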

I2VGen-XL is a diffusion model designed to produce videos at higher resolutions than Stable Video Diffusion (SVD). It goes beyond image-only input by also accepting text prompts. The model employs two hierarchical encoders, a detail encoder and a global encoder, to capture both fine-grained and high-level features of the input image. These features condition a dedicated video diffusion model, improving quality and detail in the generated video.

Reference Image. Source
import torch
from diffusers import I2VGenXLPipeline
from diffusers.utils import export_to_gif, load_image

pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16")
pipeline.enable_model_cpu_offload()

image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png"
image = load_image(image_url).convert("RGB")

prompt = "Papers were floating in the air on a table in the library"
negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms"
generator = torch.manual_seed(8888)

frames = pipeline(
    prompt=prompt,
    image=image,
    num_inference_steps=50,
    negative_prompt=negative_prompt,
    guidance_scale=9.0,
    generator=generator,
).frames[0]

An I2VGenXLPipeline instance is created from the pretrained "ali-vilab/i2vgen-xl" checkpoint. Setting torch_dtype=torch.float16 loads the weights in 16-bit floating point (FP16), which reduces memory usage and can speed up computation, particularly on GPUs that support FP16.

The enable_model_cpu_offload() method offloads parts of the model to the CPU when they are not in use, which helps manage GPU memory more efficiently.

export_to_gif(frames, "/content/drive/My Drive/Colab Notebooks/i2v.gif")

from IPython.display import Image
display(Image(filename="/content/drive/My Drive/Colab Notebooks/i2v.gif"))
Generated video.


Sources

https://huggingface.co/tasks/text-to-video

https://huggingface.co/docs/diffusers/main/en/using-diffusers/text-img2vid
