VideoDirectorGPT: Consistent Multi-Scene
Video Generation via LLM-Guided Planning

UNC Chapel Hill

Abstract

Although recent text-to-video (T2V) generation methods have seen significant advancements, the majority of these works focus on producing short video clips of a single event with a single background (i.e., single-scene videos). Meanwhile, recent large language models (LLMs) have demonstrated their capability in generating layouts and programs to control downstream visual modules such as image generation models. This raises an important question: can we leverage the knowledge embedded in these LLMs for temporally consistent long video generation?

In this paper, we propose VideoDirectorGPT, a novel framework for consistent multi-scene video generation that uses the knowledge of LLMs for video content planning and grounded video generation. Specifically, given a single text prompt, we first ask our video planner LLM (GPT-4) to expand it into a 'video plan', which involves generating the scene descriptions, the entities with their respective layouts, the background for each scene, and consistency groupings of the entities and backgrounds. Next, guided by this output from the video planner, our video generator, named Layout2Vid, has explicit control over spatial layouts and can maintain temporal consistency of entities/backgrounds across multiple scenes, while being trained only with image-level annotations. Our experiments demonstrate that VideoDirectorGPT substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with visual consistency across scenes, while achieving performance competitive with state-of-the-art methods in open-domain single-scene text-to-video generation. We also demonstrate that our framework can dynamically control the strength of layout guidance and can generate videos with user-provided images.

We hope our framework can inspire future work on better integrating the planning ability of LLMs into consistent long video generation.

Summary

Figure 1: Illustration of our two-stage framework for long, multi-scene video generation from text. In the first stage, we employ an LLM as a video planner to craft a video plan, which provides an overarching plot for multi-scene videos and guides the downstream video generation process. The video plan consists of scene-level text descriptions, a list of the entities and the background involved in each scene, frame-by-frame entity layouts (bounding boxes), and consistency groupings for entities and backgrounds. In the second stage, we utilize Layout2Vid, a grounded video generation module, to render videos based on the video plan generated in the first stage. This module uses the same image and text embeddings to represent identical entities and backgrounds from the video plan, and allows for spatial control over entity layouts through the Guided 2D Attention in the spatial attention block.


Video Planning: Generating Video Plans with LLMs

As illustrated in the blue part of Figure 1, GPT-4 (OpenAI, 2023) acts as a planner and provides a detailed video plan from a single text prompt to guide the downstream video generation. Our video plan consists of four components: (1) multi-scene descriptions: a sentence describing each scene, (2) entities: names along with their 2D bounding boxes, (3) background: text description of the location of each scene, and (4) consistency groupings: scene indices for each entity/background indicating where they should remain visually consistent.
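To make this structure concrete, the sketch below shows one possible representation of a video plan as a Python dictionary. The field names and the chef/oven example values are illustrative assumptions for this page, not the exact schema produced by the GPT-4 planner.

    # Illustrative video plan (schema and values are examples, not the verbatim planner output).
    video_plan = {
        "scenes": [
            {
                "description": "a chef preheats the oven in a bright kitchen",
                "background": "bright kitchen",
                "entities": ["chef", "oven"],
                # one layout per frame: entity name -> [x0, y0, x1, y1], normalized to [0, 1]
                "layouts": [
                    {"chef": [0.10, 0.20, 0.45, 0.95], "oven": [0.55, 0.40, 0.95, 0.95]},
                    # ... 8 planned frames per scene
                ],
            },
            # ... remaining scenes
        ],
        # consistency groupings: name -> scene indices where the entity/background appears
        "entity_groups": {"chef": [1, 2, 3, 4], "oven": [1]},
        "background_groups": {"bright kitchen": [1]},
    }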

In the first step, we use GPT-4 to expand a single text prompt into a multi-scene video plan. Each scene comes with a text description, a list of entities (names and their 2D bounding boxes), and a background. For this step, we construct the input prompt from the task instruction, one in-context example, and the input text from which we aim to generate a video plan. Subsequently, we group entities and backgrounds that appear across different scenes via exact name matching. For instance, if 'chef' appears in scenes 1-4 and 'oven' appears only in scene 1, we form the entity consistency groupings as {chef: [1,2,3,4], oven: [1]}. In the subsequent video generation stage, we use shared representations for each entity/background consistency group so that its appearance remains temporally consistent.
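As a concrete illustration of this exact-match grouping, the helper below builds consistency groupings from per-scene entity lists and reproduces the chef/oven example above; the function and variable names are ours for illustration, not part of the released code.

    from collections import defaultdict

    def build_consistency_groups(scene_entities):
        """Group identical entity names across scenes by exact string match.
        scene_entities[i] lists the entity names in scene i+1 (scenes are 1-indexed)."""
        groups = defaultdict(list)
        for scene_idx, entities in enumerate(scene_entities, start=1):
            for name in entities:
                groups[name].append(scene_idx)
        return dict(groups)

    scenes = [["chef", "oven"], ["chef"], ["chef"], ["chef"]]
    print(build_consistency_groups(scenes))  # {'chef': [1, 2, 3, 4], 'oven': [1]}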

In the second step, we use GPT-4 to expand each scene into detailed frame-level layouts. Based on the scene description and its list of entities, we generate a bounding box for each entity in each frame. For each scene, we produce layouts for 8 frames and then linearly interpolate the bounding boxes to obtain denser frame coverage (e.g., 16 frames). Bounding boxes use the [x0, y0, x1, y1] format, with each coordinate normalized to the range [0, 1]. In the in-context examples, we present 0.05 as the minimum coordinate unit, equivalent to a 20-bin quantization over [0, 1].
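The snippet below sketches how 8 planned keyframe boxes could be linearly interpolated to a denser set of frames, together with the 0.05-unit rounding used when presenting boxes in the in-context examples; the helper names are ours and the actual implementation may differ.

    import numpy as np

    def interpolate_layouts(keyframe_boxes, target_frames=16):
        """Linearly interpolate [x0, y0, x1, y1] boxes (normalized to [0, 1])
        from sparse keyframes (8 per scene) to a denser set of frames."""
        boxes = np.asarray(keyframe_boxes, dtype=float)        # (num_keyframes, 4)
        src = np.linspace(0.0, 1.0, num=len(boxes))
        dst = np.linspace(0.0, 1.0, num=target_frames)
        return np.stack([np.interp(dst, src, boxes[:, c]) for c in range(4)], axis=1)

    def quantize(box, unit=0.05):
        """Round coordinates to the 0.05 grid (20-bin quantization over [0, 1])."""
        return [round(round(v / unit) * unit, 2) for v in box]

    # A box sliding to the right across 8 planned keyframes:
    keyframes = [[0.1 + 0.05 * i, 0.2, 0.45 + 0.05 * i, 0.95] for i in range(8)]
    dense = interpolate_layouts(keyframes, target_frames=16)   # shape (16, 4)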



Video Generation: Generating Videos from Video Plans with Layout2Vid

Figure 2: Overview of (a) spatio-temporal blocks within the diffusion UNet of our Layout2Vid and (b) Guided 2D Attention present in the spatial attention module. (a) The spatio-temporal block comprises four modules: spatial convolution, temporal convolution, spatial attention, and temporal attention. In (b) Guided 2D Attention, we modulate the visual representation with layout tokens and text tokens. For efficient training, only the parameters of the Guided 2D Attention (indicated by the fire symbol, constituting 13% of total parameters) are trained using image-level annotations. The remaining modules in the spatio-temporal block are kept frozen.


Our Layout2Vid module enables layout-guided video generation with explicit spatial control over a list of entities. Each entity is represented by its bounding box together with visual and text content. As depicted in Fig. 2, we build upon the 2D attention mechanism within the spatial attention module of the spatio-temporal blocks in the diffusion UNet to create the Guided 2D Attention. The Guided 2D Attention takes two conditional inputs to modulate the visual latent representation: (a) layout tokens, conditioned via gated self-attention, and (b) text tokens describing the current scene, conditioned via cross-attention. Note that we train Layout2Vid in a parameter- and data-efficient manner by updating only the Guided 2D Attention parameters (all other parameters remain frozen), using image-level annotations only (no video-level annotations).
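For intuition, here is a minimal PyTorch-style sketch of how Guided 2D Attention could combine the two conditioning paths; the module layout, dimensions, and gating details are our assumptions (in the spirit of GLIGEN-style gated self-attention), not the released implementation. During training, only a module like this would receive gradient updates while the rest of the UNet stays frozen.

    import torch
    import torch.nn as nn

    class GuidedAttention2D(nn.Module):
        """Sketch: visual tokens are modulated by (a) layout tokens via gated
        self-attention and (b) scene text tokens via cross-attention."""

        def __init__(self, dim, num_heads=8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.gate = nn.Parameter(torch.zeros(1))  # near-zero init: grounding starts as a no-op
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, visual, layout_tokens, text_tokens):
            # (a) Gated self-attention over the concatenation of visual and layout tokens.
            n_vis = visual.shape[1]
            joint = torch.cat([self.norm1(visual), layout_tokens], dim=1)
            attn_out, _ = self.self_attn(joint, joint, joint)
            visual = visual + torch.tanh(self.gate) * attn_out[:, :n_vis]
            # (b) Cross-attention from visual tokens to the scene-level text tokens.
            attn_out, _ = self.cross_attn(self.norm2(visual), text_tokens, text_tokens)
            return visual + attn_out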

To preserve the identity of entities appearing across different frames and scenes, we use shared representations for entities within the same consistency group. While previous layout-guided text-to-image generation models commonly used only the CLIP text embedding for layout control, we use the CLIP image embedding in addition to the CLIP text embedding for entity grounding.
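The snippet below sketches one way such shared grounding tokens could be built from CLIP text and image embeddings plus the box coordinates, using Hugging Face transformers; the fusion MLP and its dimensions are illustrative assumptions. Reusing the same (text, image) pair for every scene in an entity's consistency group is what keeps the entity's appearance consistent.

    import torch
    import torch.nn as nn
    from transformers import CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    # Hypothetical fusion of (text embedding, image embedding, box) into one layout token.
    token_mlp = nn.Sequential(nn.Linear(768 + 768 + 4, 1024), nn.SiLU(), nn.Linear(1024, 1024))

    def entity_layout_token(name, image, box):
        """Build a grounding token; the same (name, image) pair is reused for every
        scene in the entity's consistency group."""
        text_emb = clip.get_text_features(**processor(text=[name], return_tensors="pt"))
        image_emb = clip.get_image_features(**processor(images=image, return_tensors="pt"))
        box = torch.tensor([box], dtype=torch.float32)                   # (1, 4)
        return token_mlp(torch.cat([text_emb, image_emb, box], dim=-1))  # (1, 1024)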

Generated Examples

We conduct experiments on both single-scene and multi-scene video generation. For single-scene video generation, we evaluate layout control via VPEval Skill-based prompts, assess object dynamics through ActionBench-Direction prompts adapted from ActionBench-SSV2, and examine open-domain video generation using the MSR-VTT dataset. For multi-scene video generation, we experiment with two types of input prompts: (1) a list of sentences describing events — ActivityNet Captions and Coref-SV prompts based on Pororo-SV, and (2) a single sentence from which models generate multi-scene videos — HiREST.

In addition, we show videos generated with custom entities specified either by text only (with Karlo providing the corresponding image embeddings) or by user-provided images together with text.

Coref-SV

Scene 1: mouse is holding a book and makes a happy face.
Scene 2: he looks happy and talks.
Scene 3: he is pulling petals off the flower.
Scene 4: he is ripping a petal from the flower.
Scene 5: he is holding a flower by his right paw.
Scene 6: one paw pulls the last petal off the flower.
Scene 7: he is smiling and talking while holding a flower on his right paw.

ModelScopeT2V

VideoDirectorGPT (Ours)

Video generation examples on a Coref-SV prompt. Our video plan's object layouts (overlaid) can guide the Layout2Vid module to generate the same mouse and flower across scenes consistently, whereas ModelScopeT2V loses track of the mouse right after the first scene, generating a human hand and a dog instead of a mouse, and the flower changes color.

Scene 1: it's snowing outside.
Scene 2: dog is singing and dancing.
Scene 3: its friends are encouraging it to do something.
Scene 4: its friends are applauding at it.
Scene 5: it is bowing to the audience after the performance.

ModelScopeT2V

VideoDirectorGPT (Ours)

Video generation examples on a Coref-SV prompt. Our video plan's object layouts (overlaid) can guide the Layout2Vid module to generate the same brown dog and maintain snow across scenes consistently, whereas ModelScopeT2V generates different dogs in different scenes and loses the snow after the first scene.


HiREST

make caraway cakes

ModelScopeT2V

VideoDirectorGPT (Ours)

Comparison of generated videos on a HiREST prompt. Our model generates a detailed video plan that properly expands the original text prompt to show the process, places object bounding boxes accurately (overlaid), and maintains the consistency of the person across scenes. ModelScopeT2V generates only the final caraway cake, and the cake is not consistent across scenes.

make strawberry surprise

ModelScopeT2V

VideoDirectorGPT (Ours)

Comparison of generated videos on a HiREST prompt. Our VideoDirectorGPT generates a detailed video plan that properly expands the original text prompt, ensures accurate object bounding box locations (overlaid), and maintains the consistency of the person across the scenes. ModelScopeT2V only generates strawberries.


ActionBench-Direction prompts

pushing stuffed animal from left to right

ModelScopeT2V

VideoDirectorGPT (Ours)

pushing pear from right to left

ModelScopeT2V

VideoDirectorGPT (Ours)

Video generation examples on ActionBench-Direction prompts. Our video plan's object layouts (overlaid) can guide the Layout2Vid module to place and move the 'stuffed animal' and 'pear' in their correct respective directions, whereas the objects in the ModelScopeT2V videos stay in the same location or move in random directions.


VPEval Skill-based prompts

a pizza is to the left of an elephant

ModelScopeT2V

VideoDirectorGPT (Ours)

four frisbees

ModelScopeT2V

VideoDirectorGPT (Ours)

Video generation examples on VPEval Skill-based prompts for the spatial and count skills. Our video plan, with object layouts overlaid, successfully guides the Layout2Vid module to place objects in the correct spatial relations and to depict the correct number of objects, whereas ModelScopeT2V fails to generate 'pizza' in the first example and generates too many frisbees in the second.


User-Provided Input Image → Video

Scene 1: a <S> then gets up from a plush beige bed.
Scene 2: a <S> goes to the cream-colored kitchen and eats a can of gourmet cat snack.
Scene 3: a <S> sits next to a large floor-to-ceiling window.

Input

<S> = "white cat"

Generated Gif



Input

<S> = "cat"
+

Generated Gif



Input

<S> = "cat"
+

Generated Gif



Input

<S> = "teddy bear"
+

Generated Gif


Video generation examples with custom entities. Users can flexibly provide either text-only (1st row) or image+text (2nd to 4th rows) descriptions to place custom entities when generating videos with VideoDirectorGPT. For both text and image+text based entity grounding examples, the identities of the provided entities are well preserved across multiple scenes.


Human-in-the-Loop Editing



Original Gif

Human Edit: Make the horse smaller


Original Gif

Human Edit: Add "grassland" background


Original Gif

Human Edit: Add "night street" background

Video generation examples for human-in-the-loop editing. Users can modify the video plan (e.g., add or delete objects, change the background, adjust entity layouts) to generate customized videos. Given the same text prompt "A horse running", we show results with a smaller horse and with different backgrounds ("grassland" and "night street").
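As a sketch of what such an edit could look like programmatically, the helper below shrinks an entity's boxes around their centers and swaps the background in a plan following the illustrative schema shown earlier; the names and structure are assumptions, not the released editing interface.

    import copy

    def shrink_entity(plan, name, scale=0.5):
        """Shrink an entity's boxes around their centers in every frame layout."""
        plan = copy.deepcopy(plan)
        for scene in plan["scenes"]:
            for layout in scene["layouts"]:
                if name in layout:
                    x0, y0, x1, y1 = layout[name]
                    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
                    w, h = (x1 - x0) * scale, (y1 - y0) * scale
                    layout[name] = [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]
        return plan

    # Assuming `video_plan` was generated for the prompt "A horse running":
    edited_plan = shrink_entity(video_plan, "horse", scale=0.5)
    for scene in edited_plan["scenes"]:
        scene["background"] = "grassland"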

Limitations

Our VideoDirectorGPT framework is for research purposes and is not intended for commercial use (and therefore should be used with caution in real-world applications, with human supervision).

BibTeX


@article{Lin2023VideoDirectorGPT,
        author = {Han Lin and Abhay Zala and Jaemin Cho and Mohit Bansal},
        title = {VideoDirectorGPT: Consistent Multi-Scene Video Generation via LLM-Guided Planning},
        year = {2023},
}