
Add CogVideoX text-to-video generation model #9082

Merged: 107 commits into huggingface:main, Aug 7, 2024

Conversation

zRzRzRzRzRzRzR (Contributor)

What does this PR do?

This PR converts the CogVideoX model into a diffusers-supported inference model, including a complete pipeline and the corresponding modules. The paper is still being written, so the documentation may temporarily omit some details about it.

Who can review?

@yiyixuxu
@stevhliu and @sayakpaul

Without torch.compile(): Average inference time: TODO seconds.
With torch.compile(): Average inference time: TODO seconds.

Member

We can also include a tip section like we have in Flux:
https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux

This way users are aware of the optimizations that are possible.
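
As a concrete illustration, such a tip could point readers at the memory/speed trade-offs already used elsewhere in this thread, e.g. `enable_model_cpu_offload` and `torch.compile`. A minimal sketch, not the final doc wording:

```python
import torch

from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16)

# Option 1: trade speed for memory by offloading idle submodules to the CPU.
pipe.enable_model_cpu_offload()

# Option 2 (alternative, needs the model fully on GPU): trade compile time for throughput.
# pipe.to("cuda")
# pipe.transformer = torch.compile(pipe.transformer)
```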

Member

We should probably also mention that users can benefit from context-parallel caching.

hidden_states: torch.Tensor,
temb: Optional[torch.Tensor] = None,
zq: Optional[torch.Tensor] = None,
clear_fake_cp_cache: bool = False,
Member

Curiosity: why is this fake?

zRzRzRzRzRzRzR (Contributor, Author)

Because this is executed serially on a single GPU.
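
To make that concrete, here is a rough, hypothetical sketch of the idea (not the PR's actual code): instead of exchanging boundary frames between ranks, a single GPU processes the temporal chunks one after another and carries the trailing frames of the previous chunk forward as a cache.

```python
import torch


def process_chunks_serially(chunks, time_kernel_size: int):
    # chunks: list of (B, C, T, H, W) tensors, processed in temporal order.
    conv_cache = None
    outputs = []
    for chunk in chunks:
        if conv_cache is not None:
            # "Receive" the boundary frames that a real context-parallel setup
            # would get from the previous rank.
            chunk = torch.cat([conv_cache, chunk], dim=2)
        # Cache the trailing frames for the next chunk.
        conv_cache = chunk[:, :, -time_kernel_size + 1 :].clone()
        # ... run the causal temporal convolution over `chunk` here ...
        outputs.append(chunk)
    return outputs
```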

a-r-r-o-w and others added 3 commits August 6, 2024 07:02
Co-Authored-By: YiYi Xu <yixu310@gmail.com>
Co-Authored-By: YiYi Xu <yixu310@gmail.com>
Co-Authored-By: YiYi Xu <yixu310@gmail.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w (Member)

@zRzRzRzRzRzRzR Just to notify, I have removed the clear_fake_cp_cache parameter and switched to something I thought was cleaner and more understandable. It should be consistent with the old implementation AFAICT. Now trying to debug why we use double the memory compared to the SAT implementation.

input_parallel = self.fake_cp_pass_from_previous_rank(inputs)

self._clear_fake_context_parallel_cache()
self.conv_cache = input_parallel[:, :, -self.time_kernel_size + 1 :].contiguous().detach().clone().cpu()
@a-r-r-o-w (Member) commented on Aug 6, 2024
@zRzRzRzRzRzRzR Is there any reason why we're moving the latents to CPU here, only to move them back to inputs.device in fake_cp_pass_from_previous_rank? Since this implementation is tailored to a single GPU, and since I don't think the latents will reside on multiple devices in our implementation, we can get rid of the .contiguous().detach().clone().cpu() part and just keep them on the same device. cc @yiyixuxu

Edit: Tried it, and it looks like .contiguous().clone() will be required, but I don't think we need to move to CPU here for the single-GPU case.
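
As a sketch of that suggestion (single-GPU case only, with a hypothetical helper name), the cache update could simply stay on the input device:

```python
import torch


def update_conv_cache(input_parallel: torch.Tensor, time_kernel_size: int) -> torch.Tensor:
    # Keep the trailing frames on the same device as the inputs, dropping the
    # .detach().cpu() round-trip discussed above.
    return input_parallel[:, :, -time_kernel_size + 1 :].contiguous().clone()
```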

@yiyixuxu (Collaborator) left a comment

let's merge this soon!


if self.chunk_dim == 1:
# It is a bit odd that the order is "shift, scale" here and "scale, shift" in the
# other if-branch. This branch is specific to CogVideoX for now.
Collaborator

@a-r-r-o-w we normally just adjust the weights instead (OK to keep this! I don't think we need to update the weights now, just FYI): https://github.com/huggingface/diffusers/blob/main/scripts/convert_sd3_to_diffusers.py#L37
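
For context, the weight adjustment referenced there is roughly the following (a sketch modelled on the linked SD3 conversion script; the exact signature there may differ):

```python
import torch


def swap_scale_shift(weight: torch.Tensor) -> torch.Tensor:
    # Reorder a checkpoint tensor stored as (shift, scale) into the
    # (scale, shift) order expected by the diffusers norm layers, so the
    # modeling code does not need a special-cased branch.
    shift, scale = weight.chunk(2, dim=0)
    return torch.cat([scale, shift], dim=0)
```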

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
@stevhliu (Member) left a comment

Great work, thanks!

docs/source/en/api/pipelines/cogvideox.md

First, load the pipeline:

Member

Do we need to include this code block to demonstrate torch.compile, or is it to show inference time without torch.compile? If it's not necessary, I'm more in favor of just showing the code below to keep it simpler.

```python
import torch

from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# create pipeline
pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16).to("cuda")

# set to channels_last
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)

# compile
pipeline.transformer = torch.compile(pipeline.transformer)
pipeline.vae.decode = torch.compile(pipeline.vae.decode)

# inference
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

@yiyixuxu (Collaborator) commented on Aug 6, 2024

I made a PR to update the scheduler config on the Hub (https://huggingface.co/THUDM/CogVideoX-2b/discussions/2); we can merge that after this PR is merged here.

You can test this PR with revision="refs/pr/2". Here is the script I used to run tests on both schedulers, with and without dynamic CFG:

```python
import tempfile

import imageio
import numpy as np
import PIL
import torch

from diffusers import CogVideoXDDIMScheduler, CogVideoXDPMScheduler, CogVideoXPipeline
from diffusers.utils import export_to_video

def export_to_video_imageio(video_frames, output_video_path: str = None, fps: int = 8):
    """
    Export the video frames to a video file using the imageio library, to avoid the "green screen" issue (seen with CogVideoX, for example).
    """
    if output_video_path is None:
        output_video_path = tempfile.NamedTemporaryFile(suffix=".mp4").name

    if isinstance(video_frames[0], PIL.Image.Image):
        video_frames = [np.array(frame) for frame in video_frames]

    with imageio.get_writer(output_video_path, fps=fps) as writer:
        for frame in video_frames:
            writer.append_data(frame)

    return output_video_path



prompts = [
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance.",
    "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.",
    "The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from it’s tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.",
    "A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.",
    "In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict."
    ]


pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16, revision="refs/pr/2")
pipe.enable_model_cpu_offload()

for prompt in prompts:
    for seed in [3]:
        # test ddim
        pipe.scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config)
        assert pipe.scheduler.config._class_name == "CogVideoXDDIMScheduler" and pipe.scheduler.config.timestep_spacing == "trailing"

        generator = torch.Generator(device="cpu").manual_seed(seed)
        video = pipe(prompt, guidance_scale=6, num_inference_steps=50, generator=generator).frames[0]
        export_to_video_imageio(video, f"{prompt[:10]}_{seed}_ddim.mp4", fps=8)

        assert pipe.scheduler.config._class_name == "CogVideoXDDIMScheduler" and pipe.scheduler.config.timestep_spacing == "trailing"
        generator = torch.Generator(device="cpu").manual_seed(seed)
        video = pipe(prompt, guidance_scale=6, num_inference_steps=50, generator=generator, use_dynamic_cfg=True).frames[0]
        export_to_video_imageio(video, f"{prompt[:10]}_{seed}_ddim_dynamic_cfg.mp4", fps=8)

        # test dpm
        pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)
        assert pipe.scheduler.config._class_name == "CogVideoXDPMScheduler" and pipe.scheduler.config.timestep_spacing == "trailing"

        generator = torch.Generator(device="cpu").manual_seed(seed)
        video = pipe(prompt, guidance_scale=6, num_inference_steps=50, generator=generator).frames[0]
        export_to_video_imageio(video, f"{prompt[:10]}_{seed}_dpm.mp4", fps=8)

        assert pipe.scheduler.config._class_name == "CogVideoXDPMScheduler" and pipe.scheduler.config.timestep_spacing == "trailing"
        generator = torch.Generator(device="cpu").manual_seed(seed)
        video = pipe(prompt, guidance_scale=6, num_inference_steps=50, generator=generator, use_dynamic_cfg=True).frames[0]
        export_to_video_imageio(video, f"{prompt[:10]}_{seed}_dpm_dynamic_cfg.mp4", fps=8)
```

sayakpaul and others added 3 commits August 7, 2024 07:49
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Comment on lines +288 to +298
if prompt is not None and type(prompt) is not type(negative_prompt):
    raise TypeError(
        f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
        f" {type(prompt)}."
    )
elif batch_size != len(negative_prompt):
    raise ValueError(
        f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
        f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
        " the batch size of `prompt`."
    )
Member

This should ideally have gone into check_inputs().

Member

Ah cool, then we should do it here too:

if prompt is not None and type(prompt) is not type(negative_prompt):

as it was copied over from there
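
For illustration only, hoisting the validation into a `check_inputs`-style helper might look like this (names are hypothetical, not the merged code):

```python
from typing import List, Optional, Union


def check_negative_prompt(
    prompt: Optional[Union[str, List[str]]],
    negative_prompt: Optional[Union[str, List[str]]],
    batch_size: int,
) -> None:
    # Same checks as the snippet above, collected in one validation helper.
    if prompt is not None and type(prompt) is not type(negative_prompt):
        raise TypeError(
            f"`negative_prompt` should be the same type as `prompt`, but got {type(negative_prompt)} != {type(prompt)}."
        )
    if isinstance(negative_prompt, list) and batch_size != len(negative_prompt):
        raise ValueError(
            f"`negative_prompt` has batch size {len(negative_prompt)}, but `prompt` has batch size {batch_size}."
        )
```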

Comment on lines +537 to +539
assert (
    num_frames <= 48 and num_frames % fps == 0 and fps == 8
), f"The number of frames must be divisible by {fps=} and less than 48 frames (for now). Other values are not supported in CogVideoX."
Member

Should have been raised as a ValueError.
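
For example, the same guard expressed as an explicit exception (a sketch with a hypothetical helper name; the merged code may phrase it differently):

```python
def validate_frames(num_frames: int, fps: int) -> None:
    # Same condition as the assert above, but raised as a ValueError so it
    # also fires under `python -O` and gives callers a catchable error.
    if not (num_frames <= 48 and num_frames % fps == 0 and fps == 8):
        raise ValueError(
            f"The number of frames must be divisible by {fps=} and less than 48 frames (for now). "
            "Other values are not supported in CogVideoX."
        )
```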


# 5. Prepare latents.
latent_channels = self.transformer.config.in_channels
num_frames += 1
Member

It wouldn't hurt to have a comment explaining why this addition is required.

@yiyixuxu yiyixuxu merged commit 2dad462 into huggingface:main Aug 7, 2024
15 checks passed