This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI"). We hope the open-source community will contribute to this project. Pull requests are welcome! The current code supports complete training and inference on the Huawei Ascend AI computing system. Models trained on Huawei Ascend can also produce video quality comparable to industry standards.
This project was jointly initiated by the PKU-Tuzhan AIGC Joint Lab and relies on the open-source community to reproduce Sora. The current version is still far from that goal and needs continued improvement and rapid iteration.
- [2024.07.24] v1.2.0 is here! It uses a 3D full attention architecture instead of 2+1D. We released a true 3D video diffusion model trained on 4s 720p videos. Check out our latest report.
- [2024.05.27] We are launching Open-Sora Plan v1.1.0, which significantly improves video quality and length, and is fully open source! Please check out our latest report. Thanks to ShareGPT4Video for its capability to annotate long videos.
- [2024.04.09] Excited to share our latest exploration on metamorphic time-lapse video generation: MagicTime, which learns real-world physics knowledge from time-lapse videos.
- [2024.04.07] Today, we are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. Thanks to HUAWEI NPU for supporting us.
- [2024.03.27] We released the report of VideoCausalVAE, which supports both images and videos. We present our reconstructed video in this demonstration as follows. The text-to-video model is on the way.
- [2024.03.01] We launched a plan to reproduce Sora, called Open-Sora Plan! You are welcome to watch this repository for the latest updates.
93×1280×720 Text-to-Video Generation. The video quality has been compressed for playback on GitHub.
video_24fps_compress.mp4
Open-Sora Plan shows excellent performance in video generation.
- High compression ratio with excellent performance: videos are compressed by 256 times (4×8×8). Causal convolution supports simultaneous inference on images and videos, and training needs only 1 node.
- 3D full attention architecture instead of a 2+1D model: 3D attention can better capture joint spatial and temporal features.
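To make the 4×8×8 compression concrete, here is a minimal sketch that computes the latent grid for a clip. It assumes (this is not stated above) that the causal VAE encodes the first frame on its own and then folds every further 4 frames into one latent frame, so a clip of 4k+1 frames maps to k+1 latent frames. `latent_shape` is a hypothetical helper for illustration, not a function from this repository.

```python
from typing import Tuple

# Compression factors quoted above: 4x in time, 8x8 in space (4 * 8 * 8 = 256).
TIME_STRIDE = 4
SPATIAL_STRIDE = 8


def latent_shape(num_frames: int, height: int, width: int) -> Tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width) for one clip.

    Assumption: the first frame is encoded alone, so num_frames must be 4k + 1.
    """
    if num_frames < 1 or (num_frames - 1) % TIME_STRIDE != 0:
        raise ValueError("num_frames is expected to be of the form 4k + 1")
    if height % SPATIAL_STRIDE or width % SPATIAL_STRIDE:
        raise ValueError("height and width are expected to be multiples of 8")
    latent_frames = (num_frames - 1) // TIME_STRIDE + 1
    return latent_frames, height // SPATIAL_STRIDE, width // SPATIAL_STRIDE


if __name__ == "__main__":
    print(latent_shape(93, 720, 1280))  # (24, 90, 160)
    print(latent_shape(1, 480, 640))    # a single image: (1, 60, 80)
```

Under this assumption a single image is simply a 1-frame clip, which is how one model can handle both images and videos.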
We highly recommend trying our web demo with the following command.
python -m opensora.serve.gradio_web_server --model_path "path/to/model" --ae_path "path/to/causalvideovae"
Coming soon...
Version | Architecture | Diffusion Model | CausalVideoVAE | Data |
---|---|---|---|---|
v1.2.0 | 3D | 93x720p, 29x720p[1], 93x480p[1,2], 1x480p | Anysize | Annotations |
v1.1.0 | 2+1D | 221x512x512, 65x512x512 | Anysize | Data and Annotations |
v1.0.0 | 2+1D | 65x512x512, 65x256x256, 17x256x256 | Anysize | Data and Annotations |
[1] Please note that the weights for v1.2.0 29×720p and 93×480p were trained on Panda70M and have not undergone final high-quality data fine-tuning, so they may produce watermarks.
[2] We fine-tuned 3.5k steps from 93×720p to get 93×480p for community research use.
- Clone this repository and navigate to Open-Sora-Plan folder
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
- Install required packages. We recommend the following requirements.
  - Python >= 3.8
  - PyTorch >= 2.1.0
  - CUDA Version >= 11.7
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
- Install optional requirements such as static type checking:
pip install -e '.[dev]'
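After installation, a quick check against the recommended versions can save a failed training run. The sketch below is illustrative only: `parse_version` and `meets_minimum` are hypothetical helpers, and the only PyTorch attributes it reads are `torch.__version__` and `torch.version.cuda` (which is `None` on CPU-only builds).

```python
import re
import sys
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Keep the leading numeric components, e.g. '2.1.0+cu118' -> (2, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version: str, minimum: str) -> bool:
    """Compare two versions after padding them to the same length."""
    have, need = parse_version(version), parse_version(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


if __name__ == "__main__":
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    print("Python >= 3.8:", meets_minimum(python_version, "3.8"))
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print("PyTorch >= 2.1.0:", meets_minimum(torch.__version__, "2.1.0"))
        cuda = torch.version.cuda  # None when PyTorch was built without CUDA
        print("CUDA >= 11.7:", cuda is not None and meets_minimum(cuda, "11.7"))
```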
Organizing the training data is easy: put all the videos in one directory, at any nesting depth. This makes training more convenient when using multiple datasets.
Training Dataset
|--sub_dataset1
    |--sub_sub_dataset1
        |--video1.mp4
        |--video2.mp4
        ......
    |--sub_sub_dataset2
        |--video3.mp4
        |--video4.mp4
        ......
|--sub_dataset2
    |--video5.mp4
    |--video6.mp4
    ......
|--video7.mp4
|--video8.mp4
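The layout above can be walked with a few lines of standard-library Python. This is only a sketch of the idea, not the repository's data loader; `collect_videos` and `VIDEO_SUFFIXES` are hypothetical names, and restricting the search to `.mp4` is an assumption based on the example tree.

```python
from pathlib import Path
from typing import List

# Assumption: only .mp4 appears in the example tree; extend this set if needed.
VIDEO_SUFFIXES = {".mp4"}


def collect_videos(root: str) -> List[Path]:
    """Return every video file under root, at any nesting depth, in sorted order."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in VIDEO_SUFFIXES
    )
```

Calling `collect_videos("/path/to/dataset")` on the tree above would return `video1.mp4` through `video8.mp4` regardless of which sub-dataset they sit in.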
bash scripts/causalvae/train.sh
The important arguments for training are listed below.
Argparse | Usage |
---|---|
Training size | |
--num_frames | The number of frames used from each training video |
--resolution | The resolution of the input to the VAE |
--batch_size | The local batch size on each GPU |
--sample_rate | The frame interval used when loading training videos |
Data processing | |
--video_path | /path/to/dataset |
Load weights | |
--model_config | /path/to/config.json. The model config of the VAE. Use this parameter if you want to train from scratch. |
--pretrained_model_name_or_path | A directory containing a model checkpoint and its config. This parameter loads only the weights, not the optimizer state. |
--resume_from_checkpoint | /path/to/checkpoint. Resumes the training process from the checkpoint, including both the weights and the optimizer state. |
bash scripts/causalvae/rec_video.sh
The important arguments for inference are listed below.
Argparse | Usage |
---|---|
Output video size | |
--num_frames | The number of frames of generated videos |
--height | The height of generated videos |
--width | The width of generated videos |
Data processing | |
--video_path | The path to the original video |
--rec_path | The path to the generated video |
Load weights | |
--ae_path | /path/to/model_dir. A directory containing the VAE checkpoint used for inference and its model config.json |
Other | |
--enable_tiling | Use tiling to handle videos of high resolution and long duration |
--save_memory | Reduce memory use during inference at a slight cost in quality |
For evaluation, you should save the original video clips by using --output_origin.
bash scripts/causalvae/prepare_eval.sh
The important arguments for preparing the evaluation are listed below.
Argparse | Usage |
---|---|
Output video size | |
--num_frames | The number of frames of generated videos |
--resolution | The resolution of generated videos |
Data processing | |
--real_video_dir | The directory of the original videos. |
--generated_video_dir | The directory of the generated videos. |
Load weights | |
--ckpt | /path/to/model_dir. A directory containing the VAE checkpoint used for inference and its model config. |
Other | |
--enable_tiling | Use tiling to handle videos of high resolution and long duration. |
--output_origin | Output the original video clips that are fed into the VAE. |
Then run the evaluation. The important arguments in the evaluation script are listed below.
bash scripts/causalvae/eval.sh
Argparse | Usage |
---|---|
--metric | The metric, such as psnr, ssim, lpips |
--real_video_dir | The directory of the original videos. |
--generated_video_dir | The directory of the generated videos. |
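As a reminder of what the simplest of these metrics measures, here is a minimal sketch of the textbook PSNR definition, 10·log10(MAX² / MSE), over flattened pixel values. It is not the repository's evaluation code, which may average per frame or per video differently; `psnr` is a hypothetical helper and the 0 to 255 pixel range is an assumption.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [0, 10, 20, 30]
    reconstructed = [1, 11, 21, 31]  # every pixel is off by one, so MSE = 1
    print(f"{psnr(original, reconstructed):.2f} dB")  # 48.13 dB
```

Higher is better for PSNR and SSIM, while lower is better for LPIPS.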
We use a data.txt file to specify all the training data. Each line in the file consists of DATA_ROOT and DATA_JSON, separated by a comma. An example data.txt is as follows.
/path/to/data_root_1,/path/to/data_json_1.json
/path/to/data_root_2,/path/to/data_json_2.json
...
Next, we introduce the format of the annotation json file. The absolute data path is the concatenation of DATA_ROOT and the "path" field in the annotation json file.
The format of the image annotation file is as follows.
[
  {
    "path": "00168/001680102.jpg",
    "cap": [
      "xxxxx."
    ],
    "resolution": {
      "height": 512,
      "width": 683
    }
  },
  ...
]
The format of the video annotation file is as follows. For more details, refer to the HF dataset.
[
  {
    "path": "panda70m_part_5565/qLqjjDhhD5Q/qLqjjDhhD5Q_segment_0.mp4",
    "cap": [
      "A man and a woman are sitting down on a news anchor talking to each other."
    ],
    "resolution": {
      "height": 720,
      "width": 1280
    },
    "fps": 29.97002997002997,
    "duration": 11.444767
  },
  ...
]
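The data.txt and annotation formats above can be read with the standard library alone. The following is a minimal sketch rather than the repository's dataset class: `read_data_txt` and `iter_samples` are hypothetical helpers, `abs_path` is a key added here for illustration, and the code assumes that DATA_ROOT itself contains no comma.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List, Tuple


def read_data_txt(data_txt: str) -> List[Tuple[str, str]]:
    """Parse lines of the form 'DATA_ROOT,DATA_JSON', skipping blank lines."""
    pairs = []
    for line in Path(data_txt).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: DATA_ROOT has no comma, so the first comma is the separator.
        data_root, data_json = line.split(",", 1)
        pairs.append((data_root.strip(), data_json.strip()))
    return pairs


def iter_samples(data_txt: str) -> Iterator[Dict]:
    """Yield each annotation with the absolute path DATA_ROOT / item['path'] attached."""
    for data_root, data_json in read_data_txt(data_txt):
        with open(data_json, encoding="utf-8") as handle:
            annotations = json.load(handle)
        for item in annotations:
            yield {**item, "abs_path": str(Path(data_root) / item["path"])}
```

Iterating `iter_samples("/path/to/data.txt")` then gives one dictionary per image or video, with its captions under `"cap"` and the resolved file location under `"abs_path"`.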
bash scripts/text_condition/gpu/train_t2v.sh
We introduce some key parameters so that you can customize your training process.
Argparse | Usage |
---|---|
Training size | |
--num_frames 61 | The number of frames per training video. Change it to train on videos of different durations, e.g., 29, 61, 93, 125... |
--max_height 640 | The height of training videos. Change it to train at a different resolution. |
--max_width 480 | The width of training videos. Change it to train at a different resolution. |
Data processing | |
--data /path/to/data.txt | Specify your training data. |
--speed_factor 1.25 | Speed up videos by 1.25x. |
--drop_short_ratio 1.0 | If you do not want to train on videos of dynamic durations, this value discards all video data whose frame count is not equal to --num_frames. |
--group_frame | If you want to train on videos of dynamic durations, we highly recommend specifying --group_frame as well. It improves computational efficiency during training. |
Multi-stage transfer learning | |
--interpolation_scale_h 1.0 | Suppose you trained a base model at 240p (--max_height 240, --interpolation_scale_h 1.0) and want to initialize a higher-resolution model such as 480p (height 480) from the 240p weights. Then set --max_height 480 and --interpolation_scale_h 2.0, and set --pretrained to your 240p weights path (path/to/240p/xxx.safetensors). |
--interpolation_scale_w 1.0 | Same as --interpolation_scale_h 1.0, applied to the width. |
Load weights | |
--pretrained | Typically used to load pretrained weights across stages, such as using 240p weights to initialize 480p training, or when switching datasets without keeping the previous optimizer state. |
--resume_from_checkpoint | Resumes the training process from the latest checkpoint in --output_dir. Typically, we set --resume_from_checkpoint="latest", which is useful after unexpected interruptions during training. |
Sequence Parallelism | |
--sp_size 8 --train_sp_batch_size 2 | Runs a batch size of 2 across 8 GPUs (all 8 GPUs on the same node). |
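The multi-stage example in the table (240p with scale 1.0, then 480p with scale 2.0) suggests that the interpolation scale is the ratio of the new resolution to the base resolution. The sketch below encodes only that reading; the linear rule is an assumption drawn from this single example, and `interpolation_scale` is a hypothetical helper, so check the training scripts before relying on it for other resolutions.

```python
def interpolation_scale(target_size: int, base_size: int, base_scale: float = 1.0) -> float:
    """Scale for a higher-resolution stage, assuming it grows linearly with size."""
    if target_size <= 0 or base_size <= 0:
        raise ValueError("sizes must be positive")
    return base_scale * target_size / base_size


if __name__ == "__main__":
    # Reproduces the example from the table: a 240p base model, then a 480p stage.
    scale_h = interpolation_scale(target_size=480, base_size=240)
    print(f"--max_height 480 --interpolation_scale_h {scale_h}")  # ... 2.0
```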
We provide multiple inference scripts to support various requirements. We recommend the configuration --guidance_scale 7.5 --num_sampling_steps 100 --sample_method EulerAncestralDiscrete for sampling.
We report inference speed on H100 for the sizes below.
Size | 1 GPU | 8 GPUs (sp) |
---|---|---|
29×720p | 420s/100step | 80s/100step |
93×720p | 3400s/100step | 450s/100step |
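The table can be turned into speedup and parallel-efficiency figures with simple arithmetic. The snippet below uses only the numbers reported above; `speedup` and `efficiency` are hypothetical helper names, and efficiency is defined here as speedup divided by the number of GPUs.

```python
# Seconds per 100 sampling steps, taken from the table above: (1 GPU, 8 GPUs with sp).
TIMINGS = {
    "29x720p": (420.0, 80.0),
    "93x720p": (3400.0, 450.0),
}
NUM_GPUS = 8


def speedup(single_gpu_seconds: float, multi_gpu_seconds: float) -> float:
    """How many times faster the multi-GPU run is."""
    return single_gpu_seconds / multi_gpu_seconds


def efficiency(single_gpu_seconds: float, multi_gpu_seconds: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling that is achieved."""
    return speedup(single_gpu_seconds, multi_gpu_seconds) / num_gpus


if __name__ == "__main__":
    for size, (one_gpu, eight_gpus) in TIMINGS.items():
        s = speedup(one_gpu, eight_gpus)
        e = efficiency(one_gpu, eight_gpus, NUM_GPUS)
        print(f"{size}: {s:.2f}x speedup, {e:.1%} efficiency")
    # 29x720p: 5.25x speedup, 65.6% efficiency
    # 93x720p: 7.56x speedup, 94.4% efficiency
```

By these numbers, sequence parallelism pays off most on the longer 93-frame clips.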
If you only have one GPU, the following script performs inference on each sample sequentially, one at a time.
bash scripts/text_condition/gpu/sample_t2v.sh
To run batch inference over a large number of samples, use the following script; each GPU infers one sample at a time.
bash scripts/text_condition/gpu/sample_t2v_ddp.sh
To infer a single sample quickly, use the following script; it uses all GPUs simultaneously for that one sample.
bash scripts/text_condition/gpu/sample_t2v_sp.sh
Coming soon...
Coming soon...
Coming soon...
We greatly appreciate your contributions to the Open-Sora Plan open-source community and your help in making it even better!
For more details, please refer to the Contribution Guidelines.
- Latte: A wonderful 2+1D video generation model.
- PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
- VideoGPT: Video Generation using VQ-VAE and Transformers.
- DiT: Scalable Diffusion Models with Transformers.
- FiT: Flexible Vision Transformer for Diffusion Model.
- Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.
- See LICENSE for details.
@software{pku_yuan_lab_and_tuzhan_ai_etc_2024_10948109,
author = {PKU-Yuan Lab and Tuzhan AI etc.},
title = {Open-Sora-Plan},
month = apr,
year = 2024,
publisher = {GitHub},
doi = {10.5281/zenodo.10948109},
url = {https://doi.org/10.5281/zenodo.10948109}
}