Chenming Zhu
Tai Wang*
Wenwei Zhang
Jiangmiao Pang
Xihui Liu*
The University of Hong Kong, Shanghai AI Laboratory
- Python >= 3.10
- PyTorch == 2.1.0
- CUDA Version >= 11.7
- Install required packages:
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install git+https://github.com/facebookresearch/pytorch3d.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.1.0+cu118.html
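After installation, it can help to confirm that the environment matches the version requirements listed above. The following is a minimal sketch and not part of this repository: `parse_version` and `meets_requirements` are hypothetical helpers that only compare version strings, so you pass in the values yourself.

```python
import re


def parse_version(text):
    """Turn a version string such as '2.1.0+cu118' into a tuple of ints, e.g. (2, 1, 0)."""
    release = text.split("+")[0]  # drop local build tags like '+cu118'
    return tuple(int(part) for part in re.findall(r"\d+", release))


def meets_requirements(python_version, torch_version, cuda_version):
    """Check the requirements from this README (hypothetical helper).

    Python >= 3.10, PyTorch == 2.1.0, CUDA >= 11.7.
    Returns a list of human-readable problems; an empty list means all checks passed.
    """
    problems = []
    if parse_version(python_version)[:2] < (3, 10):
        problems.append(f"Python {python_version} is older than 3.10")
    if parse_version(torch_version)[:3] != (2, 1, 0):
        problems.append(f"PyTorch {torch_version} is not 2.1.0")
    if cuda_version is None:
        problems.append("PyTorch was built without CUDA support")
    elif parse_version(cuda_version)[:2] < (11, 7):
        problems.append(f"CUDA {cuda_version} is older than 11.7")
    return problems
```

Inside the activated environment you could call it as `meets_requirements(platform.python_version(), torch.__version__, torch.version.cuda)` after importing `platform` and `torch`. An empty list means the three version checks passed.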
You can use the fine-tuned model to run inference on an RGB-D scene video on a single GPU:
./inference.sh
The batch_size during inference is currently fixed to 1. To accelerate evaluation, we provide a script for multi-GPU inference on SLURM:
bash eval_multiprocess_mmscan_qa_slurm.sh
You can easily adapt it to run on a local machine (non-SLURM environment).
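One common way to run such an evaluation locally is to split the samples into one shard per GPU and start one single-GPU process per shard. The sketch below shows only the sharding step. It is an assumption about how you might adapt the workflow, not the logic used in eval_multiprocess_mmscan_qa_slurm.sh, and `shard_samples` is a hypothetical helper.

```python
def shard_samples(samples, num_shards):
    """Split samples into num_shards near-equal, contiguous shards (hypothetical helper)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # The first `extra` shards receive one additional sample each.
    base, extra = divmod(len(samples), num_shards)
    shards, start = [], 0
    for rank in range(num_shards):
        size = base + (1 if rank < extra else 0)
        shards.append(samples[start:start + size])
        start += size
    return shards


if __name__ == "__main__":
    # Ten placeholder sample indices spread over four GPUs.
    for rank, shard in enumerate(shard_samples(list(range(10)), 4)):
        print(f"GPU {rank}: {len(shard)} samples")
```

Each process can then be limited to one device by setting the `CUDA_VISIBLE_DEVICES` environment variable before it starts, and the per-shard results merged afterwards.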
Convert your data to a JSON file containing a list of all samples. Each sample's metadata should contain id (a unique identifier), video (the path to the video), and conversations (the conversation data between human and AI).
A sample JSON for fine-tuning LLaVA-3D to generate 3D scene captions:
[
  {
    "id": 0,
    "video": "frames/scannet/scene0442_00",
    "conversations": [
      {
        "from": "human",
        "value": "<video>\nDescribe the room concisely."
      },
      {
        "from": "gpt",
        "value": "In the opulent living room, adorned with four chairs, four tables, and five armchairs, a symphony of elegance unfolds. The chairs, positioned in front of the tables, create an inviting space for conversation and relaxation. The tables, in turn, stand proudly behind the chairs, offering a surface for books, drinks, or cherished mementos. The armchairs, scattered throughout the room, beckon weary souls to sink into their plush embrace. This living room exudes comfort and sophistication, a sanctuary for both solitary contemplation and convivial gatherings."
      }
    ]
  },
  ...
]
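The format above can also be produced and checked programmatically. The following is a minimal sketch under stated assumptions: `make_sample` and `validate_samples` are hypothetical helpers, not part of this repository, and the answer text is a placeholder. It builds one sample with the three required keys, checks for missing keys and duplicate ids, and serializes the list to JSON.

```python
import json

REQUIRED_KEYS = ("id", "video", "conversations")


def make_sample(sample_id, video_path, question, answer):
    """Build one sample in the format shown above (hypothetical helper)."""
    return {
        "id": sample_id,
        "video": video_path,
        "conversations": [
            # The human turn starts with the <video> placeholder, as in the sample.
            {"from": "human", "value": f"<video>\n{question}"},
            {"from": "gpt", "value": answer},
        ],
    }


def validate_samples(samples):
    """Raise ValueError on a missing key, a duplicate id, or a malformed turn."""
    seen_ids = set()
    for index, sample in enumerate(samples):
        missing = [key for key in REQUIRED_KEYS if key not in sample]
        if missing:
            raise ValueError(f"sample {index} is missing keys: {missing}")
        if sample["id"] in seen_ids:
            raise ValueError(f"duplicate id: {sample['id']}")
        seen_ids.add(sample["id"])
        for turn in sample["conversations"]:
            if turn.get("from") not in ("human", "gpt") or "value" not in turn:
                raise ValueError(f"sample {index} has a malformed conversation turn")


samples = [
    make_sample(
        0,
        "frames/scannet/scene0442_00",
        "Describe the room concisely.",
        "A living room with four chairs and four tables.",  # placeholder answer
    )
]
validate_samples(samples)
serialized = json.dumps(samples, indent=2)  # write this string to your data file
```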
- Release training and inference code.
- Release gradio demo.
- Release checkpoints and datasets.
This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.