
LLAVIDAL 🏃👩‍🦯‍➡️🗨️

LLAVIDAL: Benchmarking Large LAnguage VIsion Models for Daily Activities of Living

* Equally contributing first authors

Affiliations:

1 University of North Carolina at Charlotte
2 Inria, Université Côte d’Azur

This codebase is adapted from Video-ChatGPT.

News 📢


LLAVIDAL Overview 👁️:

LLAVIDAL (Large LAnguage VIsion model for Daily Activities of Living) is a multimodal model designed to understand and generate meaningful conversations about activities of daily living (ADL) performed by humans in videos. Its architecture integrates multiple modalities, including video, 3D human poses, and object interaction cues, with a large language model (LLM). Here's an overview of LLAVIDAL's Approach:

LLAVIDAL Approach Overview

Contributions ⭐

• We introduce ADL-X, the first multiview RGBD instruction ADL dataset, curated through a novel semi-automated framework for training LLVMs.

• LLAVIDAL is introduced as the first LLVM tailored for ADL, incorporating 3D poses and object cues into the embedding space of the LLM.

• A new benchmark, ADLMCQ, is proposed for an objective evaluation of LLVMs on ADL tasks, featuring MCQ tasks for action recognition & forecasting.

• Exhaustive experiments are conducted to determine the optimal strategy for integrating poses or objects into LLAVIDAL. Evaluation of existing LLVMs on ADLMCQ and video description tasks reveals that LLAVIDAL trained on ADL-X significantly outperforms baseline LLVMs.


LLAVIDAL Architecture ⚙️

LLAVIDAL Architecture Overview

Overview of LLAVIDAL, which utilizes an LLM to integrate multiple modalities, including video, pose, and object features. Videos are represented by embeddings obtained from a VLM, pose features are extracted by PoseLM, and object features by ObjectLM. These embeddings are projected into the LLM space, where they are concatenated with tokenized text queries for instruction tuning.
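
As a rough illustration of how the modality features reach the LLM, the sketch below (hypothetical tensor shapes and module names, not the repository code) projects per-modality features into the LLM embedding space and concatenates them with the embedded text query:

import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    # Hypothetical sketch: map per-modality features into the LLM token space.
    def __init__(self, feat_dim, llm_dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feats):                      # feats: (batch, num_tokens, feat_dim)
        return self.proj(feats)                    # -> (batch, num_tokens, llm_dim)

llm_dim = 4096                                     # embedding size of a 7B LLaMA-class LLM
video_proj = ModalityProjector(1024, llm_dim)      # CLIP spatio-temporal features (assumed dim)
object_proj = ModalityProjector(512, llm_dim)      # ObjectLM features (assumed dim)

video_tokens = video_proj(torch.randn(1, 356, 1024))    # video tokens, Video-ChatGPT-style count
object_tokens = object_proj(torch.randn(1, 8, 512))     # a few object tokens
text_embeds = torch.randn(1, 32, llm_dim)               # embedded, tokenized text query

# Concatenate the projected modality tokens with the text tokens before feeding the LLM
llm_input = torch.cat([video_tokens, object_tokens, text_embeds], dim=1)
print(llm_input.shape)                             # torch.Size([1, 396, 4096])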


Installation 🔧

Our Python environment is identical to Video-ChatGPT's; we recommend following their installation instructions:

conda create --name=llavidal python=3.10
conda activate llavidal

git clone https://github.com/ADL-X/LLAVIDAL.git
cd LLAVIDAL
pip install -r requirements.txt

export PYTHONPATH="./:$PYTHONPATH"

Additionally, if you are using A100/H100 GPUs, you can install FlashAttention:

pip install ninja

git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
git checkout v1.0.7
python setup.py install

Running Demo 🚗

To run the LLAVIDAL demo on your local GPU machine, please adhere to the following steps. Keep in mind that the demo requires around 18 GB of GPU memory.

  1. Follow the installation instructions above
  2. Download the LLAVIDAL weights from the following link
  3. Download the LLaVA weights from this link

Finally, run the demo by executing the following command:

python llavidal/demo/video_demo.py \
    --model-name <path to the LLaVA-7B-Lightening-v1-1 weights downloaded in step 3> \
    --projection_path <path to the downloaded llavidal weights downloaded in step 2>

After running the command, follow the on-screen instructions to access the demo dashboard.

Training 💪🦾

We train the LLAVIDAL model on our 100K video instruction dataset, initializing the training from LLaVA. Please follow the instructions below to train the LLAVIDAL-7B model.

Prepare LLaVA weights

LLAVIDAL is built using LLaVA. Please follow the Video-ChatGPT instructions below to get the LLaVA weights.

Get the original LLaMA weights in the Hugging Face format, then use the following script to obtain the LLaVA weights by applying the delta.

python scripts/apply_delta.py \ 
        --base-model-path <path to LLaMA 7B weights> \
        --target-model-path LLaVA-Lightning-7B-v1-1 \
        --delta-path liuhaotian/LLaVA-Lightning-7B-delta-v1-1

The above command will download the LLaVA-Lightning-7B-v1-1 delta from Hugging Face, apply it to the provided LLaMA weights, and save the resulting LLaVA-Lightning-7B-v1-1 weights in the current directory. Alternatively, you can download ready-made weights from mmaaz60/LLaVA-Lightening-7B-v1-1.

Prepare Dataset

  1. Download our ADL-X dataset video features, or curate the dataset by following the steps in [[Video Instruction Dataset]] below.
  2. Convert the downloaded NTU_QA.json into the required format for training:
python scripts/convert_instruction_json_to_training_format.py \
        --input_json_file <path to NTU_QA.json downloaded in step 1> \
        --output_json_file llavidal_training.json
The above script will generate llavidal_training.json, which is required to train our model.
  3. Prepare spatio-temporal features using CLIP. Note that for training efficiency, we pre-computed the video spatio-temporal features and use them directly during training. After downloading the videos, please use the following command to generate CLIP spatio-temporal features:
python scripts/save_spatio_temporal_clip_features.py \
       --video_dir_path <path to the directory containing all the videos> \
       --clip_feat_path <The output dir to save the features in.>

The script will generate the spatio-temporal features for each video and save one pickle file per video in the directory specified by the --clip_feat_path argument. Alternatively, you can download the pre-computed spatio-temporal CLIP features from here. (A short sketch for inspecting these feature files appears at the end of this section.)

  4. We also provide object features and pose features, which are used as additional cues during training; they can be downloaded from here. We use the object features in our final model, as they show superior capabilities across our evaluation metrics.

  5. Train LLAVIDAL. We trained on 8 A6000 40GB GPUs using the following command:

torchrun --nproc_per_node=8 --master_port 29001 llavidal/train/train_mem.py \
          --model_name_or_path <path to LLaVA-7B-Lightening-v-1-1 model> \
          --version v1 \
          --data_path <path to llavidal_training.json generated in step 2 using the `convert_instruction_json_to_training_format.py` script> \
          --video_folder <path to the spatio-temporal features generated in step 3 using the `save_spatio_temporal_clip_features.py` script> \
          --object_folder <path to the downloaded object features> \
          --tune_mm_mlp_adapter True \
          --mm_use_vid_start_end \
          --bf16 True \
          --output_dir ./LLAVIDAL_7B-1.1_Checkpoints \
          --num_train_epochs 3 \
          --per_device_train_batch_size 4 \
          --per_device_eval_batch_size 4 \
          --gradient_accumulation_steps 1 \
          --evaluation_strategy "no" \
          --save_strategy "steps" \
          --save_steps 3000 \
          --save_total_limit 3 \
          --learning_rate 2e-5 \
          --weight_decay 0. \
          --warmup_ratio 0.03 \
          --lr_scheduler_type "cosine" \
          --logging_steps 100 \
          --tf32 True \
          --model_max_length 2048 \
          --gradient_checkpointing True \
          --lazy_preprocess True

To train with pose features instead of object features, swap in the pose features and change one line in the code so that train_mem.py uses train_pose.py and llavidal_pose.py. Similarly, to use both object and pose features, use train_pose_object.py and llavidal_pose_object.py and pass the object and pose paths together.
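
The precomputed video, object, and pose features above are stored as one pickle file per video. Below is a minimal sketch for sanity-checking a feature file before training; the directory path is hypothetical and the layout inside the pickle is an assumption, so adjust it to whatever the download actually contains.

import os
import pickle
import numpy as np

feature_dir = "clip_features"            # hypothetical path to the --clip_feat_path output
sample = sorted(os.listdir(feature_dir))[0]

with open(os.path.join(feature_dir, sample), "rb") as f:
    feats = pickle.load(f)

# The file may hold a raw array or a dict of arrays depending on the release;
# print whatever is there to confirm shapes before launching training.
if isinstance(feats, dict):
    for key, value in feats.items():
        print(key, np.asarray(value).shape)
else:
    print(sample, np.asarray(feats).shape)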


Video Instruction Dataset 📂:

We introduce ADL-X, the first ADL-centric video instruction dataset. Due to licensing restrictions we cannot share the original videos, but we provide the video features of our dataset, along with the object features and the pose features.

The dataset is available at LINK. The folders are Video_features, Pose Features, and Object Features.

If you want to recreate our dataset curation pipeline you can do so in the following steps:

Step 1: Download the NTU RGB+D dataset, following the provider's steps to obtain it.

Step 2: Download the action combination list we created (ACTION LIST).

Step 3: Arrange the NTU videos in performer folders such as P001, P002, etc.

Step 4: Run the code,

python /data_annotation/process_video_sequences.py

and pass the action combination list and video folder paths.

Step 5: Download and set up CogVLM. Follow its instructions to deploy the Hugging Face version and obtain frame-level annotations at 0.5 fps. Run the CogVLM demo with:

python cli_demo_hf.py --from_pretrained THUDM/cogvlm-chat-hf --quant 4

Step 6: Get dense descriptions from GPT-3.5 Turbo using:

python /data_annotation/generate_dense_descriptions.py

Pass the appropriate file paths and your OpenAI API key.

Step 7: Get QA pairs by running:

python /data_annotation/generate_QA_pairs.py

Pass the previously generated dense captions and your OpenAI API key here.
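
Both GPT-3.5 Turbo steps above reduce to chat-completion calls over the frame-level captions. A minimal sketch of that kind of call is shown below; it is not the repository script, the prompt and captions are placeholders, and the OpenAI Python SDK v1 interface is assumed.

from openai import OpenAI   # assumes the v1 OpenAI Python SDK

client = OpenAI()           # reads OPENAI_API_KEY from the environment

frame_captions = [
    "A person opens the refrigerator.",
    "A person takes out a bottle of water.",
]   # hypothetical frame-level captions from CogVLM

prompt = (
    "Combine the following frame-level captions into one dense description "
    "of the activity:\n" + "\n".join(frame_captions)
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,
)
print(response.choices[0].message.content)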

Alternatively, you can access our TRAINING_DATA here if you want to skip the above process. We provide both JSONs: the final JSON used for training is instruction_converted_training_data.json, or you can follow the scripts to convert NTU_QA.json to instruction data yourself.

You can adapt the above process to curate your own ADL dataset from any ADL data; just create your own action combinations like those in Step 2.

It is important to note that we preprocessed our data with person-centric cropping using poses.

In our paper, we highlight why person-centric cropping is necessary for ADL instruction data curation.
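
As an illustration of what person-centric cropping means in practice, the sketch below crops a frame to a padded bounding box around 2D pose keypoints. The keypoint format, padding, and joint count are assumptions for illustration; the repository's own preprocessing may differ.

import numpy as np

def person_centric_crop(frame, keypoints, pad=0.2):
    # Crop a frame around 2D pose keypoints with relative padding.
    # frame: (H, W, 3) image array; keypoints: (num_joints, 2) array of (x, y).
    h, w = frame.shape[:2]
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    pad_x = pad * (x_max - x_min)
    pad_y = pad * (y_max - y_min)
    x0 = int(max(0, x_min - pad_x))
    y0 = int(max(0, y_min - pad_y))
    x1 = int(min(w, x_max + pad_x))
    y1 = int(min(h, y_max + pad_y))
    return frame[y0:y1, x0:x1]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)               # dummy frame
keypoints = np.random.uniform([600, 200], [900, 900], (25, 2))  # e.g. 25 NTU-style joints
cropped = person_centric_crop(frame, keypoints)
print(cropped.shape)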


Quantitative Evaluation 🧪

We introduce two new evaluations for ADL-centric tasks -- ADLMCQ-AR & ADLMCQ-AF -- which are MCQ benchmarks for action recognition and action forecasting. We also release SmartHome Untrimmed Descriptions for the first time.

Step 1: Download all the datasets -- Charades, LEMMA (we use the exo-view), and SMARTHOME UNTRIMMED and TRIMMED.

Step 2: For action forecasting, access the JSON files and slice the videos from the start frame to the end frame. For action recognition, nothing is needed.

Step 3: Arrange the data as in the provided JSON file and run the command:

cd llavidal/eval/
python run_inference_action_recognition_charades.py \
  --video_dir /path/to/videos \
  --qa_file /path/to/qa_file.json \
  --output_dir /path/to/output \
  --output_name results \
  --model-name <LLAVA model path> \
  --conv-mode llavidal_v1 \
  --projection_path <path to LLAVIDAL WEIGHTS> 

Step 4: Evaluate using the GPT-3.5 Turbo API:

cd quantitative_evaluation/
python evaluate_action_recognition_charades.py

and pass in the results generated in Step 3.

The same steps apply to the other evaluation tasks and datasets.


For video descriptions on Charades, run:

cd llavidal/eval
python run_inference_benchmark_general.py

Pass the appropriate paths to get the results JSON.

For video descriptions on Smarthome Untrimmed, slice the videos into 1-minute clips and generate dense descriptions as in the data curation process.
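
The 1-minute slicing can be done with ffmpeg; a minimal sketch is below. It assumes ffmpeg is on the PATH and that the input/output directory names are placeholders; stream-copy segmenting cuts on keyframes, so clip boundaries are approximate.

import subprocess
from pathlib import Path

video_dir = Path("smarthome_untrimmed")   # hypothetical input directory
out_dir = Path("smarthome_1min_clips")
out_dir.mkdir(exist_ok=True)

for video in video_dir.glob("*.mp4"):
    # Split each untrimmed video into ~60-second segments without re-encoding
    subprocess.run([
        "ffmpeg", "-i", str(video),
        "-c", "copy", "-map", "0",
        "-f", "segment", "-segment_time", "60",
        "-reset_timestamps", "1",
        str(out_dir / f"{video.stem}_%03d.mp4"),
    ], check=True)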

To get individual descriptions, run:

cd llavidal/eval
python run_inference_descriptions_smarthome.py

We closely follow the MEMENTOS EVALUATION to obtain the object and action F1 scores.

We provide a notebook that executes the above approach. Follow this notebook to run the evaluation:

jupyter notebook quantitative_evaluation/mementos_evaluation.ipynb
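
For reference, the Mementos-style object/action F1 reduces to set precision and recall over keywords extracted from the generated and ground-truth descriptions. The sketch below assumes the keyword lists have already been extracted (the notebook handles that part); the example lists are hypothetical.

def keyword_f1(predicted, reference):
    # F1 between two keyword sets (e.g. objects or actions).
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred_objects = ["cup", "table", "kettle"]
gt_objects = ["cup", "kettle", "stove"]
print(round(keyword_f1(pred_objects, gt_objects), 3))   # 0.667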

Qualitative Analysis 🎬

Qualitative Evaluation


Acknowledgements 🙏

  • LLaMA: Great step towards bridging vision and language!
  • Video-ChatGPT: We thank the authors for their foundational work.
  • LLaVA: For inspiring the overall architecture.
  • CogVLM: For creating a strong captioning model.

If you're using LLAVIDAL in your research or applications, please cite using this BibTeX:

@misc{chakraborty2024llavidal,
      title={LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living}, 
      author={Rajatsubhra Chakraborty and Arkaprava Sinha and Dominick Reilly and Manish Kumar Govind and Pu Wang and Francois Bremond and Srijan Das},
      year={2024},
      eprint={2406.09390},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Usage License:

The dataset is protected under the CC-BY license of Creative Commons, which allows users to distribute, remix, adapt, and build upon the material in any medium or format, as long as the creator is attributed. The license permits commercial use of ADL-X. As the authors of this manuscript and collectors of this dataset, we reserve the right to distribute the data.