
Affordance Grounding from Demonstration Video to Target Image

This repository is the official implementation of Affordance Grounding from Demonstration Video to Target Image:

@inproceedings{afformer,
  author  = {Joya Chen and Difei Gao and Kevin Qinghong Lin and Mike Zheng Shou},
  title   = {Affordance Grounding from Demonstration Video to Target Image},
  booktitle = {CVPR},
  year    = {2023},
}

Install

1. PyTorch

We now support PyTorch 2.0. Other versions should also work.

conda install -y pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia

NOTE: If you want to use PyTorch 2.0, you should install CUDA >= 11.7. See https://pytorch.org/.
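After installing, you can quickly check that PyTorch sees CUDA (a generic sanity check, not specific to this repo):

import torch
# Should print your PyTorch version and True if a CUDA GPU is visible.
print(torch.__version__, torch.cuda.is_available())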

2. PyTorch Lightning

We use PyTorch Lightning 2.0 as the training and inference engine.

pip install lightning jsonargparse[signatures] --upgrade

3. xFormers

We use memory-efficient attention from xformers. PyTorch 2.0 currently does not support memory-efficient attention with a relative positional encoding (see pytorch/issues/96099). We will update this repo once PyTorch supports it.

pip install triton --upgrade
pip install --pre xformers
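For reference, the xformers kernel accepts an additive attention bias, which is how a relative positional encoding can be folded into memory-efficient attention. The snippet below is a minimal sketch with made-up shapes, not code from this repo; it assumes a CUDA GPU and half precision:

import torch
import xformers.ops as xops

# xformers expects a (batch, seq_len, num_heads, head_dim) layout.
B, L, H, D = 2, 128, 8, 64
q = torch.randn(B, L, H, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, L, H, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, L, H, D, device="cuda", dtype=torch.float16)

# Additive bias of shape (batch, num_heads, seq_len, seq_len); in practice this
# would hold the relative positional encoding values.
rel_pos_bias = torch.zeros(B, H, L, L, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(q, k, v, attn_bias=rel_pos_bias)  # (B, L, H, D)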

4. Timm, Detectron2, Others

We borrow some implementations from timm and detectron2.

pip install timm opencv-python av imageio --upgrade
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
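For context, a timm backbone is typically created in feature-extraction mode so that multi-scale feature maps can feed an FPN-style decoder. The snippet below is a generic timm example, not the exact configuration used in this repo:

import timm
import torch

# ResNet-50 returning intermediate feature maps at multiple strides, the usual
# input to an FPN (set pretrained=True to load ImageNet weights).
backbone = timm.create_model("resnet50", features_only=True, pretrained=False)
features = backbone(torch.randn(1, 3, 224, 224))
print([f.shape for f in features])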

Dataset

Please organize the OPRA dataset under datasets/ with the following structure:

datasets
└── opra
    ├── annotations
    │   ├── test.json
    │   ├── train.json
    ├── clips
    │   ├── aocom
    │   ├── appliances
    │   ├── bestkitchenreview
    │   ├── cooking
    │   ├── eguru
    │   └── seattle
    └── images
        ├── aocom
        ├── appliances
        ├── bestkitchenreview
        ├── cooking
        ├── eguru
        └── seattle
  • We are working on organizing EPIC-Hotspot and AssistQ Buttons. They will be released as soon as possible.
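Before training, you can sanity-check the layout above with a small script (a helper sketch, not part of the repo):

from pathlib import Path

root = Path("datasets/opra")
# The annotation files and top-level media folders expected by the configs.
for p in [root / "annotations" / "train.json",
          root / "annotations" / "test.json",
          root / "clips",
          root / "images"]:
    print(p, "OK" if p.exists() else "MISSING")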

Afformer Model

Hint: If this is your first time using LightningCLI, we recommend reading its documentation; it will help you make better use of these commands.
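For reference, a LightningCLI entry point is usually wired roughly as below, which is why flags such as --trainer.devices and --data.batch_size_per_gpu map directly onto Trainer and DataModule arguments. This is a generic sketch, not necessarily this repo's main.py:

from lightning.pytorch.cli import LightningCLI
from lightning.pytorch import LightningDataModule, LightningModule

def main():
    # subclass_mode_* lets the YAML passed via --config choose which
    # LightningModule / LightningDataModule subclasses to instantiate;
    # any --trainer.*, --model.*, or --data.* flag overrides the config.
    LightningCLI(
        LightningModule,
        LightningDataModule,
        subclass_mode_model=True,
        subclass_mode_data=True,
    )

if __name__ == "__main__":
    main()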

1. ResNet-50-FPN encoder

python main.py fit --config configs/opra/r50fpn.yaml --trainer.devices 8 --data.batch_size_per_gpu 2
  • Training logs are saved in outputs/ by default. You can launch TensorBoard to monitor this folder:
tensorboard --logdir outputs/ --port 2333
# Then you can see real-time losses and metrics at http://localhost:2333/
  • Evaluation runs every 1k iterations during training. You can also evaluate with the validate command. For example,
python main.py validate --config configs/opra/r50fpn.yaml --trainer.devices 8 --data.batch_size_per_gpu 2 --ckpt outputs/opra/r50fpn/lightning_logs/version_0/checkpoints/xxxx.ckpt

2. ViTDet-B encoder

  • Download the ViTDet-B-COCO weights and put them in the weights/ folder: weights/mask_rcnn_vitdet_b_coco.pkl.

  • Train Afformer with the ViTDet-B encoder:

python main.py fit --config configs/opra/vitdet_b.yaml --trainer.devices 8 --data.batch_size_per_gpu 2
  • Training logs are saved in outputs/ by default. You can launch TensorBoard to monitor this folder:
tensorboard --logdir outputs/ --port 2333
# Then you can see real-time losses and metrics at http://localhost:2333/
  • Evaluation runs every 1k iterations during training. You can also evaluate with the validate command. For example,
python main.py validate --config configs/opra/vitdet_b.yaml --trainer.devices 8 --data.batch_size_per_gpu 2 --ckpt outputs/opra/vitdet_b/lightning_logs/version_0/checkpoints/xxxx.ckpt

3. Visualization

python demo.py --config configs/opra/vitdet_b.yaml --weight weights/afformer_vitdet_b_v1.ckpt --video demo/video.mp4 --image demo/image.jpg --output demo/output.gif

  • Hint: we carefully fine-tuned a very strong ViTDet model, which performs better than the results reported in the paper. Download it.

MaskAHand Pre-training

NOTE: A detailed tutorial will be added as soon as possible.

1. Hand Interaction Detection

  • Download our trained hand interaction detector weights from this URL, then put them in the weights/ folder: weights/hircnn_r50fpnv2_849.pth.

  • A video demo of this hand interaction detector:

  • Hint: we trained this simple and accurate hand interaction detector on 100DOH plus some egocentric datasets. It achieves 84.9 hand+interaction detection AP on the 100DOH test set. For MaskAHand pre-training, these weights are sufficient. We will release the full source code at chenjoya/hircnn as soon as possible.
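The checkpoint name (hircnn_r50fpnv2_849.pth) suggests a Faster R-CNN detector with a ResNet-50-FPN-v2 backbone. The sketch below shows how such a torchvision detector could be loaded and run on a frame; the architecture, class count, and checkpoint format are assumptions for illustration, not the repo's actual hircnn code:

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2

# Hypothetical classes: background, hand, hand-in-contact (the real head may differ).
model = fasterrcnn_resnet50_fpn_v2(weights=None, num_classes=3)

# Assumed checkpoint format: a raw state_dict or a {"model": state_dict} wrapper.
ckpt = torch.load("weights/hircnn_r50fpnv2_849.pth", map_location="cpu")
model.load_state_dict(ckpt.get("model", ckpt), strict=False)
model.eval()

frame = torch.rand(3, 480, 640)      # one RGB frame scaled to [0, 1]
with torch.no_grad():
    pred = model([frame])[0]         # dict with "boxes", "labels", "scores"
print(pred["boxes"].shape, pred["scores"][:5])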

2. Hand Interaction Clip Mining

  • Make sure your data preparation follows the Dataset section above.

  • Run affominer/miner.py. The generated data will be saved to affominer/outputs.

3. Target Image Synthesis and Transformation

This is done on the fly during training. You can set the hyper-parameters in configs/opra/maskahand/pretrain.yaml:

mask_ratio: 1.0
num_masks: 2
distortion_scale: 0.5
num_frames: 32
clip_interval: 16
contact_threshold: 0.99
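Roughly speaking, these hyper-parameters control how the target image is synthesized from a video frame: detected hand-interaction regions are masked out and the frame is warped with a random perspective transform to mimic a viewpoint change. The helper below (a sketch under those assumptions, with a made-up synthesize_target name, not the repo's implementation) illustrates the idea:

import torch
from torchvision.transforms import RandomPerspective

def synthesize_target(frame, hand_boxes, num_masks=2, mask_ratio=1.0, distortion_scale=0.5):
    # frame: (3, H, W) float tensor in [0, 1]; hand_boxes: list of (x1, y1, x2, y2).
    img = frame.clone()
    for x1, y1, x2, y2 in hand_boxes[:num_masks]:
        # Mask out a mask_ratio-sized portion of each hand-interaction box.
        w, h = int((x2 - x1) * mask_ratio), int((y2 - y1) * mask_ratio)
        img[:, y1:y1 + h, x1:x1 + w] = 0.0
    # Simulate the viewpoint gap between the demonstration frame and the target image.
    return RandomPerspective(distortion_scale=distortion_scale, p=1.0)(img)

target = synthesize_target(torch.rand(3, 256, 256), [(40, 60, 120, 160)])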

4. MaskAHand Pre-training

python main.py fit --config configs/opra/maskahand/pretrain.yaml

5. Fine-tuning or Zero-shot Evaluation

  • Fine-tune the MaskAHand pre-trained weights with
python main.py fit --config configs/opra/maskahand/finetune.yaml 
  • Zero-shot evaluate the MaskAHand pre-trained weights with
python main.py validate --config configs/opra/maskahand/pretrain.yaml

6. Visualization

You can refer to demo.py to visualize your model results.
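If you prefer to write your own visualization, a common recipe is to overlay the predicted affordance heatmap on the target image with OpenCV. A small sketch, assuming a heatmap in [0, 1] with the same spatial size as the image:

import cv2
import numpy as np

def overlay_heatmap(image_bgr, heatmap, alpha=0.5):
    # image_bgr: (H, W, 3) uint8; heatmap: (H, W) float in [0, 1].
    colored = cv2.applyColorMap((heatmap * 255).astype(np.uint8), cv2.COLORMAP_JET)
    return cv2.addWeighted(colored, alpha, image_bgr, 1 - alpha, 0)

image = cv2.imread("demo/image.jpg")
heatmap = np.zeros(image.shape[:2], dtype=np.float32)  # replace with your model's output
cv2.imwrite("demo/overlay.jpg", overlay_heatmap(image, heatmap))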

Contact

This repository is developed by Joya Chen. Questions and discussions are welcome via joyachen@u.nus.edu.

Acknowledgement

Thanks to all co-authors of the paper, Difei Gao, Kevin Qinghong Lin, and Mike Shou (my supervisor). Also appreciate the assistance from Dongxing Mao and Jiawei Liu.
