Exploring Transformer and Multi Label Classification for Remote Sensing Image Captioning

Installation

The program requires the following dependencies:

pytorch
fairseq 0.9.0
CUDA (for using GPU)
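
Before running anything, it can help to confirm that the environment matches the list above. The following is a minimal sketch, not part of the repository: the function name check_dependencies is hypothetical, and it assumes the packages are importable as torch and fairseq.

```python
import importlib.util
from importlib import metadata


def check_dependencies(expected_fairseq="0.9.0"):
    """Report which of the README's dependencies are present (hypothetical helper)."""
    report = {}
    for name in ("torch", "fairseq"):
        # find_spec returns None when the top-level package cannot be found
        if importlib.util.find_spec(name) is None:
            report[name] = None
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # importable, but no installed distribution metadata under this name
            report[name] = "unknown"
    report["fairseq_matches"] = report["fairseq"] == expected_fairseq
    report["cuda_available"] = False
    if report["torch"] is not None:
        import torch  # imported lazily so the check also runs without PyTorch

        report["cuda_available"] = torch.cuda.is_available()
    return report
```

Calling check_dependencies() returns a dictionary; a None entry means the package is missing, and fairseq_matches is False when the installed fairseq is not 0.9.0.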

Setup

We use the COCO Caption Evaluation library, which relies on the Stanford CoreNLP 3.6.0 toolset. Download the models and add the library to PYTHONPATH:

cd external/coco-caption
./get_stanford_models.sh
export PYTHONPATH=./external/coco-caption

Pre-processing

Pre-process the UC Merced images and captions:

./preprocess_captions.sh uc-merced
./preprocess_images.sh uc-merced

Note

Add the files from the fairseq directory of this repository to your fairseq 0.9.0 installation, replacing any files that already exist there.

Training

The hyperparameters below are only an example and need to be tuned.

python -m fairseq_cli.train \
  --save-dir .checkpoints \
  --user-dir task \
  --task captioning \
  --arch default-captioning-arch \
  --encoder-layers 3 \
  --decoder-layers 6 \
  --features obj \
  --feature-spatial-encoding \
  --optimizer adam \
  --adam-betas "(0.9,0.999)" \
  --lr 0.0003 \
  --lr-scheduler inverse_sqrt \
  --min-lr 1e-09 \
  --warmup-init-lr 1e-8 \
  --warmup-updates 8000 \
  --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 \
  --weight-decay 0.0001 \
  --dropout 0.3 \
  --max-epoch 25 \
  --max-tokens 4096 \
  --max-source-positions 100 \
  --encoder-embed-dim 512 \
  --num-workers 2
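
To make the learning-rate flags above more concrete, the sketch below reproduces the shape of fairseq's inverse_sqrt schedule in plain Python: a linear warmup from --warmup-init-lr to --lr over --warmup-updates steps, followed by decay proportional to the inverse square root of the update number. The function name inverse_sqrt_lr is hypothetical, and the defaults simply mirror the example command.

```python
import math


def inverse_sqrt_lr(step, peak_lr=3e-4, warmup_init_lr=1e-8, warmup_updates=8000):
    """Learning rate at a given update under an inverse-sqrt schedule (illustrative)."""
    if step < warmup_updates:
        # linear warmup from warmup_init_lr towards peak_lr
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # after warmup: peak_lr * sqrt(warmup_updates) / sqrt(step)
    return peak_lr * math.sqrt(warmup_updates) / math.sqrt(step)


# Print the rate at a few update counts to see warmup and decay
for step in (0, 4000, 8000, 32000):
    print(step, inverse_sqrt_lr(step))
```

With these values the rate peaks at update 8000 and has halved by update 32000, which is worth keeping in mind when changing --warmup-updates or --max-epoch.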

Evaluation

Generate

To generate captions for the images in the test split:

python generate.py \
  --user-dir task \
  --features grid \
  --tokenizer moses \
  --bpe subword_nmt \
  --bpe-codes output/codes.txt \
  --beam 5 \
  --split test \
  --path .checkpoints-scst/checkpoint24.pt \
  --input output/test-ids.txt \
  --output output/test-predictions.json \
  --output_l output/test-labels-preds.csv
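
The scoring step passes the predictions file to coco-caption, whose result files are a JSON list of objects with image_id and caption fields. Assuming generate.py writes output/test-predictions.json in that layout (the README does not show the file), a quick sanity check before scoring could look like the sketch below. The function name summarize_predictions is hypothetical.

```python
import json


def summarize_predictions(path):
    """Summarise a COCO-style results file: [{"image_id": ..., "caption": ...}, ...]."""
    with open(path, encoding="utf-8") as f:
        predictions = json.load(f)
    # caption length in whitespace-separated tokens
    lengths = [len(p["caption"].split()) for p in predictions]
    return {
        "num_captions": len(predictions),
        "num_images": len({p["image_id"] for p in predictions}),
        "mean_length": sum(lengths) / len(lengths) if lengths else 0.0,
    }
```

If num_captions and num_images differ, some image has more than one predicted caption, which is worth investigating before computing metrics.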

Scoring

The following example calculates metrics for captions contained in output/test-predictions.json.

./score.sh \
  --reference-captions external/coco-caption/annotations/captions_val2014.json \
  --system-captions output/test-predictions.json

The following example calculates metrics for labels contained in output/test-labels-preds.csv.

python score_label.py \
  --reference-captions output/label_preds.csv \
  --system-captions output/test-labels-preds.csv
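
The README does not state which metrics score_label.py reports or how the CSV files are laid out. Purely as an illustration of how multi-label predictions are commonly scored, the sketch below computes micro-averaged precision, recall and F1 over per-image label sets. The function name micro_prf and the label names in the usage example are hypothetical.

```python
def micro_prf(reference, predicted):
    """Micro-averaged precision, recall and F1 over per-image label sets."""
    tp = fp = fn = 0
    for ref, pred in zip(reference, predicted):
        tp += len(ref & pred)   # labels predicted and present in the reference
        fp += len(pred - ref)   # labels predicted but absent from the reference
        fn += len(ref - pred)   # reference labels the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels for two images
reference = [{"trees", "grass"}, {"cars"}]
predicted = [{"trees"}, {"cars", "road"}]
print(micro_prf(reference, predicted))
```

Micro-averaging pools the counts over all images and labels, so frequent labels dominate the score; a macro average over labels would weight rare labels more heavily.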

Model

The trained multi-task model for image captioning with multi-label classification can be downloaded from here

Results

Image	Caption
	Ground truth Caption: This is a part of a golf course with green turfs and some bunkers and trees .
	Caption w/o multi-label: green turfs and some bunkers and withered trees in the golf course.
	Caption with multi-label: this is a part of a golf course with green turfs and some bunkers and trees.

	Ground truth Caption: There are two tennis courts arranged neatly and surrounded by some plants .
	Caption w/o multi-label: four tennis courts arranged neatly with some plants surrounded.
	Caption with multi-label: there are two tennis courts arranged neatly and surrounded by some plants.

	Ground truth Caption: Two straight freeways parallel forward with some cars on them .
	Caption w/o multi-label: some cars are on the freeways.
	Caption with multi-label: two straight freeways closed to each other with some cars on them.

	Ground truth Caption: Two airplanes are stopped at the airport .
	Caption w/o multi-label: an airplane is stopped at the airport.
	Caption with multi-label: two airplanes are stopped at the airport.

	Ground truth Caption: Many mobile homes are closed to each other with some cars parked at the roadside in the mobile home park .
	Caption w/o multi-label: lots of mobile homes with plants surrounded in the mobile home park.
	Caption with multi-label: many houses arranged neatly with plants surrounded in the medium residential area.

	Ground truth Caption: An intersection with a road cross over the other roads .
	Caption w/o multi-label: an overpass go across the roads diagonally with lawn surounded.
	Caption with multi-label: an overpass with a road go across another roads diagonally with some cars on the roads.

Results from other models

Image	Caption
	Ground truth Caption: This is a part of a golf course with green turfs and some bunkers and trees .
	Caption with angle prediction: a part of a golf course with green turfs and some bunkers and a trail cross the turfs.
	Caption with reconstruction: this is a part of a golf course with green turfs and some trees.

	Ground truth Caption: There are two tennis courts arranged neatly and surrounded by some plants .
	Caption with angle prediction: there are six tennis courts arranged neatly and surrounded by some buildings.
	Caption with reconstruction: this is a sparse residential area with a villa surrounded by trees.

	Ground truth Caption: Two straight freeways parallel forward with some cars on them .
	Caption with angle prediction: two straight freeways with some cars on them.
	Caption with reconstruction: an overpass with a road go across another roads diagonally with some cars on the roads.

	Ground truth Caption: Two airplanes are stopped at the airport .
	Caption with angle prediction: it is a purple airplane stopped at the airport.
	Caption with reconstruction: an airplane is stopped at the airport and the ground is dark.

	Ground truth Caption: Many mobile homes are closed to each other with some cars parked at the roadside in the mobile home park .
	Caption with angle prediction: many houses arranged in lines in the dense residential area.
	Caption with reconstruction: lots of mobile homes with plants surrounded in the mobile home park.

	Ground truth Caption: An intersection with a road cross over the other roads .
	Caption with angle prediction: an overpass go across the roads with some cars on the roads.
	Caption with reconstruction: an overpass with a road go across another roads diagonally with some cars on it.

Reference

The codebase is inspired by https://github.com/krasserm/fairseq-image-captioning

If you find this code useful for your research, please cite our paper:

@article{kandala2022exploring,
  title={Exploring Transformer and multi-label classification for remote sensing image captioning},
  author={Kandala, Hitesh and Saha, Sudipan and Banerjee, Biplab and Zhu, Xiao Xiang},
  journal={IEEE Geoscience and Remote Sensing Letters},
  year={2022},
  publisher={IEEE}
}
