
TextKG

Text with Knowledge Graph Augmented Transformer for Video Captioning

[Paper]

Official code for Text with Knowledge Graph Augmented Transformer for Video Captioning.

Xin Gu, Guang Chen, Yufei Wang, Libo Zhang, Tiejian Luo, Longyin Wen

Accepted by CVPR 2023

Introduction

Existing video captioning methods generally suffer from the long-tail word problem. We present TextKG, a knowledge graph (KG) augmented transformer for video captioning that integrates external knowledge and exploits multi-modal information in videos to address the challenge of long-tail words.

Approach

Knowledge Graphs Construction

  • The general knowledge graph (G-KG) is designed to cover the key information in the general scenarios of interest, such as cooking and activities. It is built from the publicly available large-scale knowledge graph ConceptNet by extracting keywords together with their connected edges and neighboring nodes.
  • The specific knowledge graph (S-KG) is built to cover key information in specific scenarios. We extract speech transcripts from videos using an automatic speech recognition (ASR) model, then gather phrases matching patterns such as "adjective + noun", "noun + noun", and "adverb + verb" to construct the S-KG (see the sketch below).
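
For illustration, here is a minimal Python sketch of the S-KG phrase-mining step. The phrase patterns follow the paper; the choice of NLTK for tokenization and part-of-speech tagging and the "co-occurs-with" relation label are our assumptions, not the authors' released pipeline.

# Hypothetical sketch of S-KG phrase mining from one ASR transcript.
# The NLTK tagger and the "co-occurs-with" label are illustrative assumptions.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Penn Treebank tag prefixes for the three phrase patterns named above.
PATTERNS = {
    ("JJ", "NN"),  # adjective + noun, e.g. "fresh garlic"
    ("NN", "NN"),  # noun + noun,      e.g. "tomato sauce"
    ("RB", "VB"),  # adverb + verb,    e.g. "gently stir"
}

def extract_skg_triples(transcript):
    """Return (head, relation, tail) triples mined from an ASR transcript."""
    tagged = nltk.pos_tag(nltk.word_tokenize(transcript.lower()))
    triples = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        # Match on 2-character tag prefixes so plural nouns (NNS),
        # past-tense verbs (VBD), etc. also count.
        if (t1[:2], t2[:2]) in PATTERNS:
            triples.append((w1, "co-occurs-with", w2))
    return triples

print(extract_skg_triples("Gently stir the fresh tomato sauce."))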

Two-Stream Transformer

Our approach comprises an external stream that utilizes external knowledge and an internal stream that leverages the multi-modal information from the video.
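
The following is a minimal PyTorch sketch of a single two-stream layer, assuming the two streams exchange information through cross-attention. The dimensions, the sharing mechanism, and the layer structure are illustrative assumptions, not the released TextKG implementation.

# Illustrative sketch of one two-stream layer. Cross-attention sharing,
# dimensions, and structure are assumptions made for this example only.
import torch
import torch.nn as nn

class TwoStreamLayer(nn.Module):
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        # One self-attention block per stream.
        self.self_int = nn.MultiheadAttention(d_model, nhead)
        self.self_ext = nn.MultiheadAttention(d_model, nhead)
        # Cross-attention lets each stream read the other's features.
        self.cross_int = nn.MultiheadAttention(d_model, nhead)
        self.cross_ext = nn.MultiheadAttention(d_model, nhead)
        self.norm_int = nn.LayerNorm(d_model)
        self.norm_ext = nn.LayerNorm(d_model)

    def forward(self, internal, external):
        # internal: multimodal video tokens, shape (seq, batch, d_model)
        # external: knowledge/text tokens, same layout
        h_int, _ = self.self_int(internal, internal, internal)
        h_ext, _ = self.self_ext(external, external, external)
        # Each stream queries the other stream's representation.
        c_int, _ = self.cross_int(h_int, h_ext, h_ext)
        c_ext, _ = self.cross_ext(h_ext, h_int, h_int)
        return self.norm_int(h_int + c_int), self.norm_ext(h_ext + c_ext)

video_tokens = torch.randn(100, 2, 512)  # (frames/regions, batch, dim)
kg_tokens = torch.randn(50, 2, 512)      # (retrieved KG tokens, batch, dim)
v_out, k_out = TwoStreamLayer()(video_tokens, kg_tokens)
print(v_out.shape, k_out.shape)  # (100, 2, 512) and (50, 2, 512)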

Architecture

Figure 1. TextKG network architecture.

Usage

Our proposed TextKG is implemented with PyTorch.

Environment

  • Python = 3.7
  • PyTorch = 1.4
  • pycocoevalcap

1. Installation

  • Clone this repo:
git clone https://github.com/GX77/TextKG.git
cd TextKG

2. Download datasets

Training & Testing

YouCookII

# Training
python3 train.py --res_root_dir YOUR_DIR --dset_name yc2

# Test
python3 translate.py --res_dir YOUR_DIR

We will add other datasets later.

Citation

If our research and this repository are helpful to your work, please consider citing:

@InProceedings{Gu_2023_CVPR,
    author    = {Gu, Xin and Chen, Guang and Wang, Yufei and Zhang, Libo and Luo, Tiejian and Wen, Longyin},
    title     = {Text With Knowledge Graph Augmented Transformer for Video Captioning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {18941-18951}
}

Acknowledgement

The code for the decoding part is based on MART.
