We provide a collection of baselines for two tasks: the high-level captioning task, which consists of generating scene, action, and rationale descriptions for an image, and the narrative caption generation task on the HL-Narrative Dataset, an extension of the HL Dataset in which narrative captions are generated from the three axes. All models are released on the Hugging Face Hub 🤗.
| Model | CIDEr | SacreBLEU | ROUGE-L |
|---|---|---|---|
| GIT-base | 103.00 | 24.67 | 33.90 |
| BLIP-base | 116.00 | 26.46 | 35.30 |
| ClipCap (LM+Mapping) | 145.00 | 36.73 | 42.83 |
| Model | CIDEr | SacreBLEU | ROUGE-L |
|---|---|---|---|
| GIT-base | 110.63 | 15.21 | 30.43 |
| BLIP-base | 123.07 | 17.16 | 32.16 |
| ClipCap (LM+Mapping) | 176.54 | 27.37 | 39.15 |
| Model | CIDEr | SacreBLEU | ROUGE-L |
|---|---|---|---|
| GIT-base | 42.58 | 5.90 | 18.57 |
| BLIP-base | 46.11 | 6.21 | 19.74 |
| ClipCap (LM+Mapping) | 78.04 | 11.71 | 25.76 |
| Model | CIDEr | SacreBLEU | ROUGE-L |
|---|---|---|---|
| GIT-base | 75.78 | 11.11 | 27.61 |
| BLIP-base | 79.39 | 11.70 | 26.17 |
| ClipCap (LM+Mapping) | 63.91 | 8.15 | 24.53 |
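The ROUGE-L column above is the LCS-based F-measure between a generated caption and a reference. As an illustration only (the reported scores come from standard evaluation toolkits, not this snippet), here is a minimal sketch of how that metric is computed over whitespace tokens:

```python
def rouge_l(reference: str, hypothesis: str) -> float:
    """LCS-based ROUGE-L F-measure over whitespace-tokenized captions."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the longest common subsequence length.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == h else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: LCS is "a man a bike" (4 tokens), so P = 4/5, R = 4/6.
print(round(rouge_l("a man is riding a bike", "a man rides a bike"), 4))  # → 0.7273
```

Note that the tables report ROUGE-L scaled to 0-100, and that SacreBLEU and CIDEr are typically computed with the `sacrebleu` and `pycocoevalcap` packages, respectively.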