Results

Preamble

Our code uses automatic mixed precision (AMP) by default; its effect on the output is negligible. All speeds reported in the paper were recorded with AMP turned off (--benchmark). Due to refactoring, the outputs produced by this code base may differ slightly from the precomputed results and from the results reported in the paper. This difference rarely changes the least significant figure (i.e., 0.1).

For the most complete results, please see the paper (and the appendix)!

All available precomputed results can be found [here].

Pretrained models

We provide four pretrained models for download:

  1. XMem.pth (Default)
  2. XMem-s012.pth (Trained with BL30K)
  3. XMem-s2.pth (No pretraining on static images)
  4. XMem-no-sensory (No sensory memory)

The model without pretraining is for reference. The model without sensory memory might be more suitable for tasks without spatial continuity, like mask tracking in a multi-camera 3D reconstruction setting, though I would encourage you to try the base model as well.

Download them from [GitHub] or [Google Drive].

Long-Time Video

[Precomputed Results]

Long-Time Video (1X)

| Model | J&F | J | F |
| --- | --- | --- | --- |
| XMem | 89.8±0.2 | 88.0±0.2 | 91.6±0.2 |

Long-Time Video (3X)

| Model | J&F | J | F |
| --- | --- | --- | --- |
| XMem | 90.0±0.4 | 88.2±0.3 | 91.8±0.4 |

DAVIS

[Precomputed Results]

DAVIS 2016

| Model | J&F | J | F | FPS | FPS (AMP) |
| --- | --- | --- | --- | --- | --- |
| XMem | 91.5 | 90.4 | 92.7 | 29.6 | 40.3 |
| XMem-s012 | 92.0 | 90.7 | 93.2 | 29.6 | 40.3 |
| XMem-s2 | 90.8 | 89.6 | 91.9 | 29.6 | 40.3 |
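To put the two speed columns in perspective, the ratio of FPS (AMP) to FPS gives the speedup from enabling AMP. The sketch below reads the FPS column as "AMP off", as the preamble states, and uses the XMem figures from the tables on this page; the `fps` dictionary and the `amp_speedup` helper are illustrative names, not part of this code base.

```python
# Speedup from AMP, computed from the XMem rows of the result tables.
# Each entry is (FPS with AMP off, FPS with AMP on); names are illustrative.
fps = {
    "DAVIS 2016": (29.6, 40.3),
    "DAVIS 2017 validation": (22.6, 33.9),
    "YouTubeVOS 2018 validation": (22.6, 31.7),
}


def amp_speedup(fps_off, fps_amp):
    """Return how many times faster inference runs with AMP enabled."""
    return fps_amp / fps_off


for dataset, (fps_off, fps_amp) in fps.items():
    print(f"{dataset}: {amp_speedup(fps_off, fps_amp):.2f}x")
```

These ratios only describe the hardware and settings used for the reported measurements; other GPUs may behave differently.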

DAVIS 2017 validation

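| Model | J&F | J | F | FPS | FPS (AMP) |
| --- | --- | --- | --- | --- | --- |
| XMem | 86.2 | 82.9 | 89.5 | 22.6 | 33.9 |
| XMem-s012 | 87.7 | 84.0 | 91.4 | 22.6 | 33.9 |
| XMem-s2 | 84.5 | 81.4 | 87.6 | 22.6 | 33.9 |
| XMem-no-sensory | 85.1 | - | - | 23.1 | - |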
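When reading the DAVIS tables, it helps to remember that J&F is the mean of the region similarity J and the contour accuracy F, so the first column can be recomputed from the other two. A minimal sketch using the rows above that report all three values (the `rows` dictionary and the `mean_jf` helper are illustrative names, not part of this code base):

```python
# J&F is the arithmetic mean of J and F.
# Each entry is (reported J&F, J, F), copied from the DAVIS 2017 validation table.
rows = {
    "XMem": (86.2, 82.9, 89.5),
    "XMem-s012": (87.7, 84.0, 91.4),
    "XMem-s2": (84.5, 81.4, 87.6),
}


def mean_jf(j, f):
    """Return the J&F score for a given J and F."""
    return (j + f) / 2


for model, (reported, j, f) in rows.items():
    print(f"{model}: reported {reported:.1f}, recomputed {mean_jf(j, f):.2f}")
```

Because the published figures are rounded to one decimal place, a recomputed value can differ from the reported one by up to about 0.1.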

DAVIS 2017 test-dev

| Model | J&F | J | F |
| --- | --- | --- | --- |
| XMem | 81.0 | 77.4 | 84.5 |
| XMem-s012 | 81.2 | 77.6 | 84.7 |
| XMem-s2 | 79.8 | 61.4 | 68.1 |
| XMem-s012 (600p) | 82.5 | 79.1 | 85.8 |

YouTubeVOS

We use all available frames in YouTubeVOS by default. See INFERENCE.md if you want to evaluate with sparse frames for some reason.

[Precomputed Results]

[Precomputed Results (sparse)]

YouTubeVOS 2018 validation

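| Model | G | J-Seen | F-Seen | J-Unseen | F-Unseen | FPS | FPS (AMP) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| XMem | 85.7 | 84.6 | 89.3 | 80.2 | 88.7 | 22.6 | 31.7 |
| XMem-s012 | 86.1 | 85.1 | 89.8 | 80.3 | 89.2 | 22.6 | 31.7 |
| XMem-s2 | 84.3 | 83.9 | 88.8 | 77.7 | 86.7 | 22.6 | 31.7 |
| XMem-no-sensory | 84.4 | - | - | - | - | 23.1 | - |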
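In the YouTubeVOS tables, the overall score G is the average of the four per-category scores (J and F on seen and unseen classes). A minimal sketch that recomputes G from the rows above that report all four values (the `rows` dictionary and the `overall_g` helper are illustrative names, not part of this code base):

```python
# G is the mean of J-Seen, F-Seen, J-Unseen and F-Unseen.
# Each entry is (reported G, J-Seen, F-Seen, J-Unseen, F-Unseen),
# copied from the YouTubeVOS 2018 validation table.
rows = {
    "XMem": (85.7, 84.6, 89.3, 80.2, 88.7),
    "XMem-s012": (86.1, 85.1, 89.8, 80.3, 89.2),
    "XMem-s2": (84.3, 83.9, 88.8, 77.7, 86.7),
}


def overall_g(j_seen, f_seen, j_unseen, f_unseen):
    """Return the overall YouTubeVOS score G."""
    return (j_seen + f_seen + j_unseen + f_unseen) / 4


for model, (reported, *scores) in rows.items():
    print(f"{model}: reported {reported:.1f}, recomputed {overall_g(*scores):.2f}")
```

As the inputs are already rounded to one decimal place, the recomputed G can differ slightly from the reported value.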

YouTubeVOS 2019 validation

| Model | G | J-Seen | F-Seen | J-Unseen | F-Unseen |
| --- | --- | --- | --- | --- | --- |
| XMem | 85.5 | 84.3 | 88.6 | 80.3 | 88.6 |
| XMem-s012 | 85.8 | 84.8 | 89.2 | 80.3 | 88.8 |
| XMem-s2 | 84.2 | 83.8 | 88.3 | 78.1 | 86.7 |

Multi-scale evaluation

Please see the appendix for quantitative results.

[DAVIS-MS Precomputed Results]

[YouTubeVOS-MS Precomputed Results]