
Why use the features of last 4 layers? #81

Open
Kal-Elll opened this issue Feb 7, 2024 · 6 comments

Kal-Elll commented Feb 7, 2024

We noticed that you use the features of the last 4 layers of the encoder instead of intermediate layers (e.g. [5, 12, 18, 24] for vitl) as in some other works such as DINOv2. What's the reason for that, and is there any notable difference between these two strategies?
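
For reference, here is a minimal sketch (not necessarily the exact call either repo uses) of the two selection strategies via DINOv2's get_intermediate_layers(), where n can be either "last n blocks" or an explicit list of block indices; the index list below is just the one above converted to 0-based indexing:

import torch

encoder = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
x = torch.randn(1, 3, 518, 518)  # dummy input, 518 px = 37 patches of 14 px

with torch.no_grad():
    # Strategy A: take the last 4 blocks (20, 21, 22, 23 for vit-l)
    last4 = encoder.get_intermediate_layers(x, n=4, return_class_token=True)
    # Strategy B: take spread-out intermediate blocks,
    # e.g. [5, 12, 18, 24] above -> [4, 11, 17, 23] in 0-based indexing
    spread = encoder.get_intermediate_layers(x, n=[4, 11, 17, 23], return_class_token=True)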

LiheYoung (Owner) commented:

Honestly, it is not an intentional practice. And we appreciate your reminder. Thank you.

heyoeyo commented Feb 13, 2024

I've played around with the 4 image encoder outputs and found that the results are not especially sensitive to throwing away some of the outputs. For example, for vit-large, if you repeat block 20 for all 4 outputs (instead of using 20,21,22,23 as normal), the result tends to look qualitatively similar. The same is true for blocks 21 and 22, though block 23 gives a distorted result. A similar pattern holds for vit-b and vit-s, except block 23 doesn't give distorted results with vit-s, weirdly.

Here's a comparison between different outputs for vit-large, using an increased processing resolution. The top-right is the 'normal' output, while the bottom-left is the result from repeating block 20 for all 4 outputs:

[Image: vit_l_comparisons]

Out of curiosity, I also tried skipping all but the last fusion step (the one that takes the block 20 result as an input), in which case you get the result in the bottom-right. It seems to have more detail than normal, though it's incorrect as a depth map. It may just be missing the low-frequency ramp that would normally make it hard to see the details, which suggests the fusion steps may deal with 'frequency' information?

I'm not sure what to make of it, but it's an interesting result! At the very least, it feels like the model might perform just as well (with a bit of fine tuning) with only 1 or 2 of the outputs instead of all 4, which might speed up inference a bit.

LiheYoung (Owner) commented:

Hi @heyoeyo, thank you for sharing such an interesting observation!

heyoeyo commented Mar 17, 2024

I've done a few more experiments with this and found that the Depth-Anything vit-large model can consistently generate these high-detail outputs by scaling the fusion steps. For example, for vit-l, you can try adding scaling factors on the last two steps, which seem to have the biggest impact:

path_2 = self.scratch.refinenet2(path_3 * 0.15, layer_2_rn, size=layer_1_rn.shape[2:])
path_1 = self.scratch.refinenet1(path_2 * 0.7, layer_1_rn)
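
For context, this is roughly where those two lines sit in the DPT head's fusion stage (same refinenet structure as the snippet quoted later in this thread), with the earlier path_4/path_3 steps left unscaled:

path_4 = self.scratch.refinenet4(layer_4_rn, size=layer_3_rn.shape[2:])
path_3 = self.scratch.refinenet3(path_4, layer_3_rn, size=layer_2_rn.shape[2:])
path_2 = self.scratch.refinenet2(path_3 * 0.15, layer_2_rn, size=layer_1_rn.shape[2:])  # damp the fusion-3 output
path_1 = self.scratch.refinenet1(path_2 * 0.7, layer_1_rn)  # damp the fusion-2 output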

I've set up an interactive demo for this, in case anyone wants to play around with it to see what the fusion layers do (fusion 2 on vit-l has a blurring effect when scaled >1, for example):

[Image: fusescaling_shanghai]

For anyone interested, I have a repo, MuggledDPT, that includes other scripts for interacting with the Depth-Anything (and other DPT) outputs, including taking a webcam input. Not sure if you're still taking community repos @LiheYoung, but you're welcome to add this to the listing if you like; it's mostly meant to be an educational/explainer repo.

LiheYoung (Owner) commented:

Thank you for providing these surprising and interesting observations! I tested more images and made similar observations to yours:

  • Skipping the first three fusion blocks produces sharper predictions, but they are not accurate enough in terms of MDE metrics.
  • Decreasing the fusion weights of path_3 and path_2 at test time also produces sharper predictions.

Btw, when only using the final fusion block, did you use it as:

path_1 = self.scratch.refinenet1(layer_1_rn, layer_1_rn) # replace the original "path_2" input with "layer_1_rn"

Thank you again. I will definitely add your repo MuggledDPT to our repo in our next update.

heyoeyo commented Apr 1, 2024

Btw, when only using the final fusion block, did you use it as...

I can't remember exactly, but I think I did something equivalent to:

# Keep only the first reassembly layer; disable the rest of the feature pyramid
layer_1_rn = self.scratch.layer1_rn(layer_1)
# layer_2_rn = self.scratch.layer2_rn(layer_2)
# layer_3_rn = self.scratch.layer3_rn(layer_3)
# layer_4_rn = self.scratch.layer4_rn(layer_4)

# path_4 = self.scratch.refinenet4(layer_4_rn, size=layer_3_rn.shape[2:])
# path_3 = self.scratch.refinenet3(path_4, layer_3_rn, size=layer_2_rn.shape[2:])
# path_2 = self.scratch.refinenet2(path_3, layer_2_rn, size=layer_1_rn.shape[2:])
# Feed zeros in place of the missing path_2 input to the final fusion block
path_1 = self.scratch.refinenet1(torch.zeros_like(layer_1_rn), layer_1_rn)

(It only 'works' with vit-l; the base and especially the small models are distorted by this.)

The other example 'using block 20 only' was done with a modification equivalent to changing the loop over the image encoder features to something like:

for i, x in enumerate(out_features[0] for _ in range(4)):  # repeat the first encoder output for all 4 positions

This also has odd behavior: repeating index [0], [1], or [2] gives nearly identical (good) results, but repeating index [3] gives a distorted output, at least for vit-l. It seems to suggest that there is something wrong/different with the last layer's output. In case you hadn't seen it, there's a paper, "Vision Transformers Need Registers", that mentions artifacts in the later layers of the dinov2 encoder, which are also evident in the depth-anything models. That might have something to do with these odd behaviors, though vit-l has artifacts starting around blocks 15-17, so I'm not really sure.
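
A slightly more general version of that modification (hypothetical snippet, variable names assumed to match the loop above) that lets you pick which of the 4 encoder outputs gets repeated:

repeat_idx = 0  # for vit-l: 0, 1 or 2 give near-identical results; 3 gives a distorted output
out_features = [out_features[repeat_idx]] * 4

for i, x in enumerate(out_features):
    ...  # rest of the per-feature processing in the DPT head, unchanged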

Thank you again. I will definitely add your repo MuggledDPT to our repo in our next update.

Thanks!
