Question about the inference pipeline #3

Open
bshfang opened this issue Oct 7, 2022 · 5 comments

bshfang commented Oct 7, 2022

Hi,

I evaluated the pretrained model provided in this repo, and the result is similar to that of the StretchBEV-P model in the paper.

However, it seems that ground-truth labels for the history and current timeframes are involved in the evaluation process. In the inference_srvp_generate() function, hx_z contains features generated from the label input.

Under this setting, the comparison with FIERY in the paper seems unfair, and the extra data usage during evaluation doesn't match the caption of Table 1 in the paper:

> ... the two versions of our model with (StretchBEV-P) and without (StretchBEV) using the labels for the output modalities in the posterior while learning the temporal dynamics ...


YTEP-ZHI commented Oct 9, 2022

Same question...

@EdwardLeeLPZ

Thanks for the detailed description. I also found the same problem, but I think the main information leak happens at line 301 in srvp_generate(), which is also called directly by the forward() function.

```python
hx_z = self.inf_z(torch.cat([x, future_inputs], dim=2))  # encodes srvp_encoder's output temporally
y_tm1 = y_0
for t in range(1, total_time):

    # prior distribution, conditioned only on the previous state
    p_z_t_params = self.p_z(y_tm1)
    p_z_params.append(p_z_t_params)

    # posterior distribution, conditioned on hx_z (which sees future_inputs, i.e., the labels)
    z_t, q_z_t_params = self.infer_z(hx_z[:, t])
    q_z_params.append(q_z_t_params)

    if self.training or t < self.receptive_field:
        # observations are available: the posterior sample z_t is kept
        pass
    else:
        assert not self.training
        # prediction phase at inference: resample z_t from the prior
        z_t = model_utils.rsample_normal(p_z_t_params, max_log_sigma=self.max_log_sigma,
                                         min_log_sigma=self.min_log_sigma)
    # Residual step: the new state carries z_t (posterior or prior) forward
    y_t = self.residual_step(y_tm1, z_t)
```

During inference (self.training == False), all time steps before self.receptive_field (t < self.receptive_field) directly use the z_t sampled from the posterior distribution instead of the prior distribution inferred from y_tm1. And this z_t is inferred from hx_z, which contains ground-truth information (obtained from future_inputs). This means that during inference, the model also uses label information.
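To make the dependency explicit, here is a toy rollout (my own construction, not the repo's code) showing that values injected only during the conditioning steps persist into the prediction steps through the recurrent state:

```python
# Toy stand-in for the loop above: z values injected while t < rf mimic
# posterior (label-conditioned) samples; zeros afterwards mimic prior samples.
def rollout(z_conditioning, rf=3, total_time=6):
    y = 0.0  # stands in for the state y_tm1
    states = []
    for t in range(1, total_time):
        z_t = z_conditioning.get(t, 0.0) if t < rf else 0.0
        y = y + z_t  # stands in for residual_step(y, z_t)
        states.append(y)
    return states

print(rollout({1: 10.0, 2: 20.0}))  # [10.0, 30.0, 30.0, 30.0, 30.0]
# every "prediction phase" state still carries the conditioning values
```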

To verify this, I changed the original condition `if self.training or t < self.receptive_field:` (original_implementation) to `if self.training and t < self.receptive_field:` (no_leaking_1) and to `if self.training:` (no_leaking_2) to avoid the information leakage. The comparison results are as follows:
[Image: stretchbev_leaking_comparison — evaluation results for original_implementation, no_leaking_1, and no_leaking_2]

It can be seen that once the information leakage is removed, the performance of StretchBEV-P drops significantly (to much worse than the baseline). Does this further show that the results of StretchBEV-P are not convincing? Or is there something wrong with my modifications and implementation?

kaanakan (Owner) commented Dec 9, 2022

Hello,

As mentioned in the paper, we use the posterior distribution in the conditioning frames (t < receptive_field) to update the state variables. In the prediction phase (t >= receptive_field), we use the prior distribution. In StretchBEV-P, the posterior is sampled from both future state information extracted from images and the GT labels; the prior distribution, on the other hand, is sampled from the current state variable. The evaluation results only cover the prediction phase, which is predicted without the posterior distribution, using only the prior.
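Condensed, the two phases look like this (a minimal sketch with stand-in callables, not the repo's exact code):

```python
def generate(y_0, hx_z, receptive_field, total_time,
             sample_posterior, sample_prior, residual_step):
    """sample_posterior / sample_prior / residual_step stand in for the repo's
    infer_z, p_z + model_utils.rsample_normal, and residual_step."""
    y_tm1 = y_0
    states = []
    for t in range(1, total_time):
        if t < receptive_field:
            # conditioning phase: posterior from image features
            # (and, in StretchBEV-P, from the GT labels)
            z_t = sample_posterior(hx_z[:, t])
        else:
            # prediction phase: prior from the current state only
            z_t = sample_prior(y_tm1)
        y_tm1 = residual_step(y_tm1, z_t)
        states.append(y_tm1)
    return states
```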

You are right: if we use only the prior distribution, the performance drops. However, we proposed StretchBEV for this purpose. In StretchBEV, the posterior distribution only uses future state information extracted from images, not the GT labels. The code does not include that version yet; we will release it as soon as possible.

Feel free to ask if you have further questions.

@EdwardLeeLPZ

Thank you @kaanakan for your response and continued updates.
Unfortunately, your reply did not convince me; it further increased my suspicion.

Doubts about StretchBEV-P:

> ... we use the posterior distribution in the conditioning frames (t < receptive_field) to update the state variables ... In StretchBEV-P, the posterior is sampled from both future state information extracted from images and GT labels.

Personally, I don't think the GT labels should be included in any phase of the inference process, whether in the conditioning frames or the prediction phase. The state variables you mention are carried all the way to the end of the prediction; in other words, the GT information absorbed in the conditioning frames is passed into the prediction-phase computation through the recurrent state. That is exactly where the information leak happens.

In addition, as I understand it, the correct way to use the posterior is this: during training, push the prior distribution produced by the model as close to the posterior as possible; during inference, compute the prior with the well-trained model (at which point the prior is assumed to be close enough to the posterior) without feeding in any posterior input. A minimal sketch of this standard recipe follows below.
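For concreteness, here is a minimal sketch of that standard recipe (with hypothetical module names prior_net and posterior_net; this is not the repo's code):

```python
import torch.distributions as D

def latent_step(y_tm1, future_feat, prior_net, posterior_net, training):
    # prior p(z_t | y_{t-1}): depends only on the previous state
    p_mu, p_log_sigma = prior_net(y_tm1)
    p = D.Normal(p_mu, p_log_sigma.exp())
    if training:
        # posterior q(z_t | future observations): training only
        q_mu, q_log_sigma = posterior_net(future_feat)
        q = D.Normal(q_mu, q_log_sigma.exp())
        z_t = q.rsample()                    # reparameterized posterior sample
        kl = D.kl_divergence(q, p).sum(-1)   # loss term pulling the prior toward the posterior
        return z_t, kl
    # inference: sample from the learned prior; no posterior input at all
    return p.sample(), None
```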

As my experiments above demonstrate, the way the posterior distribution is used makes the model rely more on the posterior input than on the features perceived from the images during inference.

Please correct me if there is something wrong with my idea. Thanks.

> The evaluation results only cover the prediction phase, which is predicted without the posterior distribution, using only the prior.

This phrasing is slippery.

Following this logic, could one claim that the validity of an end-to-end framework can still be verified by evaluating the model only on the prediction phase, even if the conditioning inputs or the hidden states contain GT information?

If so, then it would also be acceptable to omit the perception module entirely and feed the GT labels of the conditioning frames directly into the prediction phase. But that clearly does not meet the requirements of an end-to-end framework and thus is not comparable with other end-to-end approaches.

Doubts about StretchBEV:

> You are right: if we use only the prior distribution, the performance drops.

Does this also suggest that the performance improvement of StretchBEV-P is due to information leakage?

> However, we proposed StretchBEV for this purpose. In StretchBEV, the posterior distribution only uses future state information extracted from images, not the GT labels.

I fully believe in the correctness and feasibility of StretchBEV and agree that it can be used as a benchmark.


But unfortunately, the performance of the model itself is far inferior to that of the FIERY baseline, and even with pre-training it can barely beat FIERY. So what is the benefit of StretchBEV? Or does this indicate that the SRVP structure is suited to video prediction rather than BEV prediction?

In conclusion, my questions can be summarized as follows:

  1. Is the posterior distribution containing GT labels in StretchBEV-P improperly passed into the prediction part during inference, i.e., is there information leakage?
  2. What is the advantage of StretchBEV over FIERY?

Thank you again for your explanation; I look forward to a more substantive reply!

kaanakan (Owner) commented Dec 9, 2022

Hello,

Thank you for your questions. The term "information leakage" is a little tricky here: it makes it sound as if StretchBEV-P uses labels to predict the future that is then evaluated. The GT labels are used only in the conditioning phase (t < receptive_field) for the posterior distribution; after that point, the prior distribution is used to predict future states, and no information comes from the labels at those steps. So there is no way the information leaks into the prediction phase. I hope this answers your first question. There are methods in the curriculum learning literature for bringing the prior distribution close to the posterior; they are beyond the scope of this work, but applying them can improve the quality of the prior and remove the need for the posterior distribution at inference.
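For example, a scheduled-sampling-style scheme from that literature (sketched here with hypothetical names; not something this repo implements) gradually swaps posterior samples for prior samples during training:

```python
import random

def select_latent(z_posterior, z_prior, step, total_steps, training):
    """During training, use the prior sample instead of the posterior sample
    with a probability that grows over training, so the prior learns to stand
    on its own; at inference, always use the prior."""
    if not training:
        return z_prior
    p_use_prior = min(1.0, step / total_steps)  # hypothetical linear schedule
    return z_prior if random.random() < p_use_prior else z_posterior
```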

For the second question, StretchBEV performs on par with or a little worse than FIERY. However, there are two essential points. First, the most important aspect of our work is diversity: both of our models generate much more diverse results for a given input, which we demonstrate both quantitatively and qualitatively. Second, StretchBEV can make use of pre-training: you can use a pre-trained backbone with StretchBEV and train on unlabeled data for better performance. In this work we only use the nuScenes dataset for unsupervised pre-training, but any autonomous driving dataset could be used to first pre-train the model in an unsupervised way; the model can then be trained with supervision for better performance.

Best.
