
feat: update for probabilistic label tree reduction (#2766) - support for --probabilities option and fixed compatibility with VW 9+ version #4138

Merged

Conversation

@mwydmuch (Contributor) commented Sep 2, 2022

Hello!

This PR aims to add support for probabilities output for the PLT reduction and to fix the reduction in version 9+.

So far, I have added support for the probabilities option, which makes the reduction output (label_index, probability) pairs as action_score pairs. I also added an additional dataset to the demo that needs only a few seconds to run.

Unfortunately, while implementing support for probabilities output, I noticed that the PLT reduction stopped working correctly between the 8.11.0 and 9.0.0 releases. You can run the new demo I added on v8.11.0 and v9.0.0 and notice that on 8.11.0 it has nice performance, while on 9+ it returns an almost constant prediction. In the past, there was a performance test for this reduction in the test dir, but it seems to me that it was removed before the 9.0.0 release, which is probably why the regression went unnoticed. So this needs to be fixed. The release notes for 9.0.0 do not mention anything that could be the reason for this. Were there any modifications to single_learner (learner<char, example>) that could change its behavior? Any pointers?

Model format changes:

  • By default, we fail when loading models from versions earlier than 9.7
  • A user can override this with --plt_force_load_legacy_model, which will be removed in the next release
  • Removed the save/load of resume
  • nodes_time is now unconditionally saved in the model. In addition, its size is saved in the model, and we assert that t and nodes_time.size() stay in sync.
    • This is a format change due to the removal of resume and the size field prefixed when writing nodes_time; therefore, these models cannot be loaded by VW binaries prior to 9.7
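
For illustration, here is a minimal self-contained sketch of the size-prefixed nodes_time format described above (plain C++ file I/O, not VW's actual save_load code):

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <vector>

// Illustrative sketch (not VW's actual save_load) of the 9.7 format change:
// nodes_time is written unconditionally, prefixed with its size, and the
// loader asserts that the stored size matches t.
void save_nodes_time(std::ofstream& model_file, const std::vector<float>& nodes_time)
{
  uint64_t size = nodes_time.size();
  model_file.write(reinterpret_cast<const char*>(&size), sizeof(size));
  model_file.write(reinterpret_cast<const char*>(nodes_time.data()),
      static_cast<std::streamsize>(size * sizeof(float)));
}

std::vector<float> load_nodes_time(std::ifstream& model_file, uint64_t t)
{
  uint64_t size = 0;
  model_file.read(reinterpret_cast<char*>(&size), sizeof(size));
  assert(size == t && "t and nodes_time.size() must stay in sync");
  std::vector<float> nodes_time(size);
  model_file.read(reinterpret_cast<char*>(nodes_time.data()),
      static_cast<std::streamsize>(size * sizeof(float)));
  return nodes_time;
}
```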

@mwydmuch mwydmuch changed the title [WIP] Update for probabilistic label tree reduction [WIP] Update for probabilistic label tree reduction (#2766) Sep 2, 2022
@mwydmuch mwydmuch changed the title [WIP] Update for probabilistic label tree reduction (#2766) [wip] update for probabilistic label tree reduction (#2766) Sep 2, 2022
@jackgerrits (Member) commented Sep 2, 2022

Thanks so much for this!

I am not aware of any PLT tests being removed. The testing infrastructure changed, though. I see a number of PLT tests in the main test file:

"desc": "check plt training",
Is that what you are referring to, or do you mean different tests?

I am not aware of any changes that came in that time frame that would break things. I will spend some time today comparing 8.11.0 and 9.0.0 with the new demo you added and let you know what I find.

EDIT: This bug, which was fixed in 9.3.0, definitely broke PLT for a few versions. Does 9.3.0 behave more as expected? Unfortunately, the tests that were in place did not catch this bug.

@mwydmuch (Contributor Author) commented Sep 2, 2022

@jackgerrits I apologize; I tried to search for that test but somehow didn't find it initially. Thank you for linking it.
The test trains the model with the --sgd option (probably added for speed), while the demos run without --sgd, since the default adaptive optimizer usually gives better predictive performance. My quick test shows that with --sgd it works in both v8 and v9, but with adaptive, the results are good in v8 but not in v9 (I'm using the current master), so something probably changed in the behavior of the adaptive optimizer between versions. I will try to investigate further later.

@jackgerrits (Member)

The main difference between 8 and 9 is save_resume becoming the default. To go back to pre-9 behavior, --predict_only_model can be used when saving and loading models. Perhaps that is the difference?

@mwydmuch (Contributor Author) commented Sep 2, 2022

This seems to be it! With --predict_only_model, it works correctly. OK, thank you, @jackgerrits; I will try to fix it.

@mwydmuch mwydmuch changed the title [wip] update for probabilistic label tree reduction (#2766) update for probabilistic label tree reduction (#2766) Nov 13, 2022
@mwydmuch mwydmuch changed the title update for probabilistic label tree reduction (#2766) feat: update for probabilistic label tree reduction (#2766) - support for --probabilities option and fixed compatibility with VW 9+ version Nov 13, 2022
@zwd-ms zwd-ms self-assigned this Nov 14, 2022
@mwydmuch mwydmuch changed the title feat: update for probabilistic label tree reduction (#2766) - support for --probabilities option and fixed compatibility with VW 9+ version feat: (WIP) update for probabilistic label tree reduction (#2766) - support for --probabilities option and fixed compatibility with VW 9+ version Nov 16, 2022
@mwydmuch mwydmuch changed the title feat: (WIP) update for probabilistic label tree reduction (#2766) - support for --probabilities option and fixed compatibility with VW 9+ version feat: update for probabilistic label tree reduction (#2766) - support for --probabilities option and fixed compatibility with VW 9+ version Nov 26, 2022
@jackgerrits (Member)

Sorry about the delay in getting back to you, @mwydmuch. I'm working on this now and hope to get it merged tomorrow.

> Is it ok? It also seems to me to be inconsistent with other binary and multiclass reductions that output a single line per example.

We should stick with a single line per example. Outputting either probabilities or just labels based on probabilities is the standard way to do this. Generally speaking, the output of --predictions matches the output prediction type of the reduction, and since the flag changes the output type, it's all consistent here.

> 65:0.630946 66:0.754826 67:0.625329

This still looks sorted by label index. Looks fine to me.

> Modified loss reporting. The MULTILABEL::output_example calculates hamming loss (and expects output labels to be sorted by indices); I overwrite this for PLT (mostly by copying and modifying the code from multilabel.cc). For plt, a sum of binary losses is returned during training and 0 for prediction. Not sure if this is ok, but it seemed to be the most logical solution for this reduction.

> Otherwise, the updated output looks reasonable to me. It would be better to document these somewhere in the wiki later.

I agree with Wangda. Totally reasonable. Reductions get to decide how they calculate loss, so it's good. Documenting it in the wiki would be fantastic.

> 1. It seems that MULTILABEL::output_example does not use all.sd->ldict, is that intended?

Multilabel-based reductions do not currently make use of this, so it should be ignored for now during output.
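
For context on the loss discussion above, here is a minimal sketch (not VW's multilabel.cc) of a hamming-loss computation over label lists sorted by index, which shows why sorted order is assumed:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal sketch of hamming loss between two label lists sorted by index;
// the merge-style walk below is why sorted order matters. It counts labels
// present in exactly one of the two lists.
size_t hamming_loss(const std::vector<uint32_t>& truth, const std::vector<uint32_t>& pred)
{
  size_t loss = 0, i = 0, j = 0;
  while (i < truth.size() && j < pred.size())
  {
    if (truth[i] == pred[j]) { ++i; ++j; }         // correctly predicted label
    else if (truth[i] < pred[j]) { ++loss; ++i; }  // missed true label
    else { ++loss; ++j; }                          // spurious predicted label
  }
  loss += (truth.size() - i) + (pred.size() - j);  // remaining unmatched labels
  return loss;
}
```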

@jackgerrits (Member)

@mwydmuch CI should be happy now. I noticed that there aren't any tests for the new --probabilities functionality. Would you be able to add a test or two for that? As soon as that is done, we can go ahead and merge this.

@mwydmuch (Contributor Author)

@jackgerrits, @zwd-ms thank you for your comments and improvements, and sorry again for the long response time.

> I noticed that there aren't any tests for the new --probabilities functionality.

I've added tests for the --probabilities option.

> I agree with Wangda. Totally reasonable. Reductions get to decide how they calculate loss, so it's good. Documenting it in the wiki would be fantastic.

Cool, I will be happy to add a wiki page describing the PLT reduction; I have time to do it before the end of the year.

(...)

> This still looks sorted by label index. Looks fine to me.

Sorry, I gave a bad example here. The PLT outputs labels in the order they are found during the tree search. In the case of top-k, they are naturally ordered by decreasing probability. For prediction with a threshold, the ordering looks even more random (both probabilities and label indices are mixed); for example:

84:0.555875,67:0.86485,51:0.67529,31:0.522847,33:0.864

If ordering by label index is a hard assumption in VW, I can add some additional sorting so the output is always ordered by label index.
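
For illustration, a minimal self-contained sketch of such a sort, using the threshold-prediction example above (action_score here is a simplified stand-in for VW's pair type):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified stand-in for VW's (label index, probability) action_score pair.
struct action_score { uint32_t action; float score; };

int main()
{
  // The threshold-prediction example from above, in tree-search order.
  std::vector<action_score> preds = {
      {84, 0.555875f}, {67, 0.86485f}, {51, 0.67529f}, {31, 0.522847f}, {33, 0.864f}};
  // Sort by label index to match the ordering MULTILABEL-based output expects.
  std::sort(preds.begin(), preds.end(),
      [](const action_score& a, const action_score& b) { return a.action < b.action; });
  return 0;
}
```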

An additional question about a small quality-of-life improvement:
Is it possible to set --loss_function to logistic by default for the plt reduction (and if so, what is the intended way of doing it)? Right now, it always needs to be passed as an argument. I think there was code doing this in the v8.x implementation, but it was changed later in v9.x.

@jackgerrits (Member)

> Sorry, I gave a bad example here. The PLT outputs labels in the order they are found during the tree search. In the case of top-k, they are naturally ordered by decreasing probability. For prediction with a threshold, the ordering looks even more random (both probabilities and label indices are mixed); for example:
>
> 84:0.555875,67:0.86485,51:0.67529,31:0.522847,33:0.864
>
> If ordering by label index is a hard assumption in VW, I can add some additional sorting so the output is always ordered by label index.

I think it is okay to leave it as is; all of the information is there, so a user can sort as they wish. If you wanted to, you could sort it, but I'm not sure whether sorting by class or by probability makes more sense. This can easily be fixed later if it is an issue for anyone.

> An additional question about a small quality-of-life improvement: Is it possible to set --loss_function to logistic by default for the plt reduction (and if so, what is the intended way of doing it)? Right now, it always needs to be passed as an argument. I think there was code doing this in the v8.x implementation, but it was changed later in v9.x.

It is possible to override the loss_function type (example), but I think that example is not quite correct. We need to override the option value in addition to setting all.loss, and we should print a warning if we are overriding a user-supplied loss function.
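
A rough sketch of that pattern (plain C++, not VW's actual options API; resolve_plt_loss and its inputs are hypothetical):

```cpp
#include <iostream>
#include <set>
#include <string>

// Hypothetical sketch of the pattern described above: default plt's loss to
// logistic, but warn instead of silently overriding when the user explicitly
// supplied a different --loss_function.
std::string resolve_plt_loss(const std::set<std::string>& supplied_options,
                             const std::string& user_loss)
{
  if (supplied_options.count("loss_function") == 0) { return "logistic"; }  // apply the default
  if (user_loss != "logistic")
  {
    std::cerr << "warning: plt is usually used with --loss_function logistic, but got "
              << user_loss << "\n";
  }
  return user_loss;  // respect the user's explicit choice
}
```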

I think both of these things are low priority and this could be merged without them (and/or they could be fixed later), so if you'd rather just see this merged, then I'm happy.

@mwydmuch (Contributor Author)

@jackgerrits Thanks for fixing the warning.

> I think both of these things are low priority and this could be merged without them (and/or they could be fixed later), so if you'd rather just see this merged, then I'm happy.

OK, then let's merge it; I'm happy with the state it's in right now.
I will write a wiki page on the plt reduction soon.

@jackgerrits (Member) commented Dec 21, 2022

Just noticed the changes to save_load; are they necessary? If we make changes to this function, we'll need to make sure we can still load models written by older versions of VW but write the new version going forward.

If the change is necessary I can go in and fix it up.

Comment on lines 345 to 348
```cpp
for (size_t i = 0; i < p.t; ++i)
{
  bin_text_read_write_fixed(
      model_file, reinterpret_cast<char*>(&p.nodes_time[i]), sizeof(p.nodes_time[0]), read, msg, text);
}
```
Member:

It looks like this was changed to no longer use the resume value. To avoid changing the binary format of the model, we can read/write this value but just ignore it.

Contributor Author:

@jackgerrits the small modification of this function is needed because the previous one has a bug that makes models load incorrectly (when resume is true). I've modified it to read/write the resume variable as you suggested. I've tested it, and it now works correctly with previous versions.

```cpp
std::stringstream msg;
msg << ":" << resume << "\n";
```
Member:

This should be left in.

Contributor Author:

I've added it back.

```cpp
bin_text_read_write_fixed(model_file, reinterpret_cast<char*>(&resume), sizeof(resume), read, msg, text);

if (resume && !p.all->weights.adaptive)
```
Member:

What's the reasoning behind this change?

@mwydmuch (Contributor Author) commented Dec 21, 2022

It seems to me that the previous version is wrong because, during prediction, p.all->weights.adaptive is always false (at least if no additional arguments are provided), so if the model was trained with p.all->weights.adaptive == true, it's later loaded in the wrong way. Removing this and depending only on the resume variable fixes the problem. Additionally, in the current version, the nodes_time array that is saved is used for all optimizer types (see the learn_node function).

Contributor Author:

I think this is better now, since the current version does not depend on optimizer specifics. Even if the optimizer does not use sd->t (like adaptive), writing/reading one additional float per tree node is a small price for more general code.

@mwydmuch (Contributor Author) commented Dec 21, 2022

And since !p.all->weights.adaptive == true during prediction, removing it still keeps compatibility with most of the older models: those trained without resume (the default in v8.x) and those trained with resume and an optimizer other than adaptive (in v9.x).

@jackgerrits (Member) commented Dec 21, 2022

I just spent some time fully understanding this situation. The previous !p.all->weights.adaptive == true check is actually quite troublesome: if the value of adaptive changes when loading a saved model, then the model is corrupted (it is relatively easy to change this value due to how gd's options work). When loading an older model (<9.7) with this save_load function, most code paths are invalid. The default (no optimizer flags) will also be invalid. The following table summarizes the different combinations:

| Model produced by VW | Adaptive value when saving | Adaptive value when loading | Result |
| --- | --- | --- | --- |
| <9.7 | false | true | valid |
| <9.7 | true | true | invalid (current default) |
| <9.7 | false | false | valid |
| <9.7 | true | false | invalid |
| >=9.7 | n/a | n/a | valid |

We cannot easily distinguish between the adaptive value at save time and at load time in the current system (technically possible, but tricky). Additionally, the way adaptive becomes true or false is a bit complex.

Additionally, any non-gd base optimizer actually results in undefined behavior, because the p.all->weights.adaptive value is uninitialized if gd is not the base.

Based on all these facts, I think the safest thing to do is disallow loading of <9.7 plt models but allow an escape hatch, just for version 9.7, to force the load.

We will need to bump the version to 9.7 in this PR for the check to work.

If this all sounds good to you, I am happy to push a commit making these changes.
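
A minimal sketch of such a load-time gate (the names and version comparison are illustrative, not VW's actual model-file code):

```cpp
#include <stdexcept>
#include <tuple>

// Illustrative sketch of the proposed gate: refuse PLT models written before
// 9.7 unless the escape-hatch flag was passed.
struct vw_version { int major; int minor; };

void check_plt_model_version(vw_version model_version, bool force_load_legacy_model)
{
  const vw_version plt_fix{9, 7};
  bool older = std::tie(model_version.major, model_version.minor) <
      std::tie(plt_fix.major, plt_fix.minor);
  if (older && !force_load_legacy_model)
  {
    throw std::runtime_error(
        "PLT models written before VW 9.7 cannot be loaded reliably; "
        "pass --plt_force_load_legacy_model to force loading.");
  }
}
```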

Contributor Author:

@jackgerrits Thank you for putting your time into this. Yes, your table summarizes the situation well. I agree that disallowing loading, or at least emitting some warning, is a good thing in this situation. So I'm happy with your proposal.

Member:

Great, thanks for confirming! I'll fix this up tomorrow, and then we should be good to merge.

Member:

Just pushed the change:

  • By default, we fail when loading models from versions earlier than 9.7
  • A user can override this with --plt_force_load_legacy_model, which will be removed in the next release
  • I removed the save/load of resume, as you had originally done in this PR
  • nodes_time is now unconditionally saved in the model. In addition, its size is saved in the model, and we assert that t and nodes_time.size() stay in sync.
    • This is a format change due to the removal of resume and the size field prefixed when writing nodes_time; therefore, these models cannot be loaded by VW binaries prior to 9.7

@jackgerrits jackgerrits merged commit 758ed66 into VowpalWabbit:master Dec 22, 2022