Multiple Evaluator Datasets #120
Conversation
Should we resolve PR Gate issues before doing reviews, in case the code changes?
Great work! I like the structure. Overall, we should probably work on more purposeful docstrings that help the user understand why they need to fill out each field -- it'll go a long way towards making our library more user-friendly.
Otherwise, I'm curious how we want to design our abstraction for TQDM steps. Should evaluation time be part of the TQDM logger? Curious for thoughts from others.
I like where this is headed. Another thought I had: do we want to allow some evaluators to run every batch while others run every epoch, or more generally let different evaluators run at different frequencies? Perhaps the validate_every_n_epochs and validate_every_n_batches parameters could be incorporated into the evaluator hparams.
Strong +1 to Ravi. I was starting to use this and I realized that I need evaluators to run every N steps during training on the validation set, and then run once after training is done on the test set. @anisehsani do we support this use case?
Great progress - biggest thing is my comment on the metric_registry
Just checking, where do we stand on this?
1. Made the `Evaluator` hold metrics and metrics only, not metric names. The model is now passed in on `evaluator.initialize_object`. This cleans up a bunch of typing later on.
2. Moved the `Evaluator` to its own file, so types.py remains small.
3. Fixed typing issues throughout.
I started to leave a review and then realized it would be easier if I just implemented some of the changes directly, so I added a commit for that. I'll continue the review after lunch.
outdated - substantial changes implemented
Looking really close! Just a final few comments.
LGTM, thanks for all the work here!
…s 3 and 4 (mosaicml#178) For the timing abstraction (mosaicml#146), the DataloaderSpec needed two additional functions -- get_num_samples_in_batch and get_num_tokens_in_batch. Moved the DataSpec class to composer.core, as the DataSpec is now bound directly to the state; mosaicml#120 will also need this change. This PR implements parts 3 and 4 of the timing abstraction (mosaicml#146). The implementation differs from the GH issue by adding num_tokens, num_samples, get_batch_size, and get_num_tokens to the new DataSpec rather than the PyTorch dataset class.
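As a rough illustration, a DataSpec carrying those two hooks might look like the following; the attribute layout here is an assumption based on the commit message, not the actual class.

```python
# Hypothetical sketch -- attribute names follow the commit message; the real class may differ.
from dataclasses import dataclass
from typing import Any, Callable

from torch.utils.data import DataLoader


def _default_num_samples(batch: Any) -> int:
    # Assume (inputs, targets) batches; count the first dimension of the inputs.
    return len(batch[0])


def _default_num_tokens(batch: Any) -> int:
    # Non-NLP workloads can report zero tokens.
    return 0


@dataclass
class DataSpec:
    """Binds a dataloader to hooks the timing abstraction uses to count progress."""
    dataloader: DataLoader
    get_num_samples_in_batch: Callable[[Any], int] = _default_num_samples
    get_num_tokens_in_batch: Callable[[Any], int] = _default_num_tokens
```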
Allows a model to be evaluated on multiple datasets, each with metrics specifically relevant to that dataset. For example, the GLUE NLP tasks use different metrics for each dataset/task.
The basic data structure introduced is the Evaluator, which takes three parameters:
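A minimal sketch, assuming the three parameters are a label, a dataloader, and the metrics computed on that dataloader; the exact argument names and import path below are assumptions, not necessarily the final API.

```python
# Hypothetical sketch -- parameter names and import path are assumptions, not the final API.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchmetrics import Accuracy

from composer.core import Evaluator  # import path assumed

# A small placeholder validation set.
val_dataset = TensorDataset(torch.randn(32, 8), torch.randint(0, 2, (32,)))

evaluator = Evaluator(
    label="my_eval_task",                                 # a name used to namespace logged metrics
    dataloader=DataLoader(val_dataset, batch_size=8),     # the data to evaluate on
    metrics=Accuracy(task="multiclass", num_classes=2),   # metrics computed on this dataset only
)
```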
When a user doesn't want to use Evaluator objects to create a trainer, they can still use the val_dataset arguments; the trainer will wrap the dataloader in an Evaluator object with the validation metrics taken from the model.
Before Evaluators are created, the Hparams system creates EvaluatorSpecs, which are passed to the Trainer, which then creates Evaluator objects and stores them in the state. The EvaluatorSpecs are similar to the Evaluators, but contain DataloaderSpecs instead of data loaders.
Here is an example of how you would add evaluator fields to your YAML.
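The snippet below is a hypothetical sketch of such a YAML section; the field names (label, eval_dataset, metric_names) follow the surrounding description, and the dataset and metric entries are purely illustrative.

```yaml
# Hypothetical sketch -- field names, dataset hparams, and metric names are illustrative only.
evaluators:
  - label: glue_mnli
    eval_dataset:
      mnli: {}            # dataset hparams for this evaluation set
    metric_names:
      - Accuracy
  - label: glue_stsb
    eval_dataset:
      stsb: {}
    metric_names:
      - SpearmanCorrCoef
```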
The metrics in the YAML should have the same names as the metrics registered in the EvaluatorHparams class. To add a custom metric, you would add it to the metric_registry dictionary in the EvaluatorHparams class.
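As a hedged sketch of what that might look like, the custom metric below is a toy torchmetrics Metric, and the registry contents are assumptions rather than the actual dictionary.

```python
# Hypothetical sketch -- the registry's real contents and location may differ.
import torch
from torchmetrics import Accuracy, Metric


class MyCustomMetric(Metric):
    """Toy metric: fraction of predictions equal to zero (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.add_state("zeros", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        self.zeros += (preds == 0).sum().float()
        self.total += preds.numel()

    def compute(self) -> torch.Tensor:
        return self.zeros / self.total


# The registry maps the name used in the YAML to the metric class;
# adding an entry makes the metric available via metric_names.
metric_registry = {
    "Accuracy": Accuracy,
    "MyCustomMetric": MyCustomMetric,
}
```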
Alternatively, when using the eval_dataset argument, the trainer wraps the metrics associated with the model in an Evaluator, so custom metrics can also be supplied by the model itself rather than through the metric registry.
Metrics and How They are Populated
There are currently several ways to add metrics to an evaluator: by using the metric_names field, by directly including metrics, or by falling back to the model defaults.
Example: you want to evaluate metric1 on dataset1 and metric2 on dataset2. You can create one Evaluator per dataset, each holding only the metrics relevant to that dataset, as sketched below.
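A hedged sketch of that setup, with placeholder datasets; the Evaluator arguments and the trainer parameter name are assumptions.

```python
# Hypothetical sketch -- constructor arguments and the trainer parameter name are assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchmetrics import Accuracy, MeanSquaredError

from composer.core import Evaluator  # import path assumed

# dataset1 is a classification task, dataset2 a regression task.
dataset1 = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
dataset2 = TensorDataset(torch.randn(64, 8), torch.randn(64))

# metric1 (Accuracy) is evaluated only on dataset1; metric2 (MSE) only on dataset2.
evaluator1 = Evaluator(
    label="dataset1",
    dataloader=DataLoader(dataset1, batch_size=16),
    metrics=Accuracy(task="multiclass", num_classes=2),
)
evaluator2 = Evaluator(
    label="dataset2",
    dataloader=DataLoader(dataset2, batch_size=16),
    metrics=MeanSquaredError(),
)

# Both evaluators would then be handed to the Trainer (parameter name assumed):
# trainer = Trainer(model=model, train_dataloader=train_dl, evaluators=[evaluator1, evaluator2])
```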