
Add mono models to huggingface model-zoo and incorporate into pipeline #29

Merged · 9 commits · May 24, 2020
Conversation

MXueguang (Member)

No description provided.

device = torch.device(options.device)

try:
rodrigonogueira4 (Member)

Instead of try/except, should we pass a flag to distinguish PT from TF checkpoints?

ronakice (Member)

This entire segment of code should give an error since loader is not defined anymore?

MXueguang (Member Author) · May 23, 2020

@rodrigonogueira4 Yes, I feel a flag makes more sense. Will add a --from_tf flag to evaluate_passage_ranker.py.
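A minimal sketch of such a flag (the exact option names in evaluate_passage_ranker.py may differ from what is shown here):

```python
import argparse

# Sketch of the proposed --from_tf flag; option names are assumptions.
parser = argparse.ArgumentParser()
parser.add_argument('--model_name_or_path', default='castorini/monot5-base-msmarco')
parser.add_argument('--from_tf', action='store_true',
                    help='treat model_name_or_path as a TensorFlow checkpoint')
options = parser.parse_args(['--from_tf'])

# transformers' from_pretrained accepts a from_tf keyword, so the
# try/except can become an explicit choice, e.g.:
#   model = T5ForConditionalGeneration.from_pretrained(
#       options.model_name_or_path, from_tf=options.from_tf)
print(options.from_tf)
```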

MXueguang (Member Author)

@ronakice oh, I didn't notice there was a one-line difference between my gcloud and local copies. Will remove that line.

Member

HuggingFace's workaround is something like bool("ckpt" in model_name_or_path)
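That heuristic can be written as a one-line check (a sketch; the variable names and example paths are assumptions, and the bool() wrapper is redundant since `in` already yields a bool):

```python
# Infer the checkpoint format from the path name, following the
# HuggingFace heuristic. Example paths are hypothetical.
tf_path = 'experiments/367/model.ckpt-1004000'
pt_path = 'castorini/monot5-base-msmarco'

from_tf = 'ckpt' in tf_path  # True: looks like a TF checkpoint
from_pt = 'ckpt' in pt_path  # False: no 'ckpt' in the name
print(from_tf, from_pt)
```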

MXueguang (Member Author) · May 23, 2020

model_name_or_path may be a directory that contains both a "ckpt" file and pytorch_model.bin (see https://huggingface.co/google/bert_uncased_L-10_H-128_A-2#list-files), so maybe it still needs to be specified by the user?

Member

To avoid problems with PT files containing ckpt in their name, I would just go with the flag.

ronakice (Member)

I feel like the model in the model-zoo is missing the vocab file?

MXueguang (Member Author) · May 24, 2020

> I feel like the model in the model-zoo is missing the vocab file?

@ronakice Is this because we are using t5-base as tokenizer?

ronakice (Member) · May 24, 2020

https://huggingface.co/valhalla/t5-base-squad#list-files — compare this with our list-files, and it seems like our model is missing a few files! Not sure if all of them are needed, though.

MXueguang (Member Author) · May 24, 2020

Currently in construct_t5():

  • we load the model with T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")
  • we load the tokenizer with AutoTokenizer.from_pretrained("t5-base")

so they come from different models on HuggingFace.

I think the missing files belong to the tokenizer (e.g. the tokenizer config and spiece.model / vocab file). The pretrained model itself doesn't require these files. (The original gs://neuralresearcher_data/doc2query/experiments/367 doesn't contain vocab files either.)

2020-05-24 02:04:42 [INFO] configuration_utils: loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/t5-base-config.json from cache

2020-05-24 02:04:42 [INFO] tokenization_utils: loading file https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model from cache

Do we need to find a way to integrate the pretrained tokenizer files to our model files?
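One way to bundle the tokenizer files with the model weights would be to save both into the same directory and upload that (a sketch, assuming the standard transformers save_pretrained API; the output directory name is hypothetical):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration


def export_combined(out_dir: str = 'monot5-base-msmarco-combined') -> None:
    """Save model weights and tokenizer files into one directory."""
    # Load weights and tokenizer from their current, separate sources.
    model = T5ForConditionalGeneration.from_pretrained('castorini/monot5-base-msmarco')
    tokenizer = AutoTokenizer.from_pretrained('t5-base')

    # Writing both into one directory yields a single set of files:
    # pytorch_model.bin and config.json from the model,
    # spiece.model and the tokenizer config from the tokenizer.
    model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
```

Uploading the resulting directory to the model-zoo would let both from_pretrained calls point at the same name.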

MXueguang (Member Author)

> I don't think we need default='false' if action='store_true'. Could you double-check?

Yes, it can be removed.
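For reference, argparse already defaults a store_true flag to False, so the explicit default is redundant (a minimal, self-contained check; the flag name is reused from the discussion above):

```python
import argparse

parser = argparse.ArgumentParser()
# No default= needed: action='store_true' implies default=False.
parser.add_argument('--from_tf', action='store_true')

print(parser.parse_args([]).from_tf)             # omitted  -> False
print(parser.parse_args(['--from_tf']).from_tf)  # present  -> True
```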

MXueguang (Member Author)

Revalidated the experiment results on the latest update:

monot5: [screenshot: 581590290744_pic_hd]

monoBERT: [screenshot: 591590294894_pic_hd]

rodrigonogueira4 (Member) left a review comment

LGTM, thanks for implementing this!

ronakice (Member)

Great! We can merge this. Thanks for doing this @MXueguang
