
Am I properly using stanza offline (coref English model - Electra Large)? #1399

Open
Zappandy opened this issue Jul 2, 2024 · 5 comments

@Zappandy commented Jul 2, 2024

I'm currently attempting to run a stanza pipeline, originally built on my local machine, on an HPC with no access to the Hugging Face Hub or the stanza model server. To work around this, I downloaded all of the models I needed and set download_method to None. While this worked for most of the English processors, the coreference processor bypassed the local files and kept trying to download the google/electra-large model.

Even after setting environment variables, HF_HUB_CACHE to the path where the HF cache is stored on the HPC and HF_HUB_OFFLINE='1', the from_pretrained calls in the bert.py script in the models/coref directory kept attempting to download files. I found that to avoid any downloads, the local_files_only parameter of the from_pretrained method must be set to True (I verified this locally with no internet connection).
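For reference, a minimal sketch of the offline check described above (the cache path is a placeholder for wherever the HF cache lives on the HPC, and google/electra-large stands in for whatever checkpoint stanza actually requests):

    import os

    # Placeholder path: wherever the HF cache is stored on the HPC
    os.environ["HF_HUB_CACHE"] = "/scratch/hf_cache"
    os.environ["HF_HUB_OFFLINE"] = "1"

    from transformers import AutoModel, AutoTokenizer

    # With local_files_only=True, transformers loads from the local cache
    # and raises an error instead of attempting any network download.
    tokenizer = AutoTokenizer.from_pretrained("google/electra-large",
                                              local_files_only=True)
    model = AutoModel.from_pretrained("google/electra-large",
                                      local_files_only=True)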

Unless I'm missing something, with the current setup I don't see how I can pass this parameter to the from_pretrained calls in bert.py without hard-coding it in the script, since the config object used there is not the stanza config dictionary I defined. It seems the config object read in the script is fetched from the model .pt file via torch.load, which of course means it won't contain the local_files_only parameter.

Am I missing something, or is this the expected behavior?

@AngledLuffa (Collaborator)

Thanks, this is a good observation. So what I'm hearing is that we need some way to pass local_files_only to the code path(s) that load the transformer models, right? But probably also to this line, which doesn't have any config at all:

    model = AutoModel.from_pretrained(config.bert_model).to(config.device)
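A hedged sketch of what threading that flag through might look like; this is illustrative only, not the actual patch, and it assumes the coref config supports attribute access like the line above:

    # Illustrative: read an (assumed) local_files_only entry from the coref
    # config, defaulting to False, and forward it to transformers.
    local_files_only = getattr(config, "local_files_only", False)
    model = AutoModel.from_pretrained(config.bert_model,
                                      local_files_only=local_files_only).to(config.device)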

@Zappandy closed this as completed Jul 3, 2024
@Zappandy reopened this Jul 3, 2024
@Zappandy closed this as not planned Jul 3, 2024
@Zappandy reopened this Jul 3, 2024
@Zappandy (Author) commented Jul 3, 2024

Yes. I don't know how feasible it would be to expose arbitrary transformers configuration through the stanza pipeline config dictionary the user defines. That may be too much, but at minimum, for an offline mode, local_files_only should be passed to every from_pretrained call whenever the user has set a cache directory where the models and tokenizers are stored.

An alternative is just to pass the local path to the from_pretrained methods, but this is less portable.
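For example (the directory is hypothetical):

    # Point from_pretrained at a local directory holding the downloaded
    # checkpoint; no network access is needed, but the hard-coded path
    # ties the setup to one machine.
    model = AutoModel.from_pretrained("/hpc/models/electra-large")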

@AngledLuffa (Collaborator)

Are you comfortable using branches? We made the local_files_only branch so that download_method=None no longer downloads from HF either, in addition to not downloading Stanza models.

The only caveat is that the coref model has changed: the new one detects singletons and uses xlm-roberta as the base model.
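If the branch works for you, usage should look like a normal offline pipeline, roughly (a sketch, assuming the coref processor and its dependencies are already downloaded locally):

    import stanza

    # With the branch, download_method=None should skip both stanza and
    # HF downloads, loading everything from the local caches instead.
    nlp = stanza.Pipeline("en", processors="tokenize,coref",
                          download_method=None)
    doc = nlp("Barack Obama was born in Hawaii. He was elected in 2008.")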

@AngledLuffa (Collaborator)

Fixed on dev? #1408

@Zappandy (Author)

Thanks, yeah, I'm comfortable using branches. I'll test it on the dev branch and try to report back ASAP.
