
Issues Training Models for New Language Support #1343

Open
LilitKharatyan opened this issue Feb 14, 2024 · 25 comments

@LilitKharatyan
I am trying to train a pipeline for a new language (xcl). My goal is to train the full pipeline (tokenizer, tagger, parser, lemmatizer, and morphological parser) for this language, starting with the tokenizer. I've followed the instructions provided in the official documentation and GitHub repository closely but have encountered several issues that hinder my progress.

Here are the steps I've taken and the issues encountered:

  1. Setting Up Environment and Data: After organizing my .conllu files for training and validation as per the guidelines, I set the environment variables in config.sh and sourced it. My data is for the language code xcl, which is not recognized by Stanza, so I used HY (Armenian) as a temporary workaround.

  2. Training the Tokenizer: When attempting to train the tokenizer using the command
    python3 -m stanza.utils.training.run_tokenizer HY_Classical_Armenian

I encountered a FileNotFoundError related to missing .toklabels files, which should have been generated during the data preparation step. The exact error message was:

FileNotFoundError: [Errno 2] No such file or directory: 'data/tokenize/hy_classical_armenian-ud-train.toklabels'

This indicates that either the preparation step was missed or did not complete successfully, or there's a mismatch in the expected directory structure or naming convention. However, following the instructions from the documentation and GitHub, it wasn't clear how to proceed with the preparation step for a language not yet recognized by Stanza.
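For reference, here is the flow I pieced together from the training documentation. The module and environment variable names are as documented; the treebank name mirrors the one I passed to run_tokenizer, and the paths are just my local choices, which may be exactly where things go wrong for an unrecognized code:

    # config.sh -- training environment (paths are examples from my setup)
    export UDBASE=$HOME/ud_treebanks        # contains the treebank directory
    export DATA_ROOT=$HOME/stanza_data      # processed training files land here
    export TOKENIZE_DATA_DIR=$DATA_ROOT/tokenize

    source config.sh

    # Data preparation -- as far as I understand, this is the step that
    # should produce the missing *.toklabels files:
    python3 -m stanza.utils.datasets.prepare_tokenizer_treebank HY_Classical_Armenian

    # Training:
    python3 -m stanza.utils.training.run_tokenizer HY_Classical_Armenian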

  • Could you provide more detailed instructions or clarification on how to correctly prepare the data for model training, especially for new languages not currently supported by Stanza? This includes generating the necessary .toklabels files.

  • Is there a recommended approach to adding support for entirely new languages, ensuring that all necessary preprocessing and setup steps are covered?

  • Any advice on troubleshooting or steps I might have overlooked would be greatly appreciated. I am especially interested in any scripts or commands specific to preparing data for new languages.

@AngledLuffa
Collaborator

AngledLuffa commented Feb 14, 2024 via email

@LilitKharatyan
Author

Thank you very much, I managed to train the tokenizer, lemmatizer, POS tagger, and dependency parser. Now I cannot deploy them: whenever I try to run the trained models, for some reason the English models come up instead. How can I run my models from the saved models folder? Also, since we have quite good results and a whole pipeline ready, how can we add XCL to the official languages and make our models public? Thanks
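For reference, this is roughly how I am trying to load them. The paths are examples from my local setup, and the per-processor *_model_path / *_pretrain_path keyword arguments are the ones I found in the Pipeline documentation:

    import stanza

    # Point each processor at the locally trained files instead of the
    # downloaded defaults (paths are examples from my setup).
    nlp = stanza.Pipeline(
        lang='hy',  # stand-in code, since 'xcl' is not recognized yet
        processors='tokenize,lemma,pos,depparse',
        tokenize_model_path='saved_models/tokenize/hy_classical_armenian_tokenizer.pt',
        lemma_model_path='saved_models/lemma/hy_classical_armenian_lemmatizer.pt',
        pos_model_path='saved_models/pos/hy_classical_armenian_tagger.pt',
        depparse_model_path='saved_models/depparse/hy_classical_armenian_parser.pt',
        pos_pretrain_path='saved_models/pos/hy_classical_armenian.pretrain.pt',
        depparse_pretrain_path='saved_models/depparse/hy_classical_armenian.pretrain.pt',
    )
    doc = nlp('Աստուած')  # sample Classical Armenian word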

@AngledLuffa
Collaborator

AngledLuffa commented Feb 26, 2024 via email

@AngledLuffa
Collaborator

Please let us know if we can help host the models - the main thing we need to be able to rebuild them going forward is the code changes and links to the data sources.

@LilitKharatyan
Author

Thank you for your response! I am currently retraining the models with some additional data; they should be ready in an hour or so, and then I will try running them. If I still run into problems, I will get back to you!

@AngledLuffa
Collaborator

AngledLuffa commented Mar 18, 2024 via email

@LilitKharatyan
Author

Thank you for the suggestions.
We now have a tokenizer, lemmatizer, POS tagger, and dependency parser for Classical Armenian, with pretty good results. Could you please guide us on whether (and how) we can add them to the list of languages Stanza supports? If so, what do we need to provide? Thank you!

@AngledLuffa
Collaborator

The easiest way is if you're able to turn your dataset into a UD treebank:

https://universaldependencies.org/

If that's not an option, but you have some other repo for the data, please let us know. Any code changes or new scripts you needed to convert the data to the Stanza input formats would also be very helpful.
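For a brand-new language code, the code change is usually just registering it in the language constants; a sketch of the kind of entry involved (the file is stanza/models/common/constant.py in the current source; this shows only the new entry, not the full mapping):

    # stanza/models/common/constant.py -- sketch of the new entry only;
    # the real lcode2lang dict maps every supported code.
    lcode2lang = {
        "xcl": "Classical_Armenian",
    }
    lang2lcode = {lang: lcode for lcode, lang in lcode2lang.items()}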

@LilitKharatyan
Author

Thanks for the information. Actually, our dataset is already available in UD, and the models are trained on the latest released UD Classical Armenian dataset. We have the four models I mentioned earlier (tokenizer, lemmatizer, POS tagger, and dependency parser), so we wanted to know whether it is possible to make those models part of the official Stanza package. Thanks

@AngledLuffa
Collaborator

Sounds good. Yes, we can definitely add that as a model. Are there word vectors or other resources we need to build those models, aside from the UD dataset?

@LilitKharatyan
Author

Thank you for your quick answer. Yes, there are word vectors that were used for the training (and for deployment too). We will provide those as well.

@AngledLuffa
Collaborator

Unless there's anything unusual about the training, we'll be happy to use those word vectors to rebuild the models as part of our updates for new UD releases. If there's something specific we have to do, please let us know or make a PR.
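For reference, on our end the vectors get converted into Stanza's .pt pretrain format before training, roughly like this (filenames are examples; the last argument caps the vocabulary size):

    # Convert raw word vectors (text or .xz) into Stanza's pretrain format.
    # Usage: <output .pt> <input vectors> <max vocab size>
    python3 -m stanza.models.common.convert_pretrain xcl.pretrain.pt xcl_vectors.txt 150000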

@LilitKharatyan
Author

Thank you! How would you like me to send you the vectors? And what is the expected new release date? Thanks

@AngledLuffa
Collaborator

Probably late June, on account of other work commitments. People have posted them in Box or Dropbox, for example, in the past.

I may even be able to find some storage at the university which we can share. How big are they?

@LilitKharatyan
Author

Thank you for your prompt response.
We have a specific question regarding the licensing of AI models. Our vectors have been trained on a dataset to which we were granted access under the condition that both the dataset and any AI models derived from it can only be used for non-profit purposes.
Could you please clarify how licensing is regulated on your side? Additionally, do you believe this usage restriction can be effectively controlled? We prefer to license our models under the CC BY-NC-ND 4.0 license. Is this possible within your framework?
Thank you for your assistance.

@manning
Member

manning commented May 28, 2024

Ultimately, like nearly all open source projects, we are dependent on users observing their license terms. (Stanza is licensed under the generous Apache license, which does allow commercial use, but there are still license terms.) We are happy to label the models with their license and to point out that the models are restricted to non-commercial use, but we aren’t in a position to control usage (again, any more than a typical open source project). It’d be part of our offering as a non-profit organization, and we wouldn’t be charging anyone for their use.
Based on that, it’s up to you whether sharing the word vectors you trained in this way would violate your license agreement or not. We doubt that there will be commercial use of Classical Armenian models, but if you really wanted to regulate access, you’d need to keep the models and only make them available to people who have (say) filled out a license agreement.
You could write back to ask the original dataset owners whether or not this use case would violate the agreement or seems okay to them. Some licensors are more strict than others...

@LilitKharatyan
Author

Thank you! Actually, what you said makes sense. We will just ask you to label them with the relevant license, and that's it. The vectors file is not that big; I can even send it via email. What would you be more comfortable with?

@AngledLuffa
Collaborator

For sending us the word vectors, email, Dropbox, or anything else that works for you will be great. Thanks!

@LilitKharatyan
Author

Can I please ask for an email? I will send them right away.

@AngledLuffa
Collaborator

oh, sorry, can you not see it from my account?

horatio@gmail.com

@LilitKharatyan
Author

I can see it, I just wasn't sure it was the right address to send to.
I have just sent you the vectors. If anything comes up, please let me know. Thanks

@AngledLuffa
Collaborator

For the word vectors, is there a citation of some kind or other links I should put on this page?

https://stanfordnlp.github.io/stanza/word_vectors.html#word-vector-sources

@AngledLuffa
Collaborator

Also, does caval refer to the university group in Australia, or is there someone else I should be crediting for this?

@LilitKharatyan
Author

Hi. There is not a specific paper about the word vectors, but we mention them briefly in our paper:
CAVaL refers to the following project of the University of Würzburg.
In case there are more questions, please let us know.

@AngledLuffa
Collaborator

Sounds good. I will add that information to the documentation.

https://stanfordnlp.github.io/stanza/word_vectors.html#classical-armenian
