Question about testing on new data #30

Open
zhaoxy92 opened this issue Jul 31, 2019 · 4 comments

Comments

@zhaoxy92

Hi, I'm trying to run ZOE on a new dataset and have the following questions:

  1. In main.py, should I comment out runner.elmo_processor.load_cached_embeddings("target.min.embedding.pickle", "wikilinks.min.embedding.pickle")? If yes, could you show me how these two files are generated and what the format is for the raw versions of these two files? Currently, running on new data is extremely slow (30 sentences processed overnight). Any idea how I can speed things up?

  2. Are there any other files/data I need to generate for testing on a new dataset? (maybe vocab_test.txt?)

Thank you!

@Slash0BZ
Member

  1. It is slow on non-cached Wikipedia titles, especially on CPUs, because it runs multiple ELMo inferences to generate each title's representation. I could provide a large SQLite file (~72GB) that contains all the Wikipedia titles; do you want me to share it? With that file, you could use this function instead of load_cached_embeddings. I also recommend caching your test set, i.e. storing which candidates are found for each instance, so that you can tune your type inference at low cost. To do this, I would suggest storing the results in a map and pickling that map.

  2. Everything should work fine if you have your type mapping (inference) part working. The previous point only speeds things up, without any impact on the results.
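The test-set caching suggested in point 1 (store the candidates found for each instance in a map, then pickle the map) could look roughly like the sketch below. This is only an illustration, not code from the ZOE repository: `load_cache`, `save_cache`, `cached_candidates`, the `compute_fn` callback, and the choice of cache key are all hypothetical names and assumptions.

```python
import os
import pickle


def load_cache(path):
    """Load a previously pickled {instance_key: candidates} map, or start empty."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}


def save_cache(cache, path):
    """Write the map to a temp file first, then swap it in, so an
    interrupted run does not leave a truncated pickle behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(cache, f, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, path)


def cached_candidates(instance_key, compute_fn, cache):
    """Return cached candidates for an instance, computing them only once.

    compute_fn is a stand-in (assumption) for whatever slow step
    produces the candidates for one test instance.
    """
    if instance_key not in cache:
        cache[instance_key] = compute_fn(instance_key)
    return cache[instance_key]
```

A typical run would call `load_cache` once at start, route every instance through `cached_candidates`, and call `save_cache` at the end (or periodically), so that later runs that only change the type inference skip the slow candidate generation entirely.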

@zhaoxy92
Author

zhaoxy92 commented Jul 31, 2019 via email

@Slash0BZ
Member

Slash0BZ commented Aug 1, 2019

Uploaded the file "elmo_cache_correct.db" to the Google Drive folder https://drive.google.com/drive/u/1/folders/1fD6WfCEPQICGPhxqlwuVmf8uOot-jQq8?ths=true. Sorry for the delay, it's a huge file to upload.

To use it, please refer to the function pointer above, and set server_mode=False.
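The internal layout of the SQLite file is not described in this thread. As a sanity check after downloading, one could list its tables and columns with Python's built-in `sqlite3` module before wiring it into the pipeline. A minimal sketch; `describe_sqlite` is a hypothetical helper name, and the path is assumed to be wherever the file was saved.

```python
import sqlite3


def describe_sqlite(path):
    """Return {table_name: [column names]} for every table in the database."""
    conn = sqlite3.connect(path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # Quote the table name defensively; PRAGMA table_info yields rows of
            # (cid, name, type, notnull, dflt_value, pk).
            quoted = '"' + table.replace('"', '""') + '"'
            columns = conn.execute("PRAGMA table_info(" + quoted + ")").fetchall()
            schema[table] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()
```

Calling `describe_sqlite("elmo_cache_correct.db")` only reads metadata, so it should return quickly even for a file of this size.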

@zhaoxy92
Author

zhaoxy92 commented Aug 9, 2019

Thank you. Downloading it now; I will bother you again if there are any further problems!
