Alternate hosting for hedwig-data #61

Closed
achyudh opened this issue May 29, 2020 · 8 comments
Labels
question Further information is requested

Comments

@achyudh
Member

achyudh commented May 29, 2020

The dataset repo https://git.uwaterloo.ca/jimmylin/hedwig-data isn't ideal: it requires the user to extract the embeddings and process them with a Python script. Do we have any alternate way to host ~5 GB of data that would make it easier for others to replicate our results out of the box?

achyudh added the question label May 29, 2020
achyudh changed the title from "Alternate way to host hedwig-data" to "Alternate hosting for hedwig-data" May 29, 2020
@lintool
Member

lintool commented May 29, 2020

I can check data directly into that repo for you. I think ~5 GB is fine... Point me to what you want checked in.

@achyudh
Member Author

achyudh commented May 29, 2020

I just added the pre-trained BERT weights to that repo, but the repo is now pretty slow. For instance, running git status takes a minute.

@lintool
Member

lintool commented May 29, 2020

Does this use hgf? Why not do the same as here? https://huggingface.co/castorini

@achyudh
Member Author

achyudh commented May 29, 2020

It does use hgf. I guess this is already in the pipeline (#56), but I was looking for something more immediate.

@lintool
Member

lintool commented May 29, 2020

Now that you've checked it in, wget-ing from https://git.uwaterloo.ca/jimmylin/hedwig-data shouldn't be too bad...
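
For anyone who would rather script the download than call wget by hand, here is a minimal sketch using only the Python standard library. The thread does not say what the per-file download URLs on that server look like, so the URL in the usage comment is a purely hypothetical placeholder, and the helper name `download` is made up for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, dest, chunk_size=1 << 20):
    """Stream `url` to `dest` in chunks so a multi-GB file never sits in memory.

    Returns the number of bytes written to disk.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # copyfileobj reads and writes chunk_size bytes at a time
        shutil.copyfileobj(response, out, chunk_size)
    return dest.stat().st_size


# Hypothetical usage; replace the placeholder with the real file URL:
# download("https://example.org/path/to/weights.bin", "hedwig-data/models/weights.bin")
```

This does not resume interrupted transfers or verify checksums, both of which would be worth adding for files of this size.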

@achyudh
Member Author

achyudh commented May 29, 2020

Right now I am looking to eliminate these extra steps:

cd hedwig-data/embeddings/word2vec 
gzip -d GoogleNews-vectors-negative300.bin.gz 
python bin2txt.py GoogleNews-vectors-negative300.bin GoogleNews-vectors-negative300.txt 

@achyudh
Member Author

achyudh commented May 29, 2020

I was trying to get hedwig running and then realized that I need gensim to run bin2txt.py. That dependency is not included in requirements.txt, since hedwig itself doesn't use gensim. It would be better to remove this step altogether.
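
As an illustration of what dropping the gensim dependency could look like (not necessarily what was eventually done), here is a rough sketch that reads the word2vec binary format with only the standard library and accepts the .gz file directly, so the separate gzip -d step is not needed either. It assumes the usual layout of that format: a text header line with the vocabulary size and dimension, then for each word the word's bytes, a space, and the vector as little-endian 32-bit floats. The function names and the six-decimal output formatting are choices made for this sketch, not taken from bin2txt.py.

```python
import gzip
import struct
from pathlib import Path


def _open_maybe_gzip(path):
    """Open a file for binary reading, transparently handling a .gz suffix."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rb")
    return open(path, "rb")


def word2vec_bin_to_txt(src, dst):
    """Convert word2vec binary vectors to the plain-text format.

    Returns (vocab_size, dim) as read from the header line.
    """
    with _open_maybe_gzip(src) as fin, open(dst, "w", encoding="utf-8") as fout:
        header = fin.readline().split()
        vocab_size, dim = int(header[0]), int(header[1])
        fout.write(f"{vocab_size} {dim}\n")

        fmt = f"<{dim}f"  # assumption: little-endian float32 values
        nbytes = struct.calcsize(fmt)

        for _ in range(vocab_size):
            # Each word ends at the first space; a newline left over from
            # the previous entry is skipped.
            chars = bytearray()
            while True:
                ch = fin.read(1)
                if not ch:
                    raise EOFError("unexpected end of file while reading a word")
                if ch == b" ":
                    break
                if ch != b"\n":
                    chars.extend(ch)
            word = chars.decode("utf-8", errors="replace")

            vector = struct.unpack(fmt, fin.read(nbytes))
            fout.write(word + " " + " ".join(f"{v:.6f}" for v in vector) + "\n")
    return vocab_size, dim


# Usage with the file names from the commands above:
# word2vec_bin_to_txt("GoogleNews-vectors-negative300.bin.gz",
#                     "GoogleNews-vectors-negative300.txt")
```

Reading one byte at a time is slow for a vocabulary of this size; the sketch favors clarity over speed.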

@achyudh
Member Author

achyudh commented May 30, 2020

Fixed in #62

achyudh closed this as completed May 30, 2020