A little under 1M identifier embeddings, generated for identifiers extracted from half of PGA in June 2018. New pipeline was used, with splitting and stemming of identifiers, the full description can be found in the "Algorithms" section of the sourced.ml repository.
Example:
from sourced.ml.models import Id2Vec
id2vec = Id2Vec().load("3467e9ca-ec11-444a-ba27-9fa55f5ee6c1")
print("Number of tokens:", len(id2vec))
ID | 3467e9ca-ec11-444a-ba27-9fa55f5ee6c1 |
Uploaded | 2018-07-19 13:14:53.000621 |
Version | 1.0.0 |
File | https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fid2vec%2F3467e9ca-ec11-444a-ba27-9fa55f5ee6c1.asdf |
Size | 1.2 GB |
Data collection date | June 2018 |
Number of tokens | 999,424 |
Size of each embedding | 300 |
License |