Seed random number generators #70
Hi @eric-czech, thanks for the issue! This is an excellent suggestion and good practice for reproducibility. I've wanted to implement this as well (#23), but as you said, I'm not exactly sure how it will work with njit parallel=True. I'll give it a try and see!
Hi @eric-czech, I've created a PR (#71) that introduces a random state option for random walk generation. Could you check it out and see if that is sufficient?
Thanks @RemyLau! I'll give it a try soon and report back.
No good on #71 unfortunately (with a caveat):
import networkx as nx
import pandas as pd
g = nx.random_geometric_graph(100, .1)
pd.DataFrame([
dict(n1=f'N{e[0]}', n2=f'N{e[1]}')
for e in g.edges
]).to_csv('/tmp/edges.csv', sep='\t', index=False, header=False)
# Run once (2 workers)
!pecanpy --input /tmp/edges.csv --output /tmp/edges1.emb --task pecanpy --dimensions 64 --mode FirstOrderUnweighted --random_state 1 --workers 2
# Run a second time (2 workers)
!pecanpy --input /tmp/edges.csv --output /tmp/edges2.emb --task pecanpy --dimensions 64 --mode FirstOrderUnweighted --random_state 1 --workers 2
!cmp /tmp/edges1.emb /tmp/edges2.emb
# /tmp/edges1.emb /tmp/edges2.emb differ: byte 8, line 2
But it does work with only one worker now, which wasn't the case before:
!pecanpy --input /tmp/edges.csv --output /tmp/edges1.emb --task pecanpy --dimensions 64 --mode FirstOrderUnweighted --random_state 1 --workers 1
!pecanpy --input /tmp/edges.csv --output /tmp/edges2.emb --task pecanpy --dimensions 64 --mode FirstOrderUnweighted --random_state 1 --workers 1
!cmp /tmp/edges1.emb /tmp/edges2.emb
# All good
Do you know if numba will let you pass a RandomState as an input (for node2vec_walks)? I wonder if that would work.
🤔 Actually, I don't think that would work either, even if it did let you do that (which it doesn't).
@eric-czech I think this might be an issue with gensim word2vec rather than with the random walk generation. I've explicitly tested the reproducibility of the walks. I'm happy to find out how to control the gensim word2vec random seed as well, but before that, could you check whether the walks (not necessarily the final embeddings) are consistent between runs? To generate walks, you could use the following (hopefully bug-free 🤞) code snippet:
from pecanpy import pecanpy
# path_to_edg is the path to your edge list file, e.g. '/tmp/edges.csv'
g = pecanpy.FirstOrderUnweighted(random_state=1)
g.read_edg(path_to_edg, weighted=False, directed=False)
walks = g.simulate_walks(num_walks=10, walk_length=80)
In terms of gensim word2vec random state, there's a
I think for now I'll just set the
Gave it a shot but no luck. I tried:
import networkx as nx
import pandas as pd
from pecanpy import pecanpy
pd.DataFrame([
dict(n1=f'N{e[0]}', n2=f'N{e[1]}')
for e in nx.random_geometric_graph(100, .1).edges
]).to_csv('/tmp/edges.csv', sep='\t', index=False, header=False)
g = pecanpy.FirstOrderUnweighted(random_state=0, workers=1)
g.read_edg('/tmp/edges.csv', weighted=False, directed=False)
walks1 = g.simulate_walks(num_walks=10, walk_length=80)
g = pecanpy.FirstOrderUnweighted(random_state=0, workers=1)
g.read_edg('/tmp/edges.csv', weighted=False, directed=False)
walks2 = g.simulate_walks(num_walks=10, walk_length=80)
(pd.Series(walks1) == pd.Series(walks2)).value_counts()
# False 900
# True 40
# dtype: int64
If I dump that into a script that just generates the walks from some pre-existing
Makes sense given Line 9 in 71dd988
Overall I suppose it's not that big of a deal if the Word2Vec part can't be parallelized. Thanks for checking that in the docs. It's a bummer though! Let me know if you find anything else, but feel free to close this otherwise.
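For reference, as far as I recall from the gensim Word2Vec documentation, a fully deterministic run needs `workers=1` in addition to `seed`, and reproducibility across interpreter launches also requires fixing the `PYTHONHASHSEED` environment variable. The sketch below illustrates only that second point using the standard library. It does not call gensim or PecanPy, and `hash_in_fresh_interpreter` is a hypothetical helper written for this illustration.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, hash_seed):
    """Return hash(text) as computed by a newly launched Python process.

    Hypothetical helper: it pins PYTHONHASHSEED for the child process so
    that string hashing is not randomized per launch.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Two separate interpreter launches agree once the hash seed is pinned.
first = hash_in_fresh_interpreter("N1", hash_seed=0)
second = hash_in_fresh_interpreter("N1", hash_seed=0)
print(first == second)
```

In practice this would mean launching the embedding script with the variable set in the shell (for example `PYTHONHASHSEED=0` before the command), since it has to be in place before the interpreter starts.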
For posterity, my example above with two runs in the same Python process does work with
Thanks a lot, @eric-czech! For the time being, I haven't come up with a good solution for this reproducibility issue under multi-threading. I'll keep this issue open for now, and hopefully I'll be able to find something later to mitigate it (at least for the random walk part).
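One common way to make parallel walk generation independent of thread scheduling is to give every walk its own generator, seeded from the global random state plus the walk's index. The sketch below shows the idea with the standard library only. It is not PecanPy's implementation and does not use numba; `GRAPH`, `single_walk`, and `simulate_walks` are hypothetical names for a toy first-order walk.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Toy adjacency list standing in for a real graph (hypothetical data).
GRAPH = {
    0: [1, 2],
    1: [0, 2, 3],
    2: [0, 1, 3],
    3: [1, 2],
}


def single_walk(job):
    """Generate one unweighted first-order walk with its own RNG."""
    random_state, walk_idx, start, walk_length = job
    # The seed depends only on (random_state, walk_idx), so the walk does
    # not depend on which thread runs it or in what order.
    rng = random.Random(f"{random_state}-{walk_idx}")
    walk = [start]
    while len(walk) < walk_length:
        walk.append(rng.choice(GRAPH[walk[-1]]))
    return walk


def simulate_walks(random_state, num_walks, walk_length, workers):
    """Run num_walks walks from every node using a thread pool."""
    starts = sorted(GRAPH) * num_walks
    jobs = [
        (random_state, idx, start, walk_length)
        for idx, start in enumerate(starts)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in job order, not completion order.
        return list(pool.map(single_walk, jobs))


walks_serial = simulate_walks(1, num_walks=10, walk_length=20, workers=1)
walks_threaded = simulate_walks(1, num_walks=10, walk_length=20, workers=4)
print(walks_serial == walks_threaded)
```

The trade-off is one generator setup per walk, which is cheap here but may matter in a tight compiled loop. Whether the same pattern carries over to a numba `parallel=True` kernel is an open question in this thread.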
Awesome project @RemyLau, thanks for sharing your work on it!
Is there a way to seed the embedding functions/classes so that they produce the same results every time? I didn't see anything to that end in https://github.com/krishnanlab/PecanPy_benchmarks.
I haven't tried it yet, but I'd assume after a brief look at the code that seeding the numpy generator both within and outside of a numba function might do it (something like numba/numba#6002 (comment)). I'm not sure if that will work with @njit(parallel=True) though. Have you already figured out how to make that work?