Seed random number generators #70

eric-czech · 2022-02-11T13:06:33Z

Awesome project @RemyLau, thanks for sharing your work on it!

Is there a way to seed the embedding functions/classes so that they produce the same results every time? I didn't see anything to that end in https://github.com/krishnanlab/PecanPy_benchmarks.

I haven't tried it yet, but I'd assume after a brief look at the code that seeding the numpy generator both within and outside of a numba function might do it (something like numba/numba#6002 (comment)). I'm not sure if that will work with @njit(parallel=True) though. Have you already figured out how to make that work?


RemyLau · 2022-02-11T14:23:14Z

Hi @eric-czech, thanks for the issue! This is an excellent suggestion and good practice for reproducibility! I've wanted to implement this as well (#23), but as you said, I'm not exactly sure how it will work with @njit(parallel=True). I'll give it a try and see!

RemyLau · 2022-02-11T21:21:08Z

Hi @eric-czech, I've created a PR that introduces the random state option for random walk generation #71. Could you check that out and see if that is sufficient?

eric-czech · 2022-02-12T11:40:26Z

Thanks @RemyLau! I'll give it a try soon and report back.

eric-czech · 2022-02-16T00:43:12Z

No good on #71 unfortunately (with a caveat):

import networkx as nx
import pandas as pd
g = nx.random_geometric_graph(100, .1)
pd.DataFrame([
    dict(n1=f'N{e[0]}', n2=f'N{e[1]}')
    for e in g.edges
]).to_csv('/tmp/edges.csv', sep='\t', index=False, header=False)

# Run once (2 workers)
!pecanpy --input /tmp/edges.csv --output /tmp/edges1.emb --task pecanpy --dimensions 64 --mode FirstOrderUnweighted --random_state 1 --workers 2
# Run a second time (2 workers)
!pecanpy --input /tmp/edges.csv --output /tmp/edges2.emb --task pecanpy --dimensions 64 --mode FirstOrderUnweighted --random_state 1 --workers 2
!cmp /tmp/edges1.emb /tmp/edges2.emb
# /tmp/edges1.emb /tmp/edges2.emb differ: byte 8, line 2

But it does work with only one worker now, which wasn't the case before:

!pecanpy --input /tmp/edges.csv --output /tmp/edges1.emb --task pecanpy --dimensions 64 --mode FirstOrderUnweighted --random_state 1 --workers 1
!pecanpy --input /tmp/edges.csv --output /tmp/edges2.emb --task pecanpy --dimensions 64 --mode FirstOrderUnweighted --random_state 1 --workers 1
!cmp /tmp/edges1.emb /tmp/edges2.emb
# All good
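For anyone without `cmp` available, the same byte-level check can be sketched in Python with the standard library. The helper name `files_identical` and the throwaway file contents below are illustrative assumptions, not real pecanpy output.

```python
import filecmp
import os
import tempfile


def files_identical(path_a, path_b):
    """Return True when both files contain exactly the same bytes."""
    # shallow=False forces a content comparison instead of relying on
    # os.stat() metadata alone.
    return filecmp.cmp(path_a, path_b, shallow=False)


# Self-contained demo: small placeholder files stand in for the .emb outputs.
with tempfile.TemporaryDirectory() as tmp:
    contents = {
        "edges1.emb": "1 2\nN0 0.1 0.2\n",
        "edges2.emb": "1 2\nN0 0.1 0.2\n",
        "edges3.emb": "1 2\nN0 0.1 0.3\n",
    }
    paths = {}
    for name, text in contents.items():
        paths[name] = os.path.join(tmp, name)
        with open(paths[name], "w") as handle:
            handle.write(text)

    same = files_identical(paths["edges1.emb"], paths["edges2.emb"])
    different = files_identical(paths["edges1.emb"], paths["edges3.emb"])

print(same, different)
```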

Do you know if numba will let you pass a RandomState as an input (for node2vec_walks)? I wonder if that would work.

eric-czech · 2022-02-16T00:53:22Z

🤔 actually I don't think that would work either even if it did let you do that (which it doesn't).

RemyLau · 2022-02-16T01:03:16Z

@eric-czech I think this might be an issue with gensim word2vec rather than with the random walk generation. I've explicitly tested the reproducibility of the walks. I'm happy to find out how to control the gensim word2vec random seed as well, but before that, could you check whether the walks (not necessarily the final embeddings) are consistent between runs?

To generate walks, you could use the following (hopefully bug-free 🤞) code snippet:

from pecanpy import pecanpy

g = pecanpy.FirstOrderUnweighted(random_state=1)
g.read_edg(path_to_edg, weighted=False, directed=False)
walks = g.simulate_walks(num_walks=10, walk_length=80)

RemyLau · 2022-02-16T01:17:12Z

As for the gensim word2vec random state, there's a seed parameter that we can set for this purpose. However, the gensim docs note that to get fully deterministic, reproducible results, we need to do two things:

- Use a single thread
- Set the PYTHONHASHSEED environment variable before launching Python

I think for now I'll just set the seed parameter to the value passed to the pecanpy CLI, and leave the two steps above up to the user. I'll check whether I can get consistent results by doing these.
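To illustrate why PYTHONHASHSEED has to be set before the interpreter starts, here is a small standard-library sketch that launches fresh interpreters and compares a string hash across them. The helper name `string_hash_in_fresh_interpreter` is hypothetical. On the gensim side, the corresponding settings would be the `seed` and `workers=1` arguments of `Word2Vec`; that part is not exercised here.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(hash_seed):
    """Start a new Python process with PYTHONHASHSEED set and return hash('node2vec')."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('node2vec'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Two separate interpreter launches with the same hash seed agree on the hash.
first_hash = string_hash_in_fresh_interpreter(0)
second_hash = string_hash_in_fresh_interpreter(0)
print(first_hash == second_hash)
```

Setting the variable from inside an already running process (for example via `os.environ`) does not change that process's string hashing, which is why it has to be exported in the shell or passed to the child process as above.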

eric-czech · 2022-02-16T13:52:17Z

> To generate walks, you could use the following (hopefully bug-free 🤞) code snippet

Gave it a shot but no luck. I tried:

import networkx as nx
import pandas as pd
from pecanpy import pecanpy

pd.DataFrame([
    dict(n1=f'N{e[0]}', n2=f'N{e[1]}')
    for e in nx.random_geometric_graph(100, .1).edges
]).to_csv('/tmp/edges.csv', sep='\t', index=False, header=False)

g = pecanpy.FirstOrderUnweighted(random_state=0, workers=1)
g.read_edg('/tmp/edges.csv', weighted=False, directed=False)
walks1 = g.simulate_walks(num_walks=10, walk_length=80)

g = pecanpy.FirstOrderUnweighted(random_state=0, workers=1)
g.read_edg('/tmp/edges.csv', weighted=False, directed=False)
walks2 = g.simulate_walks(num_walks=10, walk_length=80)

(pd.Series(walks1) == pd.Series(walks2)).value_counts()
# False    900
# True      40
# dtype: int64

If I dump that into a script that just generates the walks from some pre-existing /tmp/edges.csv, then I can get identical walks by calling numba.set_num_threads(1) at the beginning of the script. It doesn't work with more than one worker/thread though.

> I've explicitly tested the reproducibility of the walks

Makes sense given that PecanPy/test/test_walk.py (line 9 in 71dd988) calls set_num_threads(1).

Overall I suppose it's not that big of a deal if the Word2Vec part can't be parallelized. Thanks for checking that in the docs. It's a bummer though!

Let me know if you find anything else but feel free to close this otherwise.

eric-czech · 2022-02-16T14:00:57Z

For posterity, my example above with two runs in the same Python process does work if numba.set_num_threads(1) is called first.

RemyLau · 2022-02-16T14:17:15Z

Thanks a lot, @eric-czech! At the moment, I haven't come up with a good solution for this reproducibility issue with multi-threading. I'll keep this issue open for now, and hopefully I'll be able to find something later to mitigate it (at least for the random walk part).
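One generic way to make parallel walks independent of thread scheduling is to give every walk its own generator, seeded from the base seed and the walk's index, instead of sharing one generator across threads. The sketch below shows the idea in plain Python with a thread pool and a toy graph. It is not how PecanPy is implemented, all names (`GRAPH`, `walk_seed`, `random_walk`, `simulate_walks`) are hypothetical, and whether the same pattern works inside a numba parallel loop is an untested assumption.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Toy undirected graph as an adjacency list (illustrative data only).
GRAPH = {
    0: [1, 2],
    1: [0, 2, 3],
    2: [0, 1, 3],
    3: [1, 2],
}


def walk_seed(base_seed, walk_index):
    """Derive a seed that depends only on the base seed and the walk index."""
    return base_seed * 1_000_003 + walk_index


def random_walk(start, walk_length, seed):
    """First-order unweighted walk driven by a private generator."""
    rng = random.Random(seed)  # not shared with any other walk or thread
    walk = [start]
    while len(walk) < walk_length:
        walk.append(rng.choice(GRAPH[walk[-1]]))
    return walk


def simulate_walks(num_walks, walk_length, base_seed, workers):
    starts = [node for _ in range(num_walks) for node in sorted(GRAPH)]
    seeds = [walk_seed(base_seed, i) for i in range(len(starts))]
    lengths = [walk_length] * len(starts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, whichever thread ran each walk.
        return list(pool.map(random_walk, starts, lengths, seeds))


walks_one_worker = simulate_walks(3, 5, base_seed=1, workers=1)
walks_four_workers = simulate_walks(3, 5, base_seed=1, workers=4)
print(walks_one_worker == walks_four_workers)
```

Because each walk's randomness depends only on `(base_seed, walk_index)`, the output does not change with the number of workers. The trade-off is the cost of constructing one generator per walk.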

RemyLau linked a pull request Feb 11, 2022 that will close this issue

Random seed for random walk generation #71

Merged

RemyLau removed a link to a pull request Feb 12, 2022

Random seed for random walk generation #71

Merged

RemyLau added the help wanted Extra attention is needed label Feb 16, 2022

RemyLau mentioned this issue Feb 16, 2022

Attach random seed to gensim word2vec #78

Merged
