collect anonymous contributors #37

Merged
merged 8 commits into from
Apr 11, 2021

ericphanson (Member) commented Mar 21, 2021

For repos with > 500 contributors, the GH API only returns 500 (though it seems like sometimes fewer, e.g. 404 contributors for JuliaLang/julia).

This PR passes anon=true so that GH returns all the contributors. For some it returns GH info such as their GH id and login; for others it returns just a name and email, which presumably is not deduplicated if someone used multiple names/emails under the same GH identity.
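As a rough illustration of what passing anon=true amounts to at the REST level, here is a minimal sketch that only builds the request URL for one page of the contributors endpoint. The helper name `contributors_url` and its keyword arguments are hypothetical; the PR itself may well go through a client library rather than building URLs by hand.

```julia
# Hypothetical helper (not the PR's code): build the URL for one page of
# the GitHub "list repository contributors" endpoint with anon=true.
function contributors_url(repo::AbstractString; page::Integer = 1, per_page::Integer = 100)
    return "https://api.github.com/repos/$repo/contributors?anon=true&per_page=$per_page&page=$page"
end

url = contributors_url("JuliaLang/julia")
println(url)
```

Fetching all contributors would then mean requesting successive pages until an empty page comes back.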

To handle the variety of responses, I switched the object we store to a table that uses missing entries for id/login or name/email depending on what we have.
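A minimal sketch of that idea, assuming each API entry has already been parsed into a dictionary: fields that are absent for a given kind of contributor become `missing`. The function name `contributor_row` is hypothetical and this is not the PR's actual implementation; the two sample entries reuse values from the JuliaLang/julia example.

```julia
# Hypothetical helper: normalize one parsed contributor entry into a row
# with a fixed set of fields, using `missing` for whatever GH did not return.
function contributor_row(entry::AbstractDict)
    return (
        login = get(entry, "login", missing),
        id = get(entry, "id", missing),
        name = get(entry, "name", missing),
        email = get(entry, "email", missing),
        type = entry["type"],
        contributions = entry["contributions"],
    )
end

# One identified user and one anonymous contributor.
entries = [
    Dict{String,Any}("login" => "JeffBezanson", "id" => 744556,
                     "type" => "User", "contributions" => 9721),
    Dict{String,Any}("name" => "Jameson Nash", "email" => "jameson@juliacomputing.com",
                     "type" => "Anonymous", "contributions" => 3429),
]

rows = map(contributor_row, entries)
```

Every row has the same field names, so downstream code can treat the result as one table and check `ismissing` where needed.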

Draft since it isn't totally clear if we want to go this way (now we probably have duplicate entries instead of missing entries) and I haven't updated tests or docs.

Example:

julia> AnalyzeRegistry.contribution_table("JuliaLang/julia"; auth) |> DataFrame
1277×6 DataFrame
  Row │ login            id       name             email                              type       contributions 
      │ String?          Int64?   String?          String?                            String     Int64         
──────┼────────────────────────────────────────────────────────────────────────────────────────────────────────
    1 │ JeffBezanson      744556  missing          missing                            User                9721
    2 │ StefanKarpinski   153596  missing          missing                            User                4238
    3 │ missing          missing  Jameson Nash     jameson@juliacomputing.com         Anonymous           3429
    4 │ ViralBShah        744411  missing          missing                            User                2830
    5 │ Keno             1291671  missing          missing                            User                2019
    6 │ tkelman          5934628  missing          missing                            User                1724
    7 │ timholy          1525481  missing          missing                            User                1582
    8 │ yuyichao          712232  missing          missing                            User                1361
    9 │ kshyatt           828643  missing          missing                            User                1145
  ⋮   │        ⋮            ⋮            ⋮                         ⋮                      ⋮            ⋮
 1270 │ missing          missing  wfrgra           wfrgra@gmail.com                   Anonymous              1
 1271 │ missing          missing  wgmitchener      noreply@github.com                 Anonymous              1
 1272 │ missing          missing  willywalters5    41460481+willywalters5@users.nor…  Anonymous              1
 1273 │ missing          missing  woclass          inkydragon@users.noreply.github.…  Anonymous              1
 1274 │ missing          missing  wooglooskr       paul@milovanov.ca                  Anonymous              1
 1275 │ missing          missing  wschildbach      wschildbach@fermi.franken.de       Anonymous              1
 1276 │ missing          missing  yuebanyishenqiu  thisispwj@outlook.com              Anonymous              1
 1277 │ missing          missing  zsoerenm         zsoerenm@hotmail.de                Anonymous              1
                                                                                              1260 rows omitted

ericphanson (Member, Author) commented

I'm not really sure why it gives Jameson an anonymous response when most of the other top contributors get non-anonymous ones. I wonder if that means those commits weren’t associated with a GH identity? In that case, omitting anon=true might mean we miss those contributions altogether.

giordano (Member) commented Apr 9, 2021

I think I want this, as long as it's easy to distinguish authenticated users from anonymous ones. We can provide the total number of authenticated users as a safe lower bound, and the number of anonymous contributors with a huge caveat that there will be duplicates that aren't easy to disentangle. I'm just not sure whether to carry the email addresses.
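A small sketch of that reporting idea, assuming rows that carry a `type` field with the value "Anonymous" for anonymous contributors. The function name `contributor_counts` is hypothetical, and the sample rows are a hand-picked subset of the example table.

```julia
# Hypothetical helper: split a contributor table into a safe lower bound
# (identified accounts) and a count of anonymous entries that may contain
# duplicates of each other or of identified accounts.
function contributor_counts(rows)
    anonymous = count(r -> r.type == "Anonymous", rows)
    return (identified = length(rows) - anonymous, anonymous = anonymous)
end

rows = [
    (login = "JeffBezanson", name = missing, type = "User", contributions = 9721),
    (login = missing, name = "Jameson Nash", type = "Anonymous", contributions = 3429),
    (login = missing, name = "wfrgra", type = "Anonymous", contributions = 1),
]

counts = contributor_counts(rows)
println("at least ", counts.identified, " contributors, plus ",
        counts.anonymous, " anonymous entries (possibly duplicated)")
```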

ericphanson (Member, Author) commented Apr 9, 2021

The “type” field should make it easy to distinguish the two. The emails were to help deduplicate but I agree maybe we don’t want them if there’s issues around collecting that kind of info and maybe it’s still hard to deduplicate even with them...

I can rm the emails and add tests and docs if we want to go this way?

giordano (Member) commented Apr 9, 2021

> The emails were to help deduplicate but I agree maybe we don’t want them if there’s issues around collecting that kind of info and maybe it’s still hard to deduplicate even with them...

I've seen a few cases where anonymous users are simply GitHub users who committed with an email address not registered in their GitHub account (and for GitHub users we don't have email addresses anyway). I don't really see an easy way to deduplicate these cases. My feeling is that deduplication is a massive mess anyway, even with lots of matching and guessing.

> I can rm the emails and add tests and docs if we want to go this way?

Sounds good!

ericphanson marked this pull request as ready for review April 11, 2021 18:01

ericphanson (Member, Author) commented

We didn't have any tests for contributors and I'm not really sure how to do it with the auth anyway (I guess we could look at mocking but maybe that's overkill here..). So I just removed the email field, changed show to separate out anonymous users, and updated the docs.

giordano (Member) commented

We can use GITHUB_TOKEN. For example, in CI for the runtests step we can do:

env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

and have the contributor test run only when PackageAnalyzer.github_auth() is non-anonymous.
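One way such a guard could look in the test suite is sketched below. It uses the presence of a non-empty GITHUB_TOKEN environment variable as a stand-in for "github_auth() is non-anonymous", which is an assumption; the function `contributor_tests` and its `fetch_table` argument are hypothetical, and the real test would call the package's own contributor function instead.

```julia
using Test

# Hypothetical guard: run the contributor check only when a token is
# available; otherwise skip. `fetch_table` is any zero-argument function
# returning the contributor table.
function contributor_tests(fetch_table; token = get(ENV, "GITHUB_TOKEN", ""))
    if isempty(token)
        @info "Skipping contributor tests: no GITHUB_TOKEN set"
        return :skipped
    end
    table = fetch_table()
    @test !isempty(table)
    return :ran
end

# Stand-in fetcher so the sketch runs without network access.
fake_fetch() = [(login = "JeffBezanson", type = "User", contributions = 9721)]

status_without_token = contributor_tests(fake_fetch; token = "")
status_with_token = contributor_tests(fake_fetch; token = "dummy-token")
```

Without a token the check is skipped rather than failing, so contributors running the tests locally are not forced to configure authentication.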

giordano (Member) commented

Thanks!
