collect anonymous contributors #37
I'm not really sure why it gives Jameson an anonymous response and most of the other top contributors non-anonymous ones. I wonder if that means the commits weren’t associated with a GH identity? In which case maybe omitting [...]
I think I want this, as long as it's easy to distinguish authenticated users from anonymous ones. We can provide the total number of authenticated users as a safe lower bound, and the number of anonymous contributors with a huge caveat that there will be duplicates that aren't easy to disentangle. I'm just not sure whether to carry the email addresses.
The “type” field should make it easy to distinguish the two. The emails were to help deduplicate, but I agree maybe we don’t want them if there are issues around collecting that kind of info, and maybe it’s still hard to deduplicate even with them... I can remove the emails and add tests and docs if we want to go this way?
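To make the counting idea above concrete, here is a minimal sketch in Python. It assumes each entry carries a "type" field whose value is "Anonymous" for name/email-only contributors, as in responses requested with anon=true; the sample data and the helper name count_by_kind are hypothetical, and this is not the package's own code.

```python
from collections import Counter

# Hypothetical sample entries, shaped like a contributors response
# requested with anon=true (values are made up for illustration).
contributors = [
    {"login": "alice", "id": 1, "type": "User", "contributions": 120},
    {"login": "bob", "id": 2, "type": "User", "contributions": 45},
    {"name": "A. Lice", "email": "alice@example.com", "type": "Anonymous", "contributions": 3},
    {"name": "Carol", "email": "carol@example.com", "type": "Anonymous", "contributions": 1},
]


def count_by_kind(entries):
    """Count entries with a GitHub identity versus anonymous ones.

    Assumption: anything whose "type" is not "Anonymous" has a GitHub
    identity. The authenticated count is a safe lower bound on distinct
    people; the anonymous count may include duplicates.
    """
    counts = Counter(
        "anonymous" if entry.get("type") == "Anonymous" else "authenticated"
        for entry in entries
    )
    return {
        "authenticated": counts["authenticated"],
        "anonymous": counts["anonymous"],
    }


summary = count_by_kind(contributors)
```

In this made-up sample, "A. Lice" could well be the same person as "alice", which is exactly the kind of duplicate the anonymous count cannot rule out.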
I've seen a few cases where anonymous users are simply GitHub users who committed with an email address not registered in their GitHub account (and for GitHub users we don't have email addresses anyway), so I don't really see an easy way to deduplicate these cases. My feeling is that deduplication is a massive mess anyway, even with lots of matching and guessing.
Sounds good!
We didn't have any tests for contributors, and I'm not really sure how to do it with the auth anyway (I guess we could look at mocking, but maybe that's overkill here). So I just removed the [...]
We can use
env:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
and have the contributor test run only when [...]
Thanks!
For repos with more than 500 contributors, the GitHub API only returns 500 (though it sometimes seems to return fewer, e.g. 404 for JuliaLang/julia).
This PR passes anon=true so that GitHub returns all the contributors. For some it returns GitHub info such as their GitHub id and login; for others it returns just a name and email, which are presumably not deduplicated if someone used multiple names/emails under the same GitHub identity. To handle the variety of responses, I switched the object we store to a table that uses missing entries for id/login or name/email, depending on what we have.
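As a rough illustration of that table shape, here is a minimal sketch in Python that maps both kinds of response entry onto one fixed set of columns, with None standing in for a missing value. The column list, the helper names to_row and to_table, and the sample data are assumptions for illustration; the PR's actual table may store different fields.

```python
# Assumed column set: id/login for GitHub users, name/email for anonymous
# contributors, plus "type" and "contributions" for both.
COLUMNS = ("id", "login", "name", "email", "type", "contributions")


def to_row(entry):
    """Map one response entry onto the fixed columns, None where absent."""
    return {column: entry.get(column) for column in COLUMNS}


def to_table(entries):
    """Build a list of uniform rows from heterogeneous response entries."""
    return [to_row(entry) for entry in entries]


# Hypothetical entries, shaped like a response requested with anon=true.
raw = [
    {"login": "alice", "id": 1, "type": "User", "contributions": 120},
    {"name": "Carol", "email": "carol@example.com", "type": "Anonymous", "contributions": 1},
]

table = to_table(raw)
```

Every row then has the same keys, so downstream code can check whether login or email is None instead of branching on the shape of each entry.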
Draft since it isn't totally clear if we want to go this way (now we probably have duplicate entries instead of missing entries) and I haven't updated tests or docs.
Example: