GitHub is the largest code repository and home to the largest developer community in the world, with 83+ million developers, 4+ million organizations, and 200+ million repositories on our platform. GitHub is where most of the world’s software development happens, and as such, data on GitHub activity can offer unique insights into developer collaboration, productivity, and innovation over time.
Below is a non-exhaustive list of the most salient entities for studying collaboration on GitHub:
- A repository contains a software project that uses the Git version control system. Available data include: dependency graph (a list of other software projects that a given repository relies on, generated from the manifest files contained in the repository), copyright license, contributors (contributing users, with emails for the 500 most prolific), stars, watchers, programming languages, and more.
- A user can own repositories, participate as a member of Organizations, and collaborate with other users via Issues, Pull Requests, Discussions, etc. Available data include: email, (self-identified) location, company, repositories, and more.
- An organization can own repositories and is made up of Users, whose membership can be public or private. Available data include: email, location, company, repositories, members, and more.
- A commit is a snapshot of the state of a project’s files at a point in time. Available data include: committer name, committer email, time committed, and more.
- A branch is a named variant of a repository with a given set of snapshots applied. Branches are used to develop variations on the software project without changing the current version. Available data include: name, SHA-1 checksum, and more.
- A Pull Request is a request for an authorized user to accept a set of proposed changes to a branch. Available data include: title, body, time opened, time closed, time merged, comments, and more.
- An Issue is a free-form submission associated with a GitHub repository that is often used to track bugs, ask questions, and plan work. Available data include: title, body, comments, time created, time closed, and more.
- A Discussion is a collaborative communication forum associated with a GitHub repository that is geared toward more open-ended conversations than Issues. Available data include: title, body, comments, reactions, and more.
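As an illustration of how the Pull Request timestamps listed above can be turned into a collaboration measure, the following minimal sketch computes the hours between opening and merging a pull request. It assumes records shaped like those returned by the GitHub REST API's pull request listing (fields `created_at`, `closed_at`, `merged_at`). The helper names and the sample records are hypothetical, and no network request is made.

```python
from datetime import datetime, timezone
from typing import Optional

API_ROOT = "https://api.github.com"


def pull_requests_url(owner: str, repo: str) -> str:
    """Build the REST URL that lists a repository's open and closed pull requests."""
    return f"{API_ROOT}/repos/{owner}/{repo}/pulls?state=all"


def parse_timestamp(value: str) -> datetime:
    """Parse a UTC ISO 8601 timestamp such as "2021-03-01T12:00:00Z"."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def hours_to_merge(pull_request: dict) -> Optional[float]:
    """Return hours from opening to merging, or None if never merged."""
    merged_at = pull_request.get("merged_at")
    if merged_at is None:
        return None
    opened = parse_timestamp(pull_request["created_at"])
    merged = parse_timestamp(merged_at)
    return (merged - opened).total_seconds() / 3600


# Hypothetical records for illustration only (not real GitHub data).
sample_pull_requests = [
    {
        "number": 1,
        "created_at": "2021-03-01T12:00:00Z",
        "closed_at": "2021-03-02T18:00:00Z",
        "merged_at": "2021-03-02T18:00:00Z",
    },
    {
        "number": 2,
        "created_at": "2021-03-05T09:00:00Z",
        "closed_at": "2021-03-05T10:30:00Z",
        "merged_at": None,  # closed without being merged
    },
]

durations = {pr["number"]: hours_to_merge(pr) for pr in sample_pull_requests}
print(durations)  # {1: 30.0, 2: None}
```

Pull requests that were closed without merging are reported as `None` rather than dropped, so that a study can separate accepted from rejected contributions.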
- McDermott & Hansen. (2021). Labor reallocation and remote work during COVID-19: Real-time evidence from GitHub. NBER.
- Wright, Nagle & Greenstein. (2021). Open source software and global entrepreneurship: A virtuous cycle. Harvard Business School Working Paper, 20-139.
- Robbins, et al. (2018). Open source software as intangible capital: Measuring the cost and impact of free digital tools. 6th International Monetary Fund Statistical Forum.
Remaining entities relevant to research:
- Actions: product documentation description; API data availability description
- Activity: product documentation description; API data availability description (Note: GHTorrent subscribes to this firehose)
- Codes of conduct: product documentation description; API data availability description
- Code Scanning: product documentation description; API data availability description
- Gists: product documentation description; API data availability description
- Reactions: emoji reactions for issues/pull requests/discussions; API data availability description
- Releases: product documentation description; API data availability description
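To give a sense of how Activity data might be summarized, the sketch below tallies events by type and by repository. It assumes event records shaped like those in GitHub's public events feed, where each event carries a `type` and a `repo` object with a `name`. The function names and sample events are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_event_types(events: Iterable[dict]) -> Counter:
    """Count how many events of each type appear in the feed."""
    return Counter(event["type"] for event in events)


def count_events_per_repo(events: Iterable[dict]) -> Counter:
    """Count events per repository, keyed by the "owner/name" string."""
    return Counter(event["repo"]["name"] for event in events)


# Hypothetical events for illustration only (not real GitHub data).
sample_events = [
    {"type": "PushEvent", "repo": {"name": "octo-org/example"}},
    {"type": "PullRequestEvent", "repo": {"name": "octo-org/example"}},
    {"type": "PushEvent", "repo": {"name": "octo-org/other"}},
    {"type": "IssuesEvent", "repo": {"name": "octo-org/example"}},
]

type_counts = count_event_types(sample_events)
repo_counts = count_events_per_repo(sample_events)
print(type_counts["PushEvent"])         # 2
print(repo_counts["octo-org/example"])  # 3
```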