GitHub Data Dictionary

Introduction

GitHub is the largest code repository and home to the largest developer community in the world with 83+ million developers, 4+ million organizations, and 200+ million repositories on our platform. GitHub is where most of the world’s software development happens, and as such, data on GitHub activity can offer unique insights on developer collaboration, productivity, and innovation over time.

Software Collaboration Entities

Below is a non-exhaustive list of the most salient entities for studying collaboration on GitHub:

  • A repository contains a software project that uses the Git version control system. Available data include: dependency graph (a list of the other software projects a given repository relies on, generated from the manifest files in the repository), copyright license, contributors (contributing users, with emails for the 500 most prolific), stars, watchers, programming languages, and more.
  • A user can own repositories, participate as a member of Organizations, and collaborate with other users via Issues, Pull Requests, Discussions, etc. Available data include: email, (self-identified) location, company, repositories, and more.
  • An organization can own repositories and is made up of Users, whose membership can be public or private. Available data include: email, location, company, repositories, members, and more.
  • A commit is a snapshot of the state of a project’s files at a point in time. Available data include: committer name, committer email, time committed, and more.
  • A branch is a named variant of a repository with a given set of snapshots applied. Branches are used to develop variations on the software project without changing the current version. Available data include: name, SHA-1 checksum, and more.
  • A Pull Request is a request for an authorized user to accept a set of proposed changes to a branch. Available data include: title, body, time opened, time closed, time merged, comments, and more.
  • An Issue is a free-form submission associated with a GitHub repository that is often used to track bugs, ask questions, and plan work. Available data include: title, body, comments, time created, time closed, and more.
  • A Discussion is a collaborative communication forum associated with a GitHub repository that is geared toward more open-ended conversations than Issues. Available data include: title, body, comments, reactions, and more.
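To make the entity descriptions above more concrete, here is a minimal Python sketch of how a researcher might model a few of them in memory. It is an illustration under stated assumptions, not GitHub's actual schema or API: the class names, the `login` and `name` identifiers, the field types, and the `time_to_merge` and `merged_pull_requests` helpers are all hypothetical. Only the listed attributes (email, location, company, license, stars, languages, title, body, time opened/closed/merged, comments) come from the descriptions above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class User:
    login: str  # hypothetical identifier, not named in the data dictionary
    email: Optional[str] = None
    location: Optional[str] = None  # self-identified
    company: Optional[str] = None


@dataclass
class PullRequest:
    title: str
    body: str
    time_opened: datetime
    time_closed: Optional[datetime] = None
    time_merged: Optional[datetime] = None
    comments: List[str] = field(default_factory=list)

    def time_to_merge(self) -> Optional[timedelta]:
        """Return the time from opening to merging, or None if never merged."""
        if self.time_merged is None:
            return None
        return self.time_merged - self.time_opened


@dataclass
class Repository:
    name: str  # hypothetical identifier
    owner: User  # simplification: an organization can also own repositories
    license: Optional[str] = None
    stars: int = 0
    languages: List[str] = field(default_factory=list)
    pull_requests: List[PullRequest] = field(default_factory=list)

    def merged_pull_requests(self) -> List[PullRequest]:
        """Return only the pull requests that have a merge timestamp."""
        return [pr for pr in self.pull_requests if pr.time_merged is not None]
```

A structure like this makes simple collaboration measures easy to express. For example, the time between a Pull Request being opened and merged can be computed per repository from the timestamps listed above.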

Examples of Research Using GitHub Data

Relationship Diagram of Collaboration-related Entities

[Figure: relationship diagram of collaboration-related entities (light and dark theme variants)]

Appendix

Remaining Entities, relevant to research