Data analysis & visualization of "2020 Kaggle Machine Learning & Data Science Survey" to highlight the tools and best practices used by the Data Science Practitioners along with a blog post on Medium.

Ankit-Kumar-Saini/Data_Science

Table of Contents

  1. Dependencies
  2. Project Introduction
  3. Business Understanding
  4. File Descriptions
  5. Results
  6. Licensing, Authors, and Acknowledgements

List of Dependencies

The code should run with no issues on Python 3. Other libraries used in this project are:

  • scikit-learn
  • numpy
  • matplotlib
  • seaborn
  • pandas
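A quick way to confirm the libraries above are installed before opening the notebook is sketched below. It is not part of the repository: the `REQUIRED` mapping and the `missing_packages` helper are illustrative names, and the check only tests whether each module can be found, not which version is installed.

```python
from importlib.util import find_spec

# Distribution name -> import name (illustrative mapping, not from the repo).
# Only scikit-learn differs: it is installed as "scikit-learn" but imported as "sklearn".
REQUIRED = {
    "scikit-learn": "sklearn",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "pandas": "pandas",
}


def missing_packages(required):
    """Return the distribution names whose import module cannot be found."""
    return [dist for dist, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can be installed with `pip install` followed by the distribution name.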

Project Introduction

Every year since 2017, Kaggle has conducted an industry-wide survey that presents a comprehensive view of the state of data science and machine learning. The 2020 survey was live for 3.5 weeks in October and received 20,036 responses from over 55 countries and diverse demographics. Its questions ranged from frequently used ML algorithms, frameworks, cloud platforms, and products to preferred programming languages, among many others.

Business Understanding

As more companies enter the digital world, data science is becoming increasingly important to their growth and development, so the demand for data science practitioners will continue to rise. The field is also evolving continuously, with new tools appearing regularly. Aspiring data scientists therefore need to understand its current tools and practices before entering the field. To understand the current trends, tools, frameworks, and practices in data science, I analyzed the Kaggle 2020 Data Science and Machine Learning Survey (dataset) by answering 43 questions through data and visualization.

Some of the questions are:

  1. What is the highest level of formal education attained by the practitioners in the survey?
  2. Which programming languages do the data science practitioners use on a regular basis?
  3. Which integrated development environments (IDEs) do the data science practitioners use on a regular basis?
  4. Which data visualization libraries or tools do data science practitioners use on a regular basis?
  5. What are the job titles of Data Science practitioners?
  6. Is there a difference in salary of data science practitioners in India and USA? Is there a correlation between education status and salary?
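Questions such as 2 to 4 above are multi-select, so each respondent can pick several answers. A minimal sketch of how such a question could be tallied with pandas follows. It assumes the answers are spread over one column per option with names like `Q7_Part_1`, `Q7_Part_2`, and so on; that naming, the `tally_multi_select` helper, and the sample rows are assumptions for illustration, not code taken from the notebook.

```python
import pandas as pd


def tally_multi_select(df, prefix):
    """Count how often each option was selected across all columns starting with `prefix`."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    # Stack the option columns into one Series, drop unselected (missing) cells, then count.
    answers = pd.concat([df[c] for c in cols]).dropna()
    return answers.value_counts()


# Made-up sample rows; in the real data each column holds either the option text or a missing value.
sample = pd.DataFrame(
    {
        "Q7_Part_1": ["Python", "Python", None, "Python"],
        "Q7_Part_2": ["R", None, "R", None],
        "Q7_Part_3": [None, "SQL", None, None],
    }
)

language_counts = tally_multi_select(sample, "Q7_Part_")
print(language_counts)
```

Passing the prefix with its trailing `_Part_` keeps, say, `Q1` from also matching `Q10` columns. The resulting counts can be handed directly to a bar plot.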

File Descriptions

There is a single notebook available here to showcase work related to the above questions. The notebook contains 4 different visualization sections.

  1. Part 1: Insights from demographic responses of data science practitioners
  2. Part 2: Tools used by the data science practitioners on a regular basis
  3. Part 3: The skills that data science practitioners want to acquire in the next 2 years
  4. Part 4: Bivariate Analysis on specific columns (Comparison between India and USA)

The notebook is self-explanatory, with Markdown cells provided to guide the reader through it.

There is an additional sample_images folder that contains the images of visualizations from the notebook for the purpose of quick demonstration of key findings in the results section below.
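The bivariate comparison in Part 4 can be illustrated with a row-normalized cross-tabulation. The sketch below is a simplified stand-in for the notebook's analysis, not a copy of it. It assumes the country sits in a column named `Q3` and the education level in `Q4`, and that the USA is labeled "United States of America"; the `education_share` helper and the sample rows are made up.

```python
import pandas as pd


def education_share(df, countries, country_col="Q3", education_col="Q4"):
    """Share of each education level within each selected country (every row sums to 1)."""
    subset = df[df[country_col].isin(countries)]
    # normalize="index" divides each row by its total, so countries with
    # different numbers of respondents become directly comparable.
    return pd.crosstab(subset[country_col], subset[education_col], normalize="index")


# Made-up sample rows for illustration only.
sample = pd.DataFrame(
    {
        "Q3": [
            "India", "India", "India", "India",
            "United States of America", "United States of America",
            "Other",
        ],
        "Q4": [
            "Bachelor's degree", "Bachelor's degree", "Bachelor's degree", "Master's degree",
            "Master's degree", "Master's degree",
            "Doctoral degree",
        ],
    }
)

shares = education_share(sample, ["India", "United States of America"])
print(shares)
```

Comparing shares rather than raw counts matters here because the two countries need not have the same number of respondents.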

Results

The key insights from the code can be found at the post available here.

Some visualizations from the data science survey

  • Highest level of formal education
  • Programming Languages
  • IDEs
  • Visualization tools/libraries
  • Machine Learning Frameworks
  • India vs USA (Salary Comparison)
  • India vs USA (Coding experience Comparison)
  • India vs USA (Education Status Comparison)
  • Job Titles

Licensing, Authors, and Acknowledgements

Credit must be given to Kaggle for the data and the Python 3 notebook. You can find the licensing for the data and other descriptive information at the Kaggle link available here. Otherwise, feel free to use the code here as you would like!
