datasets

https://www.openml.org/search?type=data

winequalityN.csv comes from: https://www.kaggle.com/datasets/shelvigarg/wine-quality-dataset

data.csv comes from: https://www.kaggle.com/datasets/shree1992/housedata

cars.csv comes from: https://www.kaggle.com/datasets/abineshkumark/carsdata

Spotify Classifier: https://www.kaggle.com/datasets/geomack/spotifyclassification

https://sports-statistics.com/sports-data/sports-data-sets-for-data-modeling-visualization-predictions-machine-learning/
Michael Jordan and Shaquille O'Neal Career Stats: Classification (Win is the target variable)
NBA shot logs: Data on shots taken during the 2014-2015 season, which player took the shot, where on the floor was the shot taken from, who was the nearest defender, how far away was the nearest defender, time on the shot clock, and much more.
Diamonds: Multiclass Classification and/or Regression
The diamonds dataset is a good size for practice (over 50k samples) and has multiple targets that you can predict as either a regression or a multi-class classification task.
🎯 Targets: ‘carat’ or ‘price’
🔗 Link: Kaggle
📦Dimensions: (53940, 10)
⚙Missing values: No
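
One way to treat a continuous target such as ‘price’ as a multi-class problem is to bin it into labeled ranges. The sketch below is only an illustration: the bin edges, the labels, and the function name `price_to_class` are hypothetical and do not come from the dataset.

```python
from bisect import bisect_right

# Hypothetical bin edges (in the dataset's price units) and class labels.
PRICE_EDGES = [1000, 5000]
LABELS = ["low", "mid", "high"]


def price_to_class(price: float) -> str:
    """Map a continuous price to one of the hypothetical classes."""
    # bisect_right returns how many edges are <= price, which is the bin index.
    return LABELS[bisect_right(PRICE_EDGES, price)]


# Made-up example prices, one per bin.
classes = [price_to_class(p) for p in (326, 1000, 18823)]
print(classes)
```

Keeping the original numeric column alongside the binned one lets the same file serve both the regression and the classification exercise.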

Abalone Dataset: Classification / Regression (Male/Female or Age)
This is a unique dataset from the field of zoology. The task is to predict the age of abalone (a type of mollusk) from several physical measurements. Traditionally, their age is found by cutting the shell through the cone, staining it, and counting the number of rings under a microscope.
🎯 Target: ‘Rings’
🔗 Link: Kaggle
📦Dimensions: (4177, 9)
⚙Missing values: No
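
The ‘Rings’ target is a proxy for age. The original UCI description of this dataset gives age in years as the ring count plus 1.5; assuming that convention applies to this copy of the data, the conversion can be sketched as follows (the function name and sample values are illustrative only).

```python
def estimate_age_years(rings: int) -> float:
    """Approximate abalone age in years from the ring count.

    Assumes the rings + 1.5 convention from the UCI dataset description.
    """
    if rings < 0:
        raise ValueError("ring count cannot be negative")
    return rings + 1.5


# Made-up ring counts, not rows from abalone.csv.
sample_rings = [15, 7, 9]
ages = [estimate_age_years(r) for r in sample_rings]
print(ages)
```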

King County Real Estate Dataset:
This is the dataset for those who are still interested in real estate and house price regression.
🎯 Target: ‘price’
🔗 Link: Kaggle
📦Dimensions: (21613, 17)
⚙Missing values: Yes

Cancer death rate Dataset
This dataset challenges you to predict the cancer mortality rate per capita (per 100,000 people) using several demographic variables. The data were aggregated from a number of sources, including the American Community Survey (census.gov), clinicaltrials.gov, and cancer.gov. Most of the data preparation process can be viewed here.
🎯 Target: ‘TARGET_deathRate’
🔗 Link: Data.world
📦Dimensions: (3047, 33)
⚙Missing values: Yes
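
The target is a rate normalized to 100,000 people. As a minimal sketch of that normalization (the function name and the counts below are made up for illustration and are not taken from cancer_reg.csv):

```python
def rate_per_100k(deaths: float, population: float) -> float:
    """Return the number of deaths per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths * 100_000 / population


# Hypothetical county: 250 deaths in a population of 125,000.
example_rate = rate_per_100k(250, 125_000)
print(example_rate)
```

Normalizing this way makes counties of very different sizes comparable on one scale.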

Life Expectancy (WHO)
How long will a person live? This is one of the hardest unanswered questions in science. Several studies have been undertaken to understand human life and longevity, and this dataset provided by the WHO (World Health Organization) is one of them.
🎯 Target: ‘Life expectancy.’
🔗 Link: Kaggle
📦Dimensions: (2938, 21)
⚙Missing values: Yes

Car prices
The title says it all: predict car prices using variables like mileage, fuel type, transmission, and several domain-specific features. This is also an excellent dataset for flexing your feature engineering muscles.
🎯 Target: ‘selling_price’
🔗 Link: Kaggle
📦Dimensions: (8128, 12)
⚙Missing values: Yes

Binary classification

NBA rookie stats
The first binary classification dataset in the list asks you to predict whether a rookie basketball player will last more than 5 years in the league.
🎯 Target: ‘TARGET_5Yrs’
🔗 Link: Data.world
📦Dimensions: (8128, 12)
⚙Missing values: Yes

Stroke prediction
Another medical dataset asks you to predict whether a patient will have a stroke, based on their medical history and several other interesting features.
🎯 Target: ‘stroke’
🔗 Link: Kaggle
📦Dimensions: (5110, 11)
⚙Missing values: Yes

Water potability
Safe drinking water is the most basic human right and a major influence on health. Using this dataset, you should classify water bodies as potable (drinkable) or not potable using several chemical properties.
🎯 Target: ‘Potability’
🔗 Link: Kaggle
📦Dimensions: (3276, 10)
⚙Missing values: Yes

Smart grid stability
This is an augmented version of the “Electrical Grid Stability Simulated Dataset” created by Vadim Arzamasov. It was donated to UCI and made available on Kaggle. You will be predicting the stability of 4-node smart grid systems (whatever that means).
🎯 Target: ‘stabf’
🔗 Link: Kaggle
📦Dimensions: (60000, 13)
⚙Missing values: No

IBM HR analytics & employee attrition
This fictional dataset, created by IBM data scientists, asks you to uncover which factors lead to employee attrition (whether employees will leave their role).
🎯 Target: ‘Attrition’
🔗 Link: Kaggle
📦Dimensions: (1470, 35)
⚙Missing values: No

Can I eat this mushroom?
Another one-of-a-kind dataset asks you to classify mushrooms as edible or poisonous. It also presents a unique challenge: all features are categorical.
🎯 Target: ‘class’
🔗 Link: Kaggle
📦Dimensions: (8124, 23)
⚙Missing values: Yes
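
Because every mushroom feature is categorical, most models need the values turned into numbers first. One common option is one-hot encoding, sketched here in plain Python. The helper name `one_hot_encode`, the two column names, and the three rows are illustrative assumptions, not an excerpt from mushrooms.csv.

```python
def one_hot_encode(rows, columns):
    """Expand each categorical column into one 0/1 indicator per observed value."""
    # Collect the distinct values seen in each column, in a stable order.
    categories = {col: sorted({row[col] for row in rows}) for col in columns}
    encoded = []
    for row in rows:
        features = {}
        for col in columns:
            for value in categories[col]:
                features[f"{col}={value}"] = 1 if row[col] == value else 0
        encoded.append(features)
    return encoded


# Made-up rows using single-letter codes for each category.
rows = [
    {"cap-shape": "x", "odor": "p"},
    {"cap-shape": "b", "odor": "a"},
    {"cap-shape": "x", "odor": "a"},
]
encoded = one_hot_encode(rows, ["cap-shape", "odor"])
print(encoded[0])
```

With 22 categorical features, the encoded table becomes much wider than the original, which is worth keeping in mind when choosing a model.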

Banknote authentication
Even though this dataset has very few features, I wanted to include it because the task is really interesting: using physical attributes of banknotes, you should classify them as forged or original.
🎯 Target: ‘class’
🔗 Link: Kaggle
📦Dimensions: (1372, 5)
⚙Missing values: No

Adult income dataset
Predict whether a person will end up earning more than 50k using factors like age, education, background, gender, and marital status.
🎯 Target: ‘income’
🔗 Link: Kaggle
📦Dimensions: (48842, 15)
⚙Missing values: Yes

Multi-class classification datasets

Yeast classification
This dataset will give you a small taste of the world of microbiology. You are tasked with classifying proteins from yeast (a fungus) by where in the cell they localize.
🎯 Target: ‘class_protein_localization’
🔗 Link: OpenML
📦Dimensions: (1484, 9)
⚙Missing values: No

mlb_salaries_2014.csv Salaries of players in Major League Baseball at the start of the 2014 season, from the Lahman Baseball Database.
disease_democ.csv Data illustrating a controversial theory suggesting that the emergence of democratic political systems has depended largely on nations having low rates of infectious disease, from the Global Infectious Diseases and Epidemiology Network and Democratization: A Comparative Analysis of 170 Countries.
gdp_pc.csv World Bank data on 2014 Gross Domestic Product (GDP) per capita for the world’s nations, in current international dollars, corrected for purchasing power in different territories.
nations.csv Data from the World Bank Indicators portal, which is an incredibly rich resource. Contains the following fields: iso2c, iso3c (two- and three-letter codes for each country, assigned by the International Organization for Standardization).
oil_production.csv Data on oil production by world region from 2000 to 2014, in thousands of barrels per day, from the U.S. Energy Information Administration.
ucb_stanford_2014.csv Data on federal government grants to UC Berkeley and Stanford University in 2014, downloaded from USASpending.gov.
urls.xls A spreadsheet that we’ll use in webscraping.

Data used in reporting this story, which revealed that some of the doctors paid as “experts” by the drug company Pfizer had troubling disciplinary records:

pfizer.csv Payments made by Pfizer to doctors across the United States in the second half of 2009.
fda.csv Data on warning letters sent to doctors by the U.S. Food and Drug Administration, because of problems in the way in which they ran clinical trials testing experimental treatments. Contains the following variables:
food_stamps.csv U.S. Department of Agriculture data on the number of participants, in millions, and costs, in $ billions, of the federal Supplemental Nutrition Assistance Program from 1969 to 2015.
kindergarten.csv Data from the California Department of Public Health, documenting enrollment and the number of children with complete immunizations at entry into kindergartens in California from 2001 to 2015.
gdp_pc.csv gdp_pc.csvt CSV file with World Bank data on GDP per capita for the world’s nations in 2014, plus an ancillary file that lets QGIS understand the data type of each field.
warming.csv NASA data on the annual average global temperature, from 1880 to 2015, compared to the average from 1951-1980.
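
warming.csv expresses each year's temperature relative to the 1951-1980 average. As a rough illustration of how such a difference from a baseline period is computed, here is a small sketch; the function name and the temperatures are made up and are not NASA values.

```python
from statistics import mean


def anomalies(temps_by_year, base_start, base_end):
    """Return each year's temperature minus the mean over the baseline years."""
    baseline = mean(
        temp for year, temp in temps_by_year.items()
        if base_start <= year <= base_end
    )
    return {year: temp - baseline for year, temp in temps_by_year.items()}


# Made-up annual averages in degrees Celsius.
temps = {1951: 14.0, 1980: 14.2, 2015: 14.9}
result = anomalies(temps, 1951, 1980)
print(result)
```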

Global Terrorism Database
Maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland in College Park, the Global Terrorism Database contains information on more than 150,000 terrorist attacks from 1970 to 2015. It is a rich source of information on terrorist groups across the globe, and the attacks they are responsible for.

You can download the data from here: https://gtd.terrorismdata.com/, selecting the Download full GTD dataset option. An extensive codebook details all of the fields in the data.

The data is provided as a series of spreadsheets in .xlsx format. I suggest that you import this data into OpenRefine before processing any further, and create a new field giving the date of each event in standard YYYY-MM-DD format. This can be derived from the eventid field.
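
If you prefer to do this step outside OpenRefine, the same derivation can be sketched in Python. This assumes that eventid is a 12-digit number whose first eight digits encode the year, month, and day, followed by a four-digit sequence number; check the codebook before relying on it. The function name is hypothetical. Events whose month or day is recorded as 0 (unknown) have no valid calendar date, so the sketch returns None for them.

```python
from datetime import date
from typing import Optional


def eventid_to_iso_date(eventid) -> Optional[str]:
    """Derive a YYYY-MM-DD string from a GTD-style eventid, or None if impossible."""
    # int() also handles ids that a spreadsheet reader returned as floats.
    digits = str(int(eventid)).zfill(12)
    year, month, day = int(digits[:4]), int(digits[4:6]), int(digits[6:8])
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        # Month or day of 0 (unknown), or otherwise not a real calendar date.
        return None


# Made-up ids in the assumed format.
print(eventid_to_iso_date(201507040012))
print(eventid_to_iso_date(197000000001))
```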

Do take care to read the Terms of Use and instructions for citing the source of the GTD data.

images
regression
AirBnB_NYC_2019.csv
BankNote_Authentication.csv
CAR DETAILS FROM CAR DEKHO.csv
Car details v3.csv
IQR.png
Leads Data Dictionary.xlsx
Leads.csv
Life Expectancy Data.csv
LynchingsInUS.csv
LynchingsInUSCodebook.pdf
README.md
Regression_housedata.csv
Resampling.ipynb
Screen Shot 2017-01-25 at 10.20.38 AM.png
Sherrif.org 2_7_18_2022_all_records.csv
Terrorism_Codebook.pdf
UCI_Credit_Card.csv
WA_Fn-UseC_-HR-Employee-Attrition.csv
abalone.csv
adultIncome.csv
adult_data.csv
bank-full.csv
bank.csv
cancer_reg.csv
car data.csv
cars.csv
churn.csv
data.csv
dataset_185_yeast.csv
diamonds.csv
disease_democ.csv
drug200.csv
esg.ipynb
fda.csv
food_stamps.csv
gdp_pc.csv
globalterrorismdb_1993_to_2021.csv
gtd1993_0221dist.xlsx
healthcare-dataset-stroke-data.csv
healthcare_facilities.csv
hotel_bookings.csv
house_price_prediction.csv
housing.csv
ionosphere.csv
iris.csv
kc-house-data.csv
kc_house_data.csv
kindergarten.csv
michael-jordan-nba-career-regular-season-stats-by-game.csv
mlb_salaries_2014.csv
mtcars.csv
mushrooms.csv
nations.csv
nations_withLifeExpectancy.csv
nba_logreg.csv
nba_logreg_original.csv
oil_production.csv
pfizer.csv
pima-indians-diabetes2.data.csv
refine_geocoder.json
sf_test_addresses.tsv
sf_test_addresses_short.tsv
shaq-nba-career-regular-season-stats-by-game.csv
shot_logs.csv
smart_grid_stability_augmented.csv
spotify_classifier.csv
ta_lib.ipynb
ucb_stanford_2014.csv
urls.xls
warming.csv
water_potability.csv
winequalityN.csv


apoorvaa30/data-analytics-datasets
