This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

[wip] Azure ML + Synapse + Spark = OSS Data Science & ML @ Scale


lostmygithubaccount/sparky

Azure ML and Spark

Introduction

This repo

This is an informal collection of demos around Spark on Azure ML via Azure Synapse. I do not know how to write code. Do not take a production dependency on code I write. Use Microsoft official repos and documentation instead.

Data overview

The data is a copy of the NOAA Integrated Surface Data (ISD), moved from Azure Open Datasets to the Azure ML workspace's default storage account.

The data is stored both as compressed Parquet files (~20 GB) and as uncompressed CSV files (~150 GB). There are >1000 individual files. Loaded into a dataframe, the data is ~750 GB. There are ~1.4 B rows.
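
To put those sizes in perspective, here is a rough back-of-the-envelope sketch in Python. It only reuses the approximate figures quoted above; treating 1 GB as 10**9 bytes is an assumption, and the helper names (`bytes_per_row`, `size_ratio`) are made up for illustration and are not part of this repo.

```python
# Approximate figures from the data overview above.
# Assumption: 1 GB is treated as 10**9 bytes.
GB = 10**9
ROWS = 1.4e9  # ~1.4 B rows

SIZES_GB = {
    "parquet (compressed)": 20,
    "csv (uncompressed)": 150,
    "in-memory dataframe": 750,
}


def bytes_per_row(size_gb, rows=ROWS):
    """Average number of bytes per row for a representation of the given size."""
    return size_gb * GB / rows


def size_ratio(larger_gb, smaller_gb):
    """How many times larger one representation is than another."""
    return larger_gb / smaller_gb


# Print the average row footprint for each representation.
for name, size_gb in SIZES_GB.items():
    print(f"{name}: ~{bytes_per_row(size_gb):.0f} bytes/row")

# Compare the CSV and in-memory sizes against the Parquet files.
print(f"csv vs parquet: {size_ratio(150, 20):.1f}x")
print(f"in-memory vs parquet: {size_ratio(750, 20):.1f}x")
```

Under these assumptions the in-memory dataframe is far larger than the files on disk, which is a reasonable thing to keep in mind when sizing a Spark pool for this data. The numbers are only as precise as the rounded figures they come from.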

Prerequisites

Create a Synapse Spark Pool

Create and set up a compute instance

Launch JupyterLab or Jupyter, or use the inline notebook editor

Clone repository
