dadoskawina/PySpark_workshop

First steps with pySpark - workshop

Do you work with large amounts of data? Are you curious how to analyze them effectively? If so, it's time to start using Spark! This workshop will familiarize you with the basic concepts of Spark and the PySpark library. You will learn about RDDs (basic data containers), how to work with them using the map-reduce concept, how to transform the data, and what the difference between transformations and actions is. You will practice solving concrete problems. Finally, you will learn how to visualize and evaluate your solution.


The aim of this workshop is to familiarize participants with the basic concepts of Spark and the PySpark library. The workshop is divided into two parts (as specified in the outline). The first part is an introduction to Spark, RDDs, and the most important actions, transformations, and aggregation functions. The second part contains additional exercises that will help you practice these concepts on your own.


Part I

  1. Setup
  2. What is PySpark and why to use it?
  3. DataFrames and RDDs
  4. Actions
  5. Transformations
  6. Caching DataFrames
  7. Debugging

Part II

  1. Basic exercises
  2. I/O
  3. Advanced exercises - word counting
