PySpark Tutorial

PySpark is the Python API for Apache Spark. It lets you use Python to run computations on large datasets or simply to analyze them.

Install PySpark

pip install pyspark
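
Once installed, a quick way to verify the setup is to start a local SparkSession and print its version. A minimal sketch, assuming a single-machine (local) installation; the app name is illustrative:

```python
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession running locally on all available cores.
spark = SparkSession.builder \
    .appName("pyspark-tutorial") \
    .master("local[*]") \
    .getOrCreate()

print(spark.version)  # prints the installed Spark version

spark.stop()
```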

PySpark brings Spark's distributed processing power to Python.

Key Features of PySpark

Real-time computations:

Because PySpark processes data in memory, it delivers low-latency computation.

Polyglot:

Spark supports multiple languages (Scala, Java, Python, and R), which makes it one of the preferred frameworks for processing huge datasets; PySpark is its Python interface.

Caching and disk persistence:

Datasets can be cached in memory and persisted to disk, so repeated computations reuse results instead of recomputing them (see the sketch after this list).

Fast processing:

Thanks to in-memory execution, PySpark is significantly faster than traditional disk-based Big Data frameworks such as Hadoop MapReduce.

Works well with RDDs:

Python is dynamically typed, which makes it convenient to work with RDDs (Resilient Distributed Datasets) that hold objects of mixed types.
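
As a rough sketch of the caching and persistence feature mentioned above (the app name and data are illustrative):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "caching-demo")  # illustrative app name

rdd = sc.parallelize(range(1_000_000))

# Keep the dataset in memory, spilling partitions to disk if it does not fit.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

# The first action materializes and caches the RDD; later actions reuse it.
print(rdd.sum())
print(rdd.count())

sc.stop()
```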

RDDs (Resilient Distributed Datasets)

RDDs are immutable collections of objects. Since we are using PySpark, these objects can be of multiple types. This will become clearer in the example below.
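
A minimal sketch showing immutability: transformations such as map return a new RDD and leave the original untouched (names and data are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")  # illustrative app name

# An RDD can hold Python objects of any type.
words = sc.parallelize(["spark", "python", "rdd"])

lengths = words.map(len)       # transformation: lazy, returns a new RDD

print(lengths.collect())       # action: [5, 6, 3]
print(words.collect())         # original RDD is unchanged

sc.stop()
```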

STEPS:

1. Reading the data
2. Cleaning the data
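
A minimal sketch of these two steps, assuming a hypothetical CSV file named data.csv with a header row:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-and-clean").getOrCreate()

# Step 1: reading the data ("data.csv" is a placeholder path).
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Step 2: cleaning the data: drop duplicate rows and rows containing nulls.
clean_df = df.dropDuplicates().dropna()

clean_df.show(5)
spark.stop()
```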