
Learning PySpark

Data Processing with Python and Apache Spark

Tomasz Drabas

Explore how to process large datasets using Python and Apache Spark. You'll gain practical experience with data collection, manipulation, and distributed processing, focusing on both RDDs and DataFrames. By the end, you'll know how to handle big data workflows efficiently with Spark's RDD and DataFrame APIs.

Packt | Feb 2018 | 148 min

Level: Expert
Categories: Data Engineering, Data Warehousing and Big Data Processing Frameworks, Spark, Python

What You Will Learn

You will build your skills through hands-on exercises that guide you from setting up your Spark environment to processing real data. Step-by-step examples show how to create and manipulate RDDs and DataFrames, perform key transformations, and use SQL for advanced queries. Each concept is reinforced with practical applications.

Key Features

  • Work with RDDs and DataFrames to manage and transform large datasets
  • Read data from files and HDFS, and define schemas programmatically
  • Use Spark SQL to query and analyze distributed data efficiently

Target Audience

Designed for Python developers who want to expand their data processing skills using Apache Spark. If you already know Python and want to work with distributed data or optimize big data workflows, you'll find clear, actionable guidance here. Some prior exposure to Spark is helpful but not required.
