Apache Spark and PySpark for Data Engineering and Big Data
Learn Apache Spark and PySpark to build scalable data pipelines, process big data, and implement effective ML workflows.
Rating: 4.00 (2 reviews)
Students: 80
Content: 46 hours
Last update: Jan 2025
Regular price: $54.99
What you will learn
Understand Big Data Fundamentals: Explain the key concepts of big data and the evolution from Hadoop to Spark.
Learn Spark Architecture: Describe the core components and architecture of Apache Spark, including RDDs, DataFrames, and Datasets.
Set Up Spark: Install and configure Spark in local and standalone modes for development and testing.
Write PySpark Programs: Create and run PySpark applications using Python, including basic operations on RDDs and DataFrames.
Master RDD Operations: Perform transformations and actions on RDDs, such as map, filter, reduce, and groupBy, while leveraging caching and persistence.
Work with SparkContext and SparkSession: Understand their roles and effectively manage them in PySpark applications.
Work with DataFrames: Create, manipulate, and optimize DataFrames for structured data processing.
Run SQL Queries in SparkSQL: Use SparkSQL to query DataFrames and integrate SQL with DataFrame operations.
Handle Various Data Formats: Read and write data in formats such as CSV, JSON, Parquet, and Avro while optimizing data storage with partitioning and bucketing.
Build Data Pipelines: Design and implement batch and real-time data pipelines for data ingestion, transformation, and aggregation.
Learn Spark Streaming Basics: Process real-time data using Spark Streaming, including working with structured streaming and integrating with Kafka.
Optimize Spark Applications: Tune Spark applications for performance by understanding execution models, DAGs, shuffle operations, and memory management.
Leverage Advanced Spark Features: Utilize advanced DataFrame operations, including joins, aggregations, and window functions, for complex data transformations.
Explore Spark Internals: Gain a deep understanding of Spark’s execution model, Catalyst Optimizer, and techniques like broadcasting and partitioning.
Learn Spark MLlib Basics: Build machine learning pipelines using Spark MLlib, applying algorithms like linear regression and logistic regression.
Develop Real-Time Streaming Applications: Implement stateful streaming, handle late data, and manage fault tolerance with checkpointing in Spark Streaming.
Work on Capstone Projects: Design and implement an end-to-end data pipeline, integrating batch and streaming data processing with machine learning.
Prepare for Industry Roles: Apply Spark to real-world use cases, strengthen your resume with Spark skills, and prepare for technical interviews in data and ML engineering.
Udemy ID: 6266561
Course created: 11/2/2024
Course indexed: 12/23/2024
Submitted by: Bot