
Spark Optimization

Course Summary

This Spark Optimization training course covers advanced Spark techniques for tuning applications.

The course begins with a review of Spark, including its architecture, key terms, and using Hadoop with Spark. From there, students will learn about the Spark execution environment and YARN, how to work with the right data formats, and how to manage Spark partitions. The course concludes by exploring Spark physical execution, using the Spark Core API, caching and checkpointing, joins, and optimization.

The course is offered in both Python and Scala.

Purpose
Learn best practices and techniques to optimize Spark Core and Spark SQL code.
Audience
Engineers looking to deepen their Spark knowledge.
Role
Data Engineer - Software Developer
Skill Level
Advanced
Style
Workshops
Duration
3 Days
Related Technologies
Apache Spark | Hadoop

Productivity Objectives
  • Run and tune Spark applications on YARN
  • Work with binary data formats (Avro, Parquet, ORC)
  • Understand the internals of Spark
  • Optimize Spark Core and Spark SQL code
  • Apply best practices when writing Spark Core and Spark SQL code

What You'll Learn:

In the Spark Optimization training course, you'll learn:
  • Spark Overview
    • Logical Architecture
    • Physical Architecture of Spark
    • Common Concepts and Terms in Spark
    • Ways to build applications on Spark
    • Spark with Hadoop
  • Understanding Spark Execution Environment - YARN
    • About YARN
    • Why YARN
    • Architecture of YARN
    • YARN UI and Commands
    • Internals of YARN
    • Executing Spark applications on YARN
    • Troubleshooting and Debugging Spark applications on YARN
    • Optimizing Application Performance
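
To ground the YARN topics above, here is what a typical submission to YARN in cluster mode looks like. The resource values and queue name are illustrative placeholders, not recommendations; appropriate settings depend on your cluster.

```shell
# Illustrative only: submit a PySpark application to YARN in cluster mode.
# Executor counts, cores, memory, and the queue name are example values.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --conf spark.yarn.queue=default \
  my_app.py
```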
  • Working with the Right Data Format
    • Why Data Formats are important for optimization
    • Key Data Formats
    • Comparisons: which format to choose, and when?
    • Working with Avro
    • Working with Parquet
    • Working with ORC
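
The row-versus-columnar distinction behind these formats can be sketched in plain Python (no Spark, no real file formats). In a row layout (as in Avro), reading one field still touches every record; in a columnar layout (as in Parquet or ORC), a query can scan just the column it needs.

```python
# Plain-Python sketch: row-oriented vs column-oriented layouts.
# The data and field names are invented for illustration.

rows = [  # row-oriented: one record at a time, like an Avro file
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.0},
    {"id": 3, "name": "c", "amount": 30.0},
]

columns = {  # column-oriented: one list per field, like a Parquet column chunk
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "amount": [10.0, 20.0, 30.0],
}

# Summing "amount" from rows deserializes every full record;
# from columns it touches only the one list it needs.
total_from_rows = sum(r["amount"] for r in rows)
total_from_columns = sum(columns["amount"])
assert total_from_rows == total_from_columns == 60.0
```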
  • Dealing with Spark Partitions
    • How Spark determines number of Partitions
    • Things to keep in mind when determining partition counts
    • Small Partitions Problem
    • Diagnosing & Handling Post-Filtering Issues (Skew)
    • Repartition vs Coalesce
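
The repartition-versus-coalesce distinction can be sketched in plain Python (no Spark required): coalesce merges existing partitions without moving individual records, so it avoids a shuffle but cannot fix skew; repartition redistributes every record by hash, which rebalances sizes at the cost of a full shuffle in real Spark.

```python
# Plain-Python sketch; partition contents are invented for illustration.

def coalesce(partitions, n):
    """Merge existing partitions down to n; records keep their grouping."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)      # no per-record redistribution
    return merged

def repartition(records, n):
    """Redistribute every record by hash into n balanced partitions."""
    parts = [[] for _ in range(n)]
    for r in records:
        parts[hash(r) % n].append(r)    # full shuffle in real Spark
    return parts

skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]       # one oversized partition
coalesced = coalesce(skewed, 2)                     # still skewed: sizes 7 and 2
rebalanced = repartition([r for p in skewed for r in p], 3)
assert [len(p) for p in coalesced] == [7, 2]
assert sorted(len(p) for p in rebalanced) == [3, 3, 3]
```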
  • Spark Physical Execution
    • Spark Core Plan
    • Modes of Execution
    • YARN Client vs YARN Cluster
    • Standalone Mode
    • Physical Execution on Cluster
    • Narrow vs Wide Dependency
    • Spark UI
    • Executor Memory Architecture
    • Key Properties
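
The executor memory architecture topic can be made concrete with back-of-envelope arithmetic for Spark's unified memory model (Spark 1.6 and later). The defaults assumed here — 300 MB reserved memory, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5 — should be checked against the documentation for your Spark version.

```python
# Back-of-envelope sketch of Spark's unified memory model.
# Assumes the documented defaults; verify against your version's docs.

RESERVED_MB = 300  # memory Spark reserves off the top of the heap

def executor_memory_split(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction     # shared execution + storage pool
    storage = unified * storage_fraction   # storage share (evictable)
    execution = unified - storage          # shuffle/join/sort working memory
    user = usable - unified                # user data structures, UDF objects
    return unified, storage, execution, user

# Example: a 4 GB executor heap.
unified, storage, execution, user = executor_memory_split(4096)
assert round(unified, 1) == 2277.6   # (4096 - 300) * 0.6
assert round(storage, 1) == 1138.8
assert round(user, 1) == 1518.4
```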
  • Effective Development Using Spark Core API
    • Use of groupByKey and reduceByKey
    • Using the right datatype in RDD
    • How to ensure memory is utilized effectively?
    • Performing Data Validation in an optimal manner
    • Use of mapPartitions
    • Partitioning Strategies
    • Hash Partitioner
    • Use of Range Partitioner
    • Writing and plugging custom partitioner
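
The groupByKey-versus-reduceByKey point above can be sketched in plain Python: with reduceByKey, Spark combines values within each partition before the shuffle (a map-side combine), so far fewer records cross the network than with groupByKey, which shuffles every pair as-is. The partition layout here is invented for illustration.

```python
# Plain-Python sketch of map-side combine; not real Spark API calls.
from collections import defaultdict

partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# groupByKey-style: every (key, value) pair is shuffled unchanged.
shuffled_group = [kv for part in partitions for kv in part]   # 6 records

# reduceByKey-style: combine within each partition first.
shuffled_reduce = []
for part in partitions:
    local = defaultdict(int)
    for k, v in part:
        local[k] += v                                          # map-side combine
    shuffled_reduce.extend(local.items())                      # 4 records

final = defaultdict(int)
for k, v in shuffled_reduce:
    final[k] += v
assert dict(final) == {"a": 3, "b": 3}
assert len(shuffled_reduce) < len(shuffled_group)
```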
  • Caching and Checkpointing
    • When to Cache?
    • How Caching helps?
    • Caching Strategies
    • How Spark plans change when Caching is on
    • Caching on Spark UI
    • Role of Alluxio
    • Checkpointing
    • How Caching is different from Checkpointing
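
The motivation for caching can be sketched in plain Python: without caching, each action on a lineage recomputes the upstream transformations, while caching materializes the result once for reuse. The counter below stands in for Spark recomputing a partition; it is an analogy, not Spark's actual mechanism.

```python
# Plain-Python analogy for Spark caching; names are invented.
compute_calls = 0

def expensive_transform(data):
    global compute_calls
    compute_calls += 1          # stands in for re-reading + re-transforming
    return [x * 2 for x in data]

source = [1, 2, 3]

# No cache: two "actions" trigger two full recomputations of the lineage.
_ = sum(expensive_transform(source))
_ = max(expensive_transform(source))
assert compute_calls == 2

# With "caching": compute once, then both actions reuse the result.
cached = expensive_transform(source)
_ = sum(cached)
_ = max(cached)
assert compute_calls == 3       # only one extra computation for two actions
```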
  • Joins
    • Why optimizing joins is important
    • Types of Joins
    • Quick Recap of MapReduce MapSide Joins
    • Broadcasting
    • Bucketing
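
Broadcasting can be sketched in plain Python: when one side of a join is small, ship it whole to every task as a lookup table and join the large side locally, so the large table is never shuffled. This mirrors a map-side join; the table contents are invented for illustration.

```python
# Plain-Python sketch of a broadcast (map-side) join; data is invented.

small_dim = {"US": "United States", "FR": "France"}   # broadcast to all tasks

large_fact_partitions = [                              # stays partitioned
    [("US", 100), ("FR", 200)],
    [("US", 300), ("DE", 400)],                        # "DE" has no match
]

joined = []
for partition in large_fact_partitions:                # each task joins locally
    for country_code, amount in partition:
        name = small_dim.get(country_code)             # hash lookup, no shuffle
        if name is not None:                           # inner-join semantics
            joined.append((name, amount))

assert joined == [("United States", 100), ("France", 200),
                  ("United States", 300)]
```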
  • Spark SQL Optimization
    • Dataframes vs Datasets
    • About Tungsten
    • Data Partitioning
    • Query Optimizer: Catalyst Optimizer
    • Debugging Spark Queries
    • Explain Plan
    • Partitioning & Bucketing in Spark SQL
    • Best Practices for writing Spark SQL code
    • Spark SQL with Binary Data formats
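
One rewrite the Catalyst optimizer performs — predicate pushdown — can be sketched in plain Python: filtering before a join produces the same answer as filtering after, while processing far fewer rows. The tables and row counts below are invented for illustration.

```python
# Plain-Python sketch of predicate pushdown; not Catalyst's actual machinery.

orders = [(i, "US" if i % 2 == 0 else "FR") for i in range(100)]  # (id, country)
users = [("US", "us-region"), ("FR", "fr-region")]                 # (country, region)

def join(left, right):
    return [(oid, c, region) for (oid, c) in left
                             for (rc, region) in right if c == rc]

# Naive plan: join everything, then filter.
naive = [row for row in join(orders, users) if row[1] == "US"]

# Pushed-down plan: filter first, then join a much smaller input.
pushed = join([o for o in orders if o[1] == "US"], users)

assert naive == pushed            # same answer, far fewer rows joined
assert len(pushed) == 50
```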

“I appreciated the instructor's technique of writing live code examples rather than using fixed slide decks to present the material.”

VMware
