
Spark Optimization

Course Summary

This Spark Optimization training course covers advanced Spark techniques for tuning applications.

The course begins with a review of Spark, including its architecture, key terms, and using Hadoop with Spark. From there, students will learn about the Spark execution environment and YARN, how to work with the right data formats, and how to manage Spark partitions. The course concludes by exploring Spark physical execution, using the Spark Core API, caching and checkpointing, joins, and optimization.

The course is offered in both Python and Scala.

Purpose
Learn best practices and techniques to optimize Spark Core and Spark SQL code.
Audience
Engineers looking to deepen their Spark knowledge.
Role
Data Engineer - Software Developer
Skill Level
Advanced
Style
Workshops
Duration
3 Days
Related Technologies
Apache Spark | Hadoop

Productivity Objectives
  • Run and tune Spark applications on YARN
  • Work with binary data formats (Avro, Parquet, ORC)
  • Understand the internals of Spark
  • Optimize Spark Core and Spark SQL code
  • Apply best practices when writing Spark Core and Spark SQL code

What You'll Learn:

In the Spark Optimization training course, you'll learn:
  • Spark Overview
    • Logical Architecture
    • Physical Architecture of Spark
    • Common Concepts and Terms in Spark
    • Ways to build applications on Spark
    • Spark with Hadoop
  • Understanding Spark Execution Environment - YARN
    • About YARN
    • Why YARN
    • Architecture of YARN
    • YARN UI and Commands
    • Internals of YARN
    • Executing Spark applications on YARN
    • Troubleshooting and Debugging Spark applications on YARN
    • Optimizing Application Performance
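
To ground the YARN topics above, here is what a typical submission to YARN in cluster mode looks like. The resource values and queue name are illustrative placeholders, not recommendations; appropriate settings depend on your cluster.

```shell
# Illustrative only: submit a PySpark application to YARN in cluster mode.
# Executor counts, cores, memory, and the queue name are example values.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --conf spark.yarn.queue=default \
  my_app.py
```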
  • Working with the Right Data Format
    • Why Data Formats are important for optimization
    • Key Data Formats
    • Comparisons: which format to choose, and when?
    • Working with Avro
    • Working with Parquet
    • Working with ORC
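
The row-versus-columnar distinction behind these formats can be sketched in plain Python (no Spark, no real file formats). In a row layout (as in Avro), reading one field still touches every record; in a columnar layout (as in Parquet or ORC), a query can scan just the column it needs.

```python
# Plain-Python sketch: row-oriented vs column-oriented layouts.
# The data and field names are invented for illustration.

rows = [  # row-oriented: one record at a time, like an Avro file
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.0},
    {"id": 3, "name": "c", "amount": 30.0},
]

columns = {  # column-oriented: one list per field, like a Parquet column chunk
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "amount": [10.0, 20.0, 30.0],
}

# Summing "amount" from rows deserializes every full record;
# from columns it touches only the one list it needs.
total_from_rows = sum(r["amount"] for r in rows)
total_from_columns = sum(columns["amount"])
assert total_from_rows == total_from_columns == 60.0
```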
  • Dealing with Spark Partitions
    • How Spark determines number of Partitions
    • Things to keep in mind when determining partition counts
    • Small Partitions Problem
    • Diagnosing & Handling Post-Filtering Issues (Skew)
    • Repartition vs Coalesce
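
The repartition-versus-coalesce distinction can be sketched in plain Python (no Spark required): coalesce merges existing partitions without moving individual records, so it avoids a shuffle but cannot fix skew; repartition redistributes every record by hash, which rebalances sizes at the cost of a full shuffle in real Spark.

```python
# Plain-Python sketch; partition contents are invented for illustration.

def coalesce(partitions, n):
    """Merge existing partitions down to n; records keep their grouping."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)      # no per-record redistribution
    return merged

def repartition(records, n):
    """Redistribute every record by hash into n balanced partitions."""
    parts = [[] for _ in range(n)]
    for r in records:
        parts[hash(r) % n].append(r)    # full shuffle in real Spark
    return parts

skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]       # one oversized partition
coalesced = coalesce(skewed, 2)                     # still skewed: sizes 7 and 2
rebalanced = repartition([r for p in skewed for r in p], 3)
assert [len(p) for p in coalesced] == [7, 2]
assert sorted(len(p) for p in rebalanced) == [3, 3, 3]
```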
  • Spark Physical Execution
    • Spark Core Plan
    • Modes of Execution
    • YARN Client vs YARN Cluster
    • Standalone Mode
    • Physical Execution on Cluster
    • Narrow vs Wide Dependency
    • Spark UI
    • Executor Memory Architecture
    • Key Properties
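
The executor memory architecture topic can be made concrete with back-of-envelope arithmetic for Spark's unified memory model (Spark 1.6 and later). The defaults assumed here — 300 MB reserved memory, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5 — should be checked against the documentation for your Spark version.

```python
# Back-of-envelope sketch of Spark's unified memory model.
# Assumes the documented defaults; verify against your version's docs.

RESERVED_MB = 300  # memory Spark reserves off the top of the heap

def executor_memory_split(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction     # shared execution + storage pool
    storage = unified * storage_fraction   # storage share (evictable)
    execution = unified - storage          # shuffle/join/sort working memory
    user = usable - unified                # user data structures, UDF objects
    return unified, storage, execution, user

# Example: a 4 GB executor heap.
unified, storage, execution, user = executor_memory_split(4096)
assert round(unified, 1) == 2277.6   # (4096 - 300) * 0.6
assert round(storage, 1) == 1138.8
assert round(user, 1) == 1518.4
```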
  • Effective Development Using Spark Core API
    • Use of groupByKey and reduceByKey
    • Using the right datatype in RDD
    • How to ensure memory is utilized effectively?
    • Performing Data Validation in an optimal manner
    • Use of mapPartitions
    • Partitioning Strategies
    • Hash Partitioner
    • Use of Range Partitioner
    • Writing and plugging custom partitioner
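
The groupByKey-versus-reduceByKey point above can be sketched in plain Python: with reduceByKey, Spark combines values within each partition before the shuffle (a map-side combine), so far fewer records cross the network than with groupByKey, which shuffles every pair as-is. The partition layout here is invented for illustration.

```python
# Plain-Python sketch of map-side combine; not real Spark API calls.
from collections import defaultdict

partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# groupByKey-style: every (key, value) pair is shuffled unchanged.
shuffled_group = [kv for part in partitions for kv in part]   # 6 records

# reduceByKey-style: combine within each partition first.
shuffled_reduce = []
for part in partitions:
    local = defaultdict(int)
    for k, v in part:
        local[k] += v                                          # map-side combine
    shuffled_reduce.extend(local.items())                      # 4 records

final = defaultdict(int)
for k, v in shuffled_reduce:
    final[k] += v
assert dict(final) == {"a": 3, "b": 3}
assert len(shuffled_reduce) < len(shuffled_group)
```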
  • Caching and Checkpointing
    • When to Cache?
    • How Caching helps?
    • Caching Strategies
    • How Spark plans change when Caching is on
    • Caching on Spark UI
    • Role of Alluxio
    • Checkpointing
    • How Caching is different from Checkpointing
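
The motivation for caching can be sketched in plain Python: without caching, each action on a lineage recomputes the upstream transformations, while caching materializes the result once for reuse. The counter below stands in for Spark recomputing a partition; it is an analogy, not Spark's actual mechanism.

```python
# Plain-Python analogy for Spark caching; names are invented.
compute_calls = 0

def expensive_transform(data):
    global compute_calls
    compute_calls += 1          # stands in for re-reading + re-transforming
    return [x * 2 for x in data]

source = [1, 2, 3]

# No cache: two "actions" trigger two full recomputations of the lineage.
_ = sum(expensive_transform(source))
_ = max(expensive_transform(source))
assert compute_calls == 2

# With "caching": compute once, then both actions reuse the result.
cached = expensive_transform(source)
_ = sum(cached)
_ = max(cached)
assert compute_calls == 3       # only one extra computation for two actions
```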
  • Joins
    • Why optimizing joins is important
    • Types of Joins
    • Quick Recap of MapReduce MapSide Joins
    • Broadcasting
    • Bucketing
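
Broadcasting can be sketched in plain Python: when one side of a join is small, ship it whole to every task as a lookup table and join the large side locally, so the large table is never shuffled. This mirrors a map-side join; the table contents are invented for illustration.

```python
# Plain-Python sketch of a broadcast (map-side) join; data is invented.

small_dim = {"US": "United States", "FR": "France"}   # broadcast to all tasks

large_fact_partitions = [                              # stays partitioned
    [("US", 100), ("FR", 200)],
    [("US", 300), ("DE", 400)],                        # "DE" has no match
]

joined = []
for partition in large_fact_partitions:                # each task joins locally
    for country_code, amount in partition:
        name = small_dim.get(country_code)             # hash lookup, no shuffle
        if name is not None:                           # inner-join semantics
            joined.append((name, amount))

assert joined == [("United States", 100), ("France", 200),
                  ("United States", 300)]
```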
  • Spark SQL Optimization
    • Dataframes vs Datasets
    • About Tungsten
    • Data Partitioning
    • Query Optimizer: Catalyst Optimizer
    • Debugging Spark Queries
    • Explain Plan
    • Partitioning & Bucketing in Spark SQL
    • Best Practices for writing Spark SQL code
    • Spark SQL with Binary Data formats
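
One rewrite the Catalyst optimizer performs — predicate pushdown — can be sketched in plain Python: filtering before a join produces the same answer as filtering after, while processing far fewer rows. The tables and row counts below are invented for illustration.

```python
# Plain-Python sketch of predicate pushdown; not Catalyst's actual machinery.

orders = [(i, "US" if i % 2 == 0 else "FR") for i in range(100)]  # (id, country)
users = [("US", "us-region"), ("FR", "fr-region")]                 # (country, region)

def join(left, right):
    return [(oid, c, region) for (oid, c) in left
                             for (rc, region) in right if c == rc]

# Naive plan: join everything, then filter.
naive = [row for row in join(orders, users) if row[1] == "US"]

# Pushed-down plan: filter first, then join a much smaller input.
pushed = join([o for o in orders if o[1] == "US"], users)

assert naive == pushed            # same answer, far fewer rows joined
assert len(pushed) == 50
```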

“I appreciated the instructor's technique of writing live code examples rather than using fixed slide decks to present the material.”

VMware
