Spark Optimization

Learn to work with Apache Spark the right way

The Spark Optimization training course is an advanced-level course on tuning Spark applications. The course begins with a review of Spark, including its architecture, key terminology, and using Hadoop with Spark. From there, students learn about the Spark execution environment and YARN, how to choose the right data format, and how to deal with Spark partitions. The course then explores Spark physical execution, the Spark Core API, caching and checkpointing, joins, and optimization. It is a practical course, focused on giving participants hands-on experience tuning Spark applications.

The course is offered in Python or Scala. Java is possible but requires additional effort, as Python and Scala are more commonly used with Spark.

Course Summary

Purpose: 
Learn best practices and techniques to optimize Spark Core and Spark SQL code.
Audience: 
Engineers looking to deepen their Spark skills
Skill Level: 
Advanced
Learning Style: 
Workshop

Workshops are instructor-led, lab-intensive sessions focused on the practical application of technologies through a project-related lab. In contrast to seminars, they deliver the highest level of knowledge transfer of any format: think wide (breadth) and deep (depth).

Duration: 
3 Days
Productivity Objectives: 
  • Run and tune Spark applications on YARN
  • Work with binary data formats (Avro, Parquet, ORC)
  • Explain the internals of Spark
  • Optimize Spark Core and Spark SQL code
  • Discuss best practices for writing Spark Core and Spark SQL code

What You'll Learn

In the Spark Optimization training course you’ll learn:

  • Spark Overview
    • Logical Architecture
    • Physical Architecture of Spark
    • Common Concepts and Terms in Spark
    • Ways to build applications on Spark
    • Spark with Hadoop
  • Understanding Spark Execution Environment – YARN
    • About YARN
    • Why YARN
    • Architecture of YARN
    • YARN UI and Commands
    • Internals of YARN
    • Executing Spark Applications on YARN
    • Troubleshooting and Debugging Spark applications on YARN
    • Optimizing Application Performance
  • Working with the Right Data Format
    • Why Data Formats Matter for Optimization
    • Key Data Formats
    • Comparisons: Which One to Choose When?
    • Working with Avro
    • Working with Parquet
    • Working with ORC
  • Dealing with Spark Partitions
    • How Spark Determines the Number of Partitions
    • Things to Keep in Mind When Determining Partitions
    • Small Partitions Problem
    • Diagnosing & Handling Post Filtering Issues (Skewness)
    • Repartition vs Coalesce
  • Spark Physical Execution
    • Spark Core Plan
    • Modes of Execution
    • YARN Client vs YARN Cluster
    • Standalone Mode
    • Physical Execution on Cluster
    • Narrow vs Wide Dependency
    • Spark UI
    • Executor Memory Architecture
    • Key Properties
  • Effective Development using Spark Core API
    • Use of groupByKey and reduceByKey
    • Using the Right Data Types in RDDs
    • Ensuring Memory Is Utilized Effectively
    • Performing Data Validation in an Optimal Manner
    • Use of mapPartitions
    • Partitioning Strategies
    • Hash Partitioner
    • Use of Range Partitioner
    • Writing and plugging custom partitioner
  • Caching and Checkpointing
    • When to Cache?
    • How Caching Helps
    • Caching Strategies
    • How the Spark Plan Changes When Caching Is On
    • Caching on Spark UI
    • Role of Alluxio
    • Checkpointing
    • How Caching Differs from Checkpointing
  • Joins
    • Why Optimizing Joins Is Important
    • Types of Joins
    • Quick Recap of MapReduce MapSide Joins
    • Broadcasting
    • Bucketing
  • Spark SQL Optimization
    • DataFrames vs Datasets
    • About Tungsten
    • Data Partitioning
    • Query Optimizer: Catalyst Optimizer
    • Debugging Spark Queries
    • Explain Plan
    • Partitioning & Bucketing in Spark SQL
    • Best Practices for writing Spark SQL code
    • Spark SQL with Binary Data formats
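To give a flavor of the partitioning material above, here is a plain-Python sketch of hash partitioning, the scheme behind Spark's HashPartitioner: a key lands in partition hash(key) mod numPartitions. No Spark is required to run it, and the function names are illustrative, not Spark API.

```python
# Plain-Python sketch of hash partitioning (the idea behind Spark's
# HashPartitioner). Illustrative names only, not the Spark API.

def partition_for(key, num_partitions: int) -> int:
    """Pick the partition a key hashes to.

    Python's % already yields a non-negative result for a positive
    modulus; Spark's Scala code uses an explicit nonNegativeMod helper
    to get the same effect.
    """
    return hash(key) % num_partitions

def partition_records(records, num_partitions: int):
    """Distribute (key, value) pairs across partitions, as a shuffle would."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions
```

Every record with a given key maps to the same partition, which is what makes per-key operations possible after a shuffle. It is also why a single hot key (skew) can overload one partition while the others sit idle, the "post-filtering" problem the course diagnoses.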
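The groupByKey-vs-reduceByKey item in the Core API module hinges on one fact: reduceByKey combines values within each partition before the shuffle, so far less data crosses the network. A plain-Python sketch of that map-side combine (illustrative names, not the Spark API):

```python
# Why reduceByKey ships less data than groupByKey: it pre-aggregates
# within each partition (a "map-side combine") before anything is
# shuffled. Illustrative sketch only, not the Spark API.

def group_by_key_shuffle(partitions):
    """groupByKey-style: every (key, value) pair crosses the shuffle."""
    return [pair for part in partitions for pair in part]

def reduce_by_key_shuffle(partitions, combine):
    """reduceByKey-style: combine per partition first, then shuffle."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = combine(local[key], value) if key in local else value
        # At most one record per key per partition reaches the shuffle.
        shuffled.extend(local.items())
    return shuffled
```

With two partitions of three records each, groupByKey-style shuffles all six records, while reduceByKey-style shuffles one partial sum per key per partition; the gap widens as the number of values per key grows.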
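Broadcasting, listed under Joins above, avoids shuffling the large side entirely: a small lookup table is copied to every task and the join happens map-side. A plain-Python sketch of that broadcast hash join (illustrative names and sample data, not the Spark API):

```python
# Sketch of a broadcast (map-side) join: the small table is copied to
# every partition of the large table, so the large side never shuffles.
# In Spark SQL this corresponds to a broadcast hash join.

def broadcast_join(large_partitions, small_table):
    """Inner-join each partition of the large side against a broadcast dict."""
    joined_partitions = []
    for part in large_partitions:
        joined = [
            (key, (left_val, small_table[key]))
            for key, left_val in part
            if key in small_table  # inner join: drop unmatched keys
        ]
        joined_partitions.append(joined)
    return joined_partitions

# Hypothetical sample data: orders partitioned by task, customers small
# enough to broadcast to every task.
orders = [[(1, "order-a"), (2, "order-b")], [(2, "order-c"), (9, "order-d")]]
customers = {1: "alice", 2: "bob"}
```

Because each task only reads its own partition plus the broadcast copy, no repartitioning of the large table is needed; this only pays off when the small side genuinely fits in each executor's memory, which is why broadcast thresholds matter.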

Get Custom Training Quote

We'll work with you to design a custom Spark Optimization training program that meets your specific needs: a plan that works for you, your team, and your budget, guaranteed.

Learn More

Chat with one of our Program Managers from our Boulder, Colorado office to discuss various training options.

DevelopIntelligence has been in the technical/software development learning and training industry for nearly 20 years. We’ve provided learning solutions to more than 48,000 engineers, across 220 organizations worldwide.

Need help finding the right learning solution? Call us: 877-629-5631