Advanced Apache Spark

Course Summary

The Advanced Apache Spark training course is designed to explore Apache Spark in depth.

The course begins with a review of core Apache Spark concepts, followed by a lesson on understanding Spark internals for performance. Next, it discusses the new features of Spark 2 and how to use them. The course concludes with lessons on advanced Spark SQL and Spark Streaming, high-performance Spark applications, and best practices.

Purpose
Learn how to use Spark internals for working with NoSQL databases, as well as for debugging and troubleshooting.
Audience
Developers who have taken the introduction to Spark or who have equivalent experience.
Role
Data Engineer - Data Scientist - Software Developer
Skill Level
Intermediate
Style
Fast Track - Targeted Topic - Workshops
Duration
4 Days
Related Technologies
Apache Spark | NoSQL | Apache

 

Productivity Objectives
  • Apply Apache Spark fundamentals to gain a deeper understanding of Spark internals
  • Identify operational tweaks that get the maximum performance from Spark
  • Describe how to use GraphX and MLlib for machine learning

What You'll Learn:

In the Advanced Apache Spark training course, you'll learn:
  • Review of core Apache Spark concepts
    • How Spark works
    • RDD Fundamentals
    • SparkSQL and DataFrames (see the first sketch after this outline)
    • Spark Streaming concepts
    • Machine Learning basics
  • Understanding Spark Internals for Performance
    • Schedulers, jobs, and tasks
    • Data structures, data sets, and data lakes
    • Shuffle and performance
    • Understanding data sources and partitions (see the partitioning sketch after this outline)
    • Reads, writes, and performance
  • New Features of Spark 2
    • API Stability
    • Core and Spark SQL changes
    • Changes to packaging and operations
  • Working with Spark
    • Debugging/troubleshooting Spark apps
    • Developing data workflows
    • Automated Spark builds using Maven
  • Clustering with Spark
    • Running a Spark cluster
    • Understanding cluster resource requirements
    • Managing memory on executors and workers
    • Managing memory/cores across a Spark cluster
    • Performance tuning
    • Clarifying best practices
  • Spark Integration
    • Implementing Spark on DataStax, Hortonworks, etc.
    • Integrating with Cassandra
    • Integrating with Kafka (see the Kafka sketch after this outline)
    • Integrating with Elasticsearch
    • Integrating with other compatible NoSQL implementations (as desired)
  • Machine Learning with Spark
    • Common algorithms
    • Commonly used algorithms with Scala
    • Machine learning libraries: MLlib, H2O (see the MLlib sketch after this outline)
    • Custom algorithms creation
  • Advanced Spark SQL and Spark Streaming
    • Leveraging the Spark 2 API (SparkSession, etc.)
    • Developing with Spark DataFrames
    • Writing solid Spark jobs
    • Understanding when to use Spark and when not to
  • High Performance Spark applications
    • Performance tuning process
    • Performance tuning metrics
    • SQL performance tuning
    • High-performance caching strategies
    • Cluster resource requirements
    • Creating fault tolerance
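
To ground the core-concepts review above, here is a minimal, self-contained Scala sketch of the Spark 2 entry point (SparkSession), DataFrames, Spark SQL, and the underlying RDD API. The application name, local master, and sample data are illustrative assumptions, not course material.

    import org.apache.spark.sql.SparkSession

    object CoreConceptsSketch {
      def main(args: Array[String]): Unit = {
        // SparkSession is the single entry point introduced in Spark 2,
        // replacing the separate SQLContext and HiveContext.
        val spark = SparkSession.builder()
          .appName("core-concepts-sketch")   // hypothetical app name
          .master("local[*]")                // local mode, for experimentation only
          .getOrCreate()
        import spark.implicits._

        // A DataFrame built from an in-memory sequence of invented rows.
        val sales = Seq(("us", 100.0), ("de", 80.0), ("us", 42.0)).toDF("country", "amount")

        // Spark SQL: register a temporary view and query it.
        sales.createOrReplaceTempView("sales")
        spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()

        // Drop down to the underlying RDD when low-level control is needed.
        val perPartitionCounts = sales.rdd.mapPartitions(rows => Iterator(rows.size)).collect()
        println(perPartitionCounts.mkString(", "))

        spark.stop()
      }
    }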
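
The internals topics above revolve around how shuffles and partition counts drive job cost. The partitioning sketch below is one assumed way to surface that: it lowers spark.sql.shuffle.partitions for a small local data set, inspects the physical plan for shuffle (Exchange) nodes, and contrasts coalesce with repartition.

    import org.apache.spark.sql.SparkSession

    object ShufflePartitionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("shuffle-partition-sketch")
          .master("local[*]")
          // The default of 200 shuffle partitions is rarely right for tiny local data;
          // 8 here is an arbitrary illustrative value.
          .config("spark.sql.shuffle.partitions", "8")
          .getOrCreate()
        import spark.implicits._

        // One million synthetic rows bucketed into 16 groups.
        val events = spark.range(0, 1000000).withColumn("bucket", ($"id" % 16).cast("int"))

        // groupBy forces a shuffle; the number of post-shuffle tasks is governed
        // by spark.sql.shuffle.partitions.
        val counts = events.groupBy("bucket").count()
        counts.explain()   // look for Exchange nodes in the physical plan

        // coalesce() reduces partitions without a full shuffle (useful before writing
        // small results), whereas repartition() always shuffles.
        println(counts.coalesce(1).rdd.getNumPartitions)

        spark.stop()
      }
    }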
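
As one example of the integration topics, this Kafka sketch reads a topic with Structured Streaming and echoes the records to the console. The broker address, topic name, and checkpoint path are placeholders, and the spark-sql-kafka connector must be on the classpath for format("kafka") to resolve.

    import org.apache.spark.sql.SparkSession

    object KafkaIntegrationSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-integration-sketch")
          .getOrCreate()

        // Kafka source: broker and topic below are placeholder values.
        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load()

        // Kafka delivers binary key/value columns; cast the value to a string to work with it.
        val lines = raw.selectExpr("CAST(value AS STRING) AS line")

        val query = lines.writeStream
          .format("console")
          .option("checkpointLocation", "/tmp/kafka-sketch-checkpoint")   // placeholder path
          .start()

        query.awaitTermination()
      }
    }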
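
For the machine-learning topics, the MLlib sketch below fits a small pipeline (a VectorAssembler feeding a LogisticRegression) on toy data. The feature names and values are invented purely for illustration.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SparkSession

    object MLlibPipelineSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("mllib-pipeline-sketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Toy training data: a binary label and two numeric features.
        val training = Seq(
          (0.0, 1.1, 0.2),
          (1.0, 3.4, 1.9),
          (0.0, 0.7, 0.1),
          (1.0, 2.8, 2.5)
        ).toDF("label", "f1", "f2")

        // Assemble the raw columns into the single features vector MLlib expects,
        // then fit logistic regression as a two-stage pipeline.
        val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
        val lr = new LogisticRegression().setMaxIter(10)
        val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)

        model.transform(training).select("label", "prediction").show()
        spark.stop()
      }
    }
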
“I appreciated the instructor's technique of writing live code examples rather than using fixed slide decks to present the material.”

VMware

Dive in and learn more

When transforming your workforce, it's important to have expert advice and tailored solutions. We can help. Tell us your unique needs and we'll explore ways to address them.
