Scalable Machine Learning

Course Summary

The Scalable Machine Learning (SML) course gives students hands-on exposure to machine learning at scale. It focuses on using the Hadoop and Spark frameworks to implement SML algorithms in the Scala and Python programming languages.

The course begins with an introduction to SML and why developers use Spark for it. Next, the course dives into data acquisition, data pre-processing for modeling, and working with iterative algorithms. The course concludes with model evaluation, optimization, and deployment.

Purpose
Learn about and build end-to-end SML pipelines for gaining actionable insights.
Audience
Teams needing to gracefully scale up their Machine Learning projects.
Role
Data Engineer - Data Scientist - Software Developer
Skill Level
Intermediate
Style
Hack-a-thon - Learning Spikes - Workshops
Duration
3 Days
Related Technologies
Apache Spark | Hadoop | Python

 

Productivity Objectives
  • Describe the role of Spark in Machine Learning.
  • Apply Machine Learning to massive datasets.
  • Demonstrate experience in Data Acquisition, Processing, Analysis and Modeling using Hadoop and Spark.
  • Evaluate common types of data (e.g. CSV, XML, JSON, Social Media data) for pre-processing and/or building Machine Learning Models using Spark.
  • Train, tune, test and deploy Machine Learning Models.

What You'll Learn:

In the Scalable Machine Learning training course, you'll learn:
  • Introduction to SML
    • What is SML?
    • Why is it required?
    • Key platforms for performing SML
    • SML Project End-to-End Pipeline
    • Spark Introduction
    • Why Spark for SML?
    • Databricks Platform Demo
    • Approaches for scaling scikit-learn code
    • Hands-on Exercise(s): Experiencing the first notebook
  • Why Spark for SML?
    • Problems with Traditional Machine Learning Frameworks
    • Machine Learning at Scale - Various options
    • Iterative Algorithms
    • How does Spark perform well for Iterative Machine Learning Algorithms?
    • Hands-on Exercise(s)
  • SML on Enterprise Platform
    • Quick Recap/Introduction to Hadoop
    • Logical View of Cloudera Distribution
    • Big Data Analytics Pipelines
    • Components in Cloudera Distribution for performing SML
    • Hands-on Exercise(s)
  • Data Acquisition at Scale
    • Acquiring Structured content from Relational Databases
    • Acquiring Semi-structured content from Log Files
    • Acquiring Unstructured content from other key sources like Web
    • Tools for Performing Data acquisition at Scale
    • Sqoop, Flume and Kafka Introduction, use cases and architectures
    • Hands-on Exercise(s)
  • Data Pre-Processing for Modeling
    • Using the Spark Shell
    • Resilient Distributed Datasets (RDDs)
    • Functional Programming with Spark
    • RDD Operations
    • Key-Value Pair RDDs
    • MapReduce and Pair RDD Operations
    • Building and Running a Spark Application
    • Performing Data Validation
    • Data De-Duplication
    • Detecting Outliers
    • Hands-on Exercise(s)
  • Working with Iterative Algorithms
    • Dealing with RDD Infinite Lineages
    • Caching Overview
    • Distributed Persistence
    • Checkpointing of an Iterative Machine Learning Algorithm
    • Hands-on Exercise(s)
  • Spark SQL
    • Introduction
    • DataFrame API
    • Performing ad-hoc query analysis using Spark SQL
    • Hands-on Exercise(s)
  • Spark Machine Learning Using MLlib
    • Spark ML vs Spark MLlib
    • Data types and key terms
    • Feature Extraction
    • Linear Regression using Spark MLlib
    • Hands-on Exercise(s)
  • Spark Machine Learning Using ML
    • Spark ML Overview
    • Transformers and Estimators
    • Pipelines
    • Implementing Decision Trees
    • K-Means Clustering using Spark ML
    • Hands-on Exercise(s)
  • Decision Trees and Random Forest
    • Types - Classification and Regression trees
    • Gini Index, Entropy and Information Gain
    • Building Decision Trees
    • Pruning the trees
    • Prediction using Trees
    • Ensemble Models
    • Bagging and Boosting
    • Advantages of using Random Forest
    • Working with Random Forest
    • Ensemble Learning
    • How ensemble learning works
    • Building models using Bagging
    • Random Forest algorithm
    • Random Forest model building
    • Fine tuning hyper-parameters
    • Hands-on Exercise(s)
  • Model Evaluation, Optimization and Deployment
    • Model Evaluation
    • Optimizing a Model
    • Deploying a Model
    • Best Practices
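
As a taste of the "Gini Index, Entropy and Information Gain" topic in the outline above, here is a minimal sketch in plain Python of the standard textbook definitions that decision trees use to score a split. It is not taken from the course materials: the function names (`gini`, `entropy`, `information_gain`) and the toy labels are hypothetical, chosen only for illustration, and a Spark-based implementation would compute the same quantities in a distributed way.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical toy data: a perfectly mixed node and a split that separates it cleanly.
parent = ["yes", "yes", "no", "no"]
clean_split = [["yes", "yes"], ["no", "no"]]

print("Gini impurity of parent:", gini(parent))
print("Entropy of parent:", entropy(parent))
print("Information gain of split:", information_gain(parent, clean_split))
```

A split that separates the classes completely recovers all of the parent's entropy as gain, while a split that leaves each child as mixed as the parent yields a gain of zero.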
“I appreciated the instructor's technique of writing live code examples rather than using fixed slide decks to present the material.”

VMware
