Introduction to Big Data with Spark and Hadoop (Coursera)

Bernard Marr defines Big Data as the digital trace that we are generating in this digital era. In this course, you will learn about the characteristics of Big Data and its application in Big Data Analytics. You will gain an understanding of the features, benefits, limitations, and applications of some of the Big Data processing tools. You’ll explore how Hadoop and Hive help leverage the benefits of Big Data while overcoming some of the challenges it poses.

MOOC List is learner-supported. When you buy through links on our site, we may earn an affiliate commission.

Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hive, a data warehouse system, provides an SQL-like interface for efficiently querying and manipulating large data sets residing in the various databases and file systems that integrate with Hadoop.

Apache Spark is an open-source processing engine built around speed, ease of use, and analytics that gives users new ways to process and make use of big data. In this course, you will discover how to leverage Spark to deliver reliable insights. The course provides an overview of the platform and covers the different components that make up Apache Spark.

In this course, you will also learn about Resilient Distributed Datasets, or RDDs, that enable parallel processing across the nodes of a Spark cluster.

This course is part of multiple programs

This course can be applied to multiple Specializations or Professional Certificates programs. Completing this course will count towards your learning in any of the following programs:

- IBM Data Engineering Professional Certificate

- NoSQL, Big Data, and Spark Foundations Specialization

What You Will Learn

- Deep insight into the impact of Big Data including use cases, tools, and processing methods.

- Knowledge of the Apache Hadoop architecture, ecosystem, and practices, and the use of applications including HDFS, HBase, Spark, and MapReduce.

- Know-how to apply Spark programming basics, including parallel programming basics for DataFrames, data sets, and Spark SQL.

- Proficiency with Spark’s RDDs, data sets, use of Catalyst and Tungsten to optimize SparkSQL, and Spark’s development and runtime environment options.

Syllabus

WEEK 1

What is Big Data?

Begin your acquisition of Big Data knowledge with the most up-to-date definition of Big Data. You’ll explore the impact of Big Data on everyday personal tasks and business transactions with Big Data Use Cases. Learn how Big Data uses Parallel Processing, Scaling, and Data Parallelism. Learn about commonly used Big Data tools. Then, go beyond the hype and explore additional Big Data viewpoints.

WEEK 2

Introduction to the Hadoop Ecosystem

In this module, you'll gain a fundamental understanding of the Apache Hadoop architecture, ecosystem, practices, and commonly used applications including the Hadoop Distributed File System (HDFS), MapReduce, Hive, and HBase. Gain practical skills in this module's lab when you launch a single-node Hadoop cluster using Docker and run MapReduce jobs.
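
As a rough illustration of the MapReduce model mentioned in this module, the sketch below counts words in plain Python, with no Hadoop dependency. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are hypothetical and chosen only for this example; a real Hadoop job distributes these phases across the nodes of a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, mimicking the step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, used only to exercise the sketch.
counts = word_count(["big data big clusters", "data moves to big clusters"])
print(counts)
```

Keeping each phase as a separate function mirrors how the model separates per-record work (map) from per-key aggregation (reduce), which is what allows a framework to run each phase in parallel.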

WEEK 3

Apache Spark

Build your skills when you turn your attention to the popular Apache Spark platform. Explore the attributes and benefits of Apache Spark and distributed computing. You'll gain key insights about functional programming and lambda functions. Explore Resilient Distributed Datasets (RDDs), parallel programming, and resilience in Apache Spark, and relate RDDs and parallel programming to Apache Spark. Dive into additional Apache Spark components and learn how Apache Spark scales with Big Data. Working with Big Data signals the need for working with queries, including structured queries using SQL. Learn about the functions, parts, and benefits of Spark SQL and DataFrame queries, and discover how DataFrames work with Spark SQL.
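
To make the functional programming and lambda function topics above more concrete, here is a minimal plain-Python sketch, assuming a small hypothetical list named `readings`. It uses only built-in `map` and `filter` plus `functools.reduce`; it does not use Spark, although Spark's RDD operations follow a similar style of passing small functions to higher-order operations.

```python
from functools import reduce

# Hypothetical sample values, used only for illustration.
readings = [3, 8, 15, 4, 23]

# map: apply a lambda to every element.
squared = list(map(lambda x: x * x, readings))

# filter: keep only the elements for which the lambda returns True.
large = list(filter(lambda x: x > 50, squared))

# reduce: fold the remaining elements into a single value.
total = reduce(lambda a, b: a + b, large, 0)

print(squared, large, total)
```

Because each lambda depends only on its arguments and changes no shared state, the same operations can in principle be applied to separate chunks of the data independently, which is the property parallel processing relies on.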

WEEK 4

DataFrames and SparkSQL

Learn about Resilient Distributed Datasets (RDDs), their uses in Apache Spark, and RDD transformations and actions. You'll compare the use of datasets with Spark's latest data abstraction, DataFrames. You'll learn to identify and apply basic DataFrame operations. Explore Apache Spark SQL optimization. Learn how Spark SQL and memory optimization benefit from using Catalyst and Tungsten. Learn how to create a table view and apply data aggregation techniques. Fortify your skills in the guided hands-on lab.
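
The difference between RDD transformations and actions can be sketched with a toy class in plain Python. `MiniRDD` below is hypothetical and is not part of Spark; it only imitates the idea that transformations such as `map` and `filter` record work to be done later, while actions such as `collect` and `count` trigger the computation. The helper `traced_double` is likewise an assumption, added to show when the work actually runs.

```python
class MiniRDD:
    """Hypothetical toy stand-in for an RDD: records steps, runs them on demand."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)

    def map(self, fn):
        """Transformation: returns a new MiniRDD with one more recorded step."""
        return MiniRDD(self._data, self._steps + (("map", fn),))

    def filter(self, predicate):
        """Transformation: returns a new MiniRDD with one more recorded step."""
        return MiniRDD(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        """Action: runs every recorded step and returns the results as a list."""
        results = []
        for item in self._data:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results

    def count(self):
        """Action: runs the recorded steps and returns the number of results."""
        return len(self.collect())


calls = []


def traced_double(x):
    """Double a value and record that the function was actually invoked."""
    calls.append(x)
    return x * 2


numbers = MiniRDD(range(1, 6))
pipeline = numbers.map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # no work has run yet

result = pipeline.collect()  # the action triggers the recorded steps
print(calls_before_action, result)
```

Deferring the work until an action is called is what gives an engine the chance to plan and optimize the whole chain of steps before executing it.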

WEEK 5

Development and Runtime Environment Options

Explore how Spark processes the requests that your application submits. Learn how you can track work using the Spark Application UI. Because Spark application work happens on the cluster, you need to be able to identify Apache Spark cluster managers, their components, and their benefits, and know how to connect with each cluster manager and how and when you might want to set up a local, standalone Spark instance. Next, learn about Apache Spark application submission, including use of Spark’s unified interface, ‘spark-submit’: describe and apply options for submitting applications, identify external application dependency management techniques, and list Spark Shell benefits. Developers now have the option of AIOps, so you will also discover how to use Spark within AIOps. Review recommended practices for Spark's static and dynamic configuration options. Round out your development knowledge with insights about Spark on Kubernetes. This module features hands-on Spark labs using IBM Cloud and Kubernetes.

WEEK 6

Monitoring & Tuning

Platforms and applications require monitoring and tuning to manage issues that inevitably happen. In this module, you'll learn about connecting to the Apache Spark user interface web server and using the same UI web server to manage application processes. Identify common Apache Spark application issues. Learn about debugging issues using the application UI and locating related log files. Gain real-world knowledge about how Spark manages memory and processor resources through videos and the available hands-on lab.
