Coursera

Introduction to Big Data with Spark and Hadoop (Coursera)

Offered by IBM.

Bernard Marr defines Big Data as the digital trace that we are generating in this digital era. In this course, you will learn about the characteristics of Big Data and its application in Big Data Analytics. You will gain an understanding of the features, benefits, limitations, and applications of some of the Big Data processing tools. You’ll explore how Hadoop and Hive help leverage the benefits of Big Data while overcoming some of the challenges it poses.

Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hive, a data warehouse software, provides an SQL-like interface to efficiently query and manipulate large data sets residing in various databases and file systems that integrate with Hadoop.
Apache Spark is an open-source processing engine built around speed, ease of use, and analytics that gives users new ways to store and make use of big data. In this course, you will discover how to leverage Spark to deliver reliable insights. The course provides an overview of the platform and the different components that make up Apache Spark.
In this course, you will also learn about Resilient Distributed Datasets, or RDDs, that enable parallel processing across the nodes of a Spark cluster.

This course is part of multiple programs
This course can be applied to multiple Specializations or Professional Certificate programs. Completing it will count towards your learning in any of them.

What You Will Learn

Deep insight into the impact of Big Data including use cases, tools, and processing methods.
Knowledge of the Apache Hadoop architecture, ecosystem, and practices, and the use of applications including HDFS, HBase, Spark, and MapReduce.
Know-how to apply Spark programming basics, including parallel programming basics for DataFrames, data sets, and Spark SQL.
Proficiency with Spark’s RDDs and data sets, the use of Catalyst and Tungsten to optimize Spark SQL, and Spark’s development and runtime environment options.

Syllabus

WEEK 1
What is Big Data?
Begin your acquisition of Big Data knowledge with the most up-to-date definition of Big Data. You’ll explore the impact of Big Data on everyday personal tasks and business transactions with Big Data Use Cases. Learn how Big Data uses Parallel Processing, Scaling, and Data Parallelism. Learn about commonly used Big Data tools. Then, go beyond the hype and explore additional Big Data viewpoints.

WEEK 2
Introduction to the Hadoop Ecosystem
In this module, you'll gain a fundamental understanding of the Apache Hadoop architecture, ecosystem, practices, and commonly used applications, including the Hadoop Distributed File System (HDFS), MapReduce, Hive, and HBase. Gain practical skills in this module's lab, where you launch a single-node Hadoop cluster using Docker and run MapReduce jobs.
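To make the MapReduce model named in this module more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop's API and is not taken from the course materials; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. On a real cluster, the map and reduce steps would run in parallel across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, the step that sits between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["big data", "big spark"])` counts "big" twice and the other two words once each.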

WEEK 3
Apache Spark
Build your skills when you turn your attention to the popular Apache Spark platform. Explore the attributes and benefits of Apache Spark and distributed computing. You'll gain key insights about functional programming and lambda functions. Explore Resilient Distributed Datasets (RDDs), parallel programming, and resilience in Apache Spark, and relate RDDs and parallel programming to Apache Spark. Dive into additional Apache Spark components and learn how Apache Spark scales with Big Data. Working with Big Data signals the need for working with queries, including structured queries using SQL. Learn about the functions, parts, and benefits of Spark SQL and DataFrame queries, and discover how DataFrames work with Spark SQL.
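As a rough illustration of the functional programming and lambda function ideas mentioned above, the sketch below uses only the Python standard library, not Spark. The helper names (`sum_even_squares`, `split`, `partitioned_sum`) are hypothetical. The partitions are processed one after another here; the assumption being illustrated is that, in a system like Spark, each partition could be handled on a different node and the partial results combined afterwards.

```python
from functools import reduce
from typing import List


def sum_even_squares(values: List[int]) -> int:
    """Square each value, keep the even squares, and add them up."""
    squares = map(lambda x: x * x, values)
    evens = filter(lambda x: x % 2 == 0, squares)
    return reduce(lambda a, b: a + b, evens, 0)


def split(values: List[int], n_partitions: int) -> List[List[int]]:
    """Deal the values round-robin into n_partitions smaller lists."""
    return [values[i::n_partitions] for i in range(n_partitions)]


def partitioned_sum(values: List[int], n_partitions: int = 3) -> int:
    """Apply the same function to every partition, then combine the partial sums."""
    partials = [sum_even_squares(part) for part in split(values, n_partitions)]
    return reduce(lambda a, b: a + b, partials, 0)
```

Because the per-partition function has no side effects and addition is associative, `partitioned_sum` gives the same answer as `sum_even_squares` on the whole list, whatever the number of partitions.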

WEEK 4
DataFrames and SparkSQL
Learn about Resilient Distributed Datasets (RDDs), their uses in Apache Spark, and RDD transformations and actions. You'll compare the use of datasets with Spark's latest data abstraction, DataFrames. You'll learn to identify and apply basic DataFrame operations. Explore Apache Spark SQL optimization. Learn how Spark SQL and memory optimization benefit from using Catalyst and Tungsten. Learn how to create a table view and apply data aggregation techniques. Fortify your skills in the guided hands-on lab.

WEEK 5
Development and Runtime Environment Options
Explore how Spark processes the requests that your application submits. Learn how you can track work using the Spark Application UI. Because Spark application work happens on the cluster, you need to be able to identify Apache Spark cluster managers, their components, and their benefits, and know how to connect with each cluster manager, as well as how and when you might want to set up a local, standalone Spark instance. Next, learn about Apache Spark application submission, including the use of Spark’s unified interface, ‘spark-submit’: describe and apply options for submitting applications, identify external application dependency management techniques, and list Spark Shell benefits. Developers now also have the option of AIOps, so discover how to use Spark within AIOps. Review recommended practices for Spark's static and dynamic configuration options. Round out your development knowledge with insights about Spark on Kubernetes. This module features hands-on Spark labs using IBM Cloud and Kubernetes.

WEEK 6
Monitoring & Tuning
Platforms and applications require monitoring and tuning to manage issues that inevitably happen. In this module, you'll learn about connecting to the Apache Spark user interface web server and using the same UI web server to manage application processes. Identify common Apache Spark application issues. Learn about debugging issues using the application UI and locating related log files. Discover and gain real-world knowledge about how Spark manages memory and processor resources through videos and the available hands-on lab.

Related Courses

Coursera

University of Illinois at Urbana-Champaign

Cloud Computing Applications, Part 2: Big Data and Applications in the Cloud (Coursera)

Security & Networking

Welcome to the Cloud Computing Applications course, the second part of a two-course series designed to give you a comprehensive view of the world of Cloud Computing and Big Data! In this second course we continue Cloud Computing Applications by exploring how the Cloud opens up data analytics of huge volumes of data that are static or streamed at high velocity and represent an enormous variety of information. Cloud applications and data analytics represent a disruptive change in the ways that society is informed by, and uses, information.

Aug 3rd 2026

4 Weeks

Cloud Machine Learning Big Data

Coursera

FIA Business School

Cadeia de Suprimentos na Nuvem (Coursera)

Business

Welcome to the Cadeia de Suprimentos na Nuvem (Supply Chain in the Cloud) course. In this course, you will learn how the supply chain can increase a company's value by exploring the various tools available in the cloud to boost the chain's visibility and responsiveness, improving the level of service provided to customers.

Aug 17th 2026

5-12 Weeks

Cloud Machine Learning Big Data

Coursera

IBM

Scalable Machine Learning on Big Data using Apache Spark (Coursera)

Data Science

This course will empower you with the skills to scale data science and machine learning (ML) tasks on Big Data sets using Apache Spark. Most real world machine learning work involves very large data sets that go beyond the CPU, memory and storage limitations of a single computer. Apache Spark is an open source framework that leverages cluster computing and distributed storage to process extremely large data sets in an efficient and cost effective manner. Therefore an applied knowledge of working with Apache Spark is a great asset and potential differentiator for a Machine Learning engineer.

Aug 3rd 2026

4 Weeks

Artificial Intelligence Machine Learning Big Data

Coursera

University of Illinois at Urbana-Champaign

Infonomics II: Business Information Management and Measurement (Coursera)

Business

Even decades into the Information Age, accounting practices still fail to recognize the financial value of information. Moreover, traditional asset management practices fail to recognize information as an asset to be managed with earnest discipline. This has led to a business culture of complacency and to the inability of most organizations to fully leverage available information assets. This second course in the two-part Infonomics series explores how and why to adapt well-honed asset management principles and practices to information, and how to apply accepted and new valuation models to gauge information’s potential and realized economic benefits.

Aug 17th 2026

4 Weeks

Business Big Data Accounting

Coursera

Cloudera

Managing Big Data in Clusters and Cloud Storage (Coursera)

Statistics & Data Analysis Data Science

In this course, you'll learn how to manage big datasets, how to load them into clusters and cloud storage, and how to apply structure to the data so that you can run queries on it using distributed SQL engines like Apache Hive and Apache Impala. You’ll learn how to choose the right data types, storage systems, and file formats based on which tools you’ll use and what performance you need.

Aug 3rd 2026

5-12 Weeks

SQL Big Data Data Management

Coursera

University of California, San Diego

Graph Analytics for Big Data (Coursera)

Statistics & Data Analysis Data Science

Want to understand your data network structure and how it changes under different conditions? Curious to know how to identify closely interacting clusters within a graph? Have you heard of the fast-growing area of graph analytics and want to learn more? This course gives you a broad overview of the field of graph analytics so you can learn new ways to model, store, retrieve and analyze graph-structured data.

Aug 3rd 2026

5-12 Weeks

Big Data Data Analysis Graphs

Coursera

Northwestern University

The Importance of Listening (Coursera)

Marketing & Communication Business

In this second MOOC in the Social Marketing Specialization - "The Importance of Listening" - you will go deep into the Big Data of social and gain a more complete picture of what can be learned from interactions on social sites. You will be amazed at just how much information can be extracted from a single post, picture, or video.

Aug 3rd 2026

4 Weeks

Marketing Big Data Social Media

Coursera

IBM

Introduction to Data Engineering (Coursera)

CS: Information & Technology

This course introduces you to the core concepts, processes, and tools you need to know in order to get a foundational knowledge of data engineering. You will gain an understanding of the modern data ecosystem and the role Data Engineers, Data Scientists, and Data Analysts play in this ecosystem. The Data Engineering Ecosystem includes several different components. It includes disparate data types, formats, and sources of data.

Aug 3rd 2026

4 Weeks

Databases NoSQL SQL

Coursera

FIA Business School

Introdução ao Big Data (Coursera)

Statistics & Data Analysis Data Science

This course is intended for professionals who want an easy way to understand what Big Data is, get to know some Big Data technologies, and see some applications of Analytics, the Internet of Things (IoT), and Big Data. By the end of the course, you will be able to take part in a Big Data project, contributing strategies and steering the project toward the choice of an appropriate data analysis technique.

Aug 10th 2026

4 Weeks

Big Data Applications IoT

Coursera

Rice University

Distributed Programming in Java (Coursera)

CS: Software Engineering CS: Programming

This course teaches learners (industry professionals and students) the fundamental concepts of Distributed Programming in the context of Java 8. Distributed programming enables developers to use multiple nodes in a data center to increase throughput and/or reduce latency of selected applications. By the end of this course, you will learn how to use popular distributed programming frameworks for Java programs, including Hadoop, Spark, Sockets, Remote Method Invocation (RMI), Multicast Sockets, Kafka, Message Passing Interface (MPI), as well as different approaches to combine distribution with multithreading.

Aug 3rd 2026

4 Weeks

Programming Java MPI

Coursera

Georgia Institute of Technology

Applications in Engineering Mechanics (Coursera)

Engineering

This course applies principles learned in my course “Introduction to Engineering Mechanics” to analyze real world engineering structures. You will need to have mastered the engineering fundamentals from that class in order to be successful in this course offering. This course addresses the modeling and analysis of static equilibrium problems with an emphasis on real world engineering systems and problem solving.

Aug 10th 2026

5-12 Weeks

Mechanics Engineering Engineering Mechanics

Coursera

Pohang University of Science and Technology - POSTECH

Introduction and Programming with IoT Boards (Coursera)

CS: Information & Technology CS: Programming

Internet of Things (IoT) is an emerging area of information and communications technology (ICT) involving many disciplines of computer science and engineering including sensors/actuators, communications networking, server platforms, data analytics and smart applications. IoT is considered to be an essential part of the 4th Industrial Revolution along with AI and Big Data. This course will be very useful to senior undergraduate and graduate students as well as engineers who are working in the industry.

Aug 3rd 2026

5-12 Weeks

Programming Big Data Syntax