Coursera

ETL and Data Pipelines with Shell, Airflow and Kafka (Coursera)

Offered by IBM

After taking this course, you will be able to describe two different approaches to converting raw data into analytics-ready data. One approach is the Extract, Transform, Load (ETL) process. The other contrasting approach is the Extract, Load, and Transform (ELT) process. ETL processes apply to data warehouses and data marts. ELT processes apply to data lakes, where the data is transformed on demand by the requesting/calling application.
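The difference in where the transformation happens can be sketched in a few lines of Python. This is only an illustration under assumed data: the sample records, the `transform` helper, and the in-memory `warehouse` and `lake` lists are hypothetical stand-ins, not part of the course material.

```python
# Hypothetical raw records, as they might arrive from a source system.
raw_rows = [
    {"name": " Alice ", "amount": "10.5"},
    {"name": "bob", "amount": "3"},
]


def transform(row):
    """Clean one raw record into an analytics-ready shape."""
    return {"name": row["name"].strip().title(), "amount": float(row["amount"])}


# ETL: transform first, then load the cleaned rows into the "warehouse".
warehouse = [transform(row) for row in raw_rows]

# ELT: load the raw rows into the "lake" unchanged...
lake = list(raw_rows)


def query_lake():
    """...and transform on demand, when an application requests the data."""
    return [transform(row) for row in lake]
```

In both cases the requesting application ends up with the same cleaned records; what differs is whether the stored copy is already transformed (ETL) or still raw (ELT).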

Both ETL and ELT extract data from source systems, move the data through the data pipeline, and store the data in destination systems. During this course, you will experience how ELT and ETL processing differ and identify use cases for both.
You will identify methods and tools used for extracting the data, merging extracted data either logically or physically, and importing data into data repositories. You will also define transformations to apply to source data to make the data credible, contextual, and accessible to data users. You will be able to outline some of the methods for loading data into the destination system, verifying data quality, monitoring load failures, and using recovery mechanisms in case of failure.
Finally, you will complete a shareable final project that enables you to demonstrate the skills you acquired in each module.

Course 11 of 13 in the IBM Data Engineering Professional Certificate

Syllabus

WEEK 1
Data Processing Techniques
ETL or Extract, Transform, and Load processes are used for cases where flexibility, speed, and scalability of data are important. You will explore some key differences between similar processes, ETL and ELT, which include the place of transformation, flexibility, Big Data support, and time-to-insight.
You will learn that there is an increasing demand for access to raw data that drives the evolution from ETL to ELT. Data extraction involves advanced technologies including database querying, web scraping, and APIs. You will also learn that data transformation is about formatting data to suit the application and that data is loaded in batches or streamed continuously.

WEEK 2
ETL & Data Pipelines: Tools and Techniques
Extract, transform, and load (ETL) pipelines can be created with Bash scripts and run on a schedule using cron. Data pipelines move data from one place, or form, to another. Data pipeline processes include scheduling or triggering, monitoring, maintenance, and optimization. Batch pipelines extract and operate on batches of data, whereas streaming data pipelines ingest data packets one by one in rapid succession. In this module, you will learn that streaming pipelines apply when the most current data is needed. You will explore how parallelization and I/O buffers help mitigate bottlenecks. You will also learn how to describe data pipeline performance in terms of latency and throughput.
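As a rough illustration of batching and of the two performance measures, the sketch below groups records into batches and computes latency and throughput from a set of made-up timings. The `batches`, `average_latency`, and `throughput` helpers and the timing values are assumptions for the example, not something the course prescribes.

```python
from statistics import mean


def batches(records, size):
    """Group records into fixed-size batches (the last one may be shorter)."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


# Hypothetical (arrived, done) timestamps in seconds for four records.
timings = [(0.0, 0.5), (1.0, 1.5), (2.0, 2.5), (3.0, 4.0)]


def average_latency(timings):
    """Latency: how long each record takes to pass through the pipeline."""
    return mean(done - arrived for arrived, done in timings)


def throughput(timings):
    """Throughput: records completed per second over the whole run."""
    first_arrival = min(arrived for arrived, _ in timings)
    last_done = max(done for _, done in timings)
    return len(timings) / (last_done - first_arrival)
```

A streaming pipeline corresponds to a batch size of one; larger batches tend to trade higher per-record latency for fewer, bigger processing steps.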

WEEK 3
Building Data Pipelines using Airflow
The key advantage of Apache Airflow's approach to representing data pipelines as DAGs is that they are expressed as code, which makes your data pipelines more maintainable, testable, and collaborative. Tasks, the nodes in a DAG, are created by implementing Airflow's built-in operators.
In this module, you will learn that Apache Airflow has a rich UI that simplifies working with data pipelines. You will explore how to visualize your DAG in graph or tree mode. You will also learn about the key components of a DAG definition file, and that Airflow logs are saved to local file systems and can then be sent to cloud storage, search engines, and log analyzers.
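The idea of a pipeline expressed as a DAG in code can be sketched without Airflow itself. The example below deliberately does not use Airflow's API; it relies on Python's standard-library `graphlib` (Python 3.9+) to run three hypothetical tasks in dependency order, which is the core idea a DAG definition file encodes.

```python
from graphlib import TopologicalSorter

run_log = []  # records the order in which tasks actually ran


def extract():
    run_log.append("extract")


def transform():
    run_log.append("transform")


def load():
    run_log.append("load")


# Tasks are the nodes of the DAG (hypothetical names for illustration).
tasks = {"extract": extract, "transform": transform, "load": load}

# Edges: each task maps to the set of tasks it depends on.
dependencies = {"transform": {"extract"}, "load": {"transform"}}

# Resolve a valid execution order and run each task in turn.
order = list(TopologicalSorter(dependencies).static_order())
for name in order:
    tasks[name]()
```

Because the tasks and their dependencies are plain code, the pipeline can be reviewed, versioned, and tested like any other program. In Airflow, a scheduler and operators take the place of the simple loop shown here.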

WEEK 4
Building Streaming Pipelines using Kafka
Apache Kafka is a very popular open source event streaming platform. An event is a type of data that describes an entity’s observable state updates over time. Popular Kafka service providers include Confluent Cloud, IBM Event Streams, and Amazon MSK. Additionally, the Kafka Streams API is a client library that supports data processing in event streaming pipelines.
In this module, you will learn that the core components of Kafka are brokers, topics, partitions, replications, producers, and consumers. You will explore two special types of processors in the Kafka Streams API stream-processing topology: the source processor and the sink processor. You will also learn about building event streaming pipelines using Kafka.
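How topics, partitions, producers, and consumers relate can be shown with a toy in-memory model. This is not the Kafka client API: the `Topic` and `Consumer` classes are hypothetical, and the CRC32 hash is only a stand-in for Kafka's real partitioner. The sketch shows that events with the same key land in the same partition and that a consumer tracks an offset per partition.

```python
import zlib


class Topic:
    """Toy stand-in for a Kafka topic: a list of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        """Append an event; the same key always maps to the same partition."""
        index = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index


class Consumer:
    """Reads each partition in order, remembering one offset per partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        """Return events not yet seen and advance the stored offsets."""
        events = []
        for i, partition in enumerate(self.topic.partitions):
            events.extend(partition[self.offsets[i]:])
            self.offsets[i] = len(partition)
        return events


toll = Topic("toll", num_partitions=2)
first_partition = toll.produce("car-1", "passed gate A")
toll.produce("truck-7", "passed gate B")
second_partition = toll.produce("car-1", "passed gate C")

consumer = Consumer(toll)
first_poll = consumer.poll()   # all three events
second_poll = consumer.poll()  # nothing new since the last poll
```

Ordering is only guaranteed within a partition, which is why keyed events that must stay in sequence are routed to the same one.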

WEEK 5
Final Assignment
In this final assignment module, you will apply your newly gained knowledge in two hands-on labs: “Creating ETL Data Pipelines using Apache Airflow” and “Creating Streaming Data Pipelines using Kafka”. You will build these pipelines using real-world scenarios.
You will extract, transform, and load data into a CSV file. You will also create a topic named “toll” in Apache Kafka, download and customize a streaming data consumer, and verify that the streaming data has been collected in the database table.

Related Courses

Coursera

Imperial College London

Customising your models with TensorFlow 2 (Coursera)

Data Science

Welcome to this course on Customising your models with TensorFlow 2! In this course you will deepen your knowledge and skills with TensorFlow, in order to develop fully customised deep learning models and workflows for any application. You will use lower level APIs in TensorFlow to develop complex model architectures, fully customised layers, and a flexible data workflow. You will also expand your knowledge of the TensorFlow APIs to include sequence models.

Aug 3rd 2026

5-12 Weeks

Machine Learning Modeling APIs

EdX

AI (Pragmatic AI Labs)

Cloud Data Engineering (edX)

Computer Science

Master data engineering for cloud-native applications through distributed systems, big data, and serverless technologies.

Self Paced

Self-Paced

Cloud Computing ETL Cloud Storage

EdX

AI (Pragmatic AI Labs)

Advanced Data Engineering (edX)

Computer Science

Become an expert in scaling data systems. Master Celery, Airflow, graph databases. Build real-world solutions for massive datasets and complex workflows. Optimize performance at enterprise scale.

Self Paced

Self-Paced

MySQL Scalability Workflow Management

Coursera

IBM

Relational Database Administration (DBA) (Coursera)

Management & Leadership CS: Information & Technology

Ongoing and proactive management is critical to the security and performance of database management systems. Database administration is the function of managing the operational aspects of database systems and maintaining them. Database administrators work to ensure that applications make the most efficient use of databases and that physical resources are used adequately and efficiently.

Jul 27th 2026

5-12 Weeks

Databases Database Management Relational Databases

OpenSAP

SAP

Freedom of Data with SAP Data Hub (OpenSAP)

CS: Information & Technology Data Science

Join this free open online course to learn about SAP Data Hub. The course will provide you with an overview of the architecture as well as the installation/deployment options, and is aimed at application developers, data warehouse modelers, data engineers, data scientists, and technical business analysts.

Self Paced

Self-Paced

Data Big Data Data Integration

Coursera

Whizlabs

AWS Data Processing (Coursera)

Statistics & Data Analysis Data Science

AWS: Data Processing is the second course of the AWS Certified Data Analytics Specialty Specialization. This course focuses on providing data processing solutions. The entire course is designed to teach learners the concepts of EMR and Extract, Transform, and Load. This course also puts emphasis on ETL services and data processing solutions in AWS.

Jul 22nd 2024

3 Weeks

Data Analysis ETL Data Processing

Coursera

University of California, Davis

Healthcare Data Models (Coursera)

Health & Society CS: Information & Technology

Career prospects are bright for those qualified to work in healthcare data analytics. Perhaps you work in data analytics, but are considering a move into healthcare where your work can improve people’s quality of life. If so, this course gives you a glimpse into why this work matters, what you’d be doing in this role, and what takes place on the Path to Value, where data is gathered from patients at the point of care, moves into data warehouses to be prepared for analysis, then moves along the data pipeline to be transformed into valuable insights that can save lives, reduce costs, improve healthcare, and make it more accessible and affordable.

Aug 3rd 2026

4 Weeks

Healthcare Healthcare Data Data Models

Coursera

University of Colorado System

Data Warehouse Concepts, Design, and Data Integration (Coursera)

CS: Design & Product

This is the second course in the Data Warehousing for Business Intelligence specialization. Ideally, the courses should be taken in sequence. In this course, you will learn exciting concepts and skills for designing data warehouses and creating data integration workflows. These are fundamental skills for data warehouse developers and administrators. You will gain hands-on experience with data warehouse design and use open source products for manipulating pivot tables and creating data integration workflows.

Aug 3rd 2026

5-12 Weeks

Business Intelligence Data Warehousing Data Warehouse

Coursera

Duke University

Advanced Data Engineering (Coursera)

CS: Software Engineering

In this advanced course, you will gain practical expertise in scaling data engineering systems using cutting-edge tools and techniques. This course is designed for data scientists, data engineers, and anyone with a foundational understanding of data handling who desires to escalate their skills to handle larger, more complex datasets efficiently.

Jul 27th 2026

4 Weeks

Databases Database Management Queues

Coursera

IBM

Data Engineering Capstone Project (Coursera)

CS: Information & Technology

In this course you will apply a variety of data engineering skills and techniques you have learned as part of the previous courses in the IBM Data Engineering Professional Certificate. You will assume the role of a Junior Data Engineer who has recently joined the organization and be presented with a real-world use case that requires a data engineering solution.

Jul 27th 2026

5-12 Weeks

Python NoSQL Spark

Coursera

Microsoft

Data Integration with Microsoft Azure Data Factory (Coursera)

Statistics & Data Analysis Data Science

In this course, you will learn how to create and manage data pipelines in the cloud using Azure Data Factory. This course is part of a Specialization intended for Data engineers and developers who want to demonstrate their expertise in designing and implementing data solutions that use Microsoft Azure data services. It is ideal for anyone interested in preparing for the DP-203: Data Engineering on Microsoft Azure exam (beta).

Jan 6th 2025

5-12 Weeks

Microsoft Azure Data Integration Azure

Coursera

Rice University

Distributed Programming in Java (Coursera)

CS: Software Engineering CS: Programming

This course teaches learners (industry professionals and students) the fundamental concepts of Distributed Programming in the context of Java 8. Distributed programming enables developers to use multiple nodes in a data center to increase throughput and/or reduce latency of selected applications. By the end of this course, you will learn how to use popular distributed programming frameworks for Java programs, including Hadoop, Spark, Sockets, Remote Method Invocation (RMI), Multicast Sockets, Kafka, Message Passing Interface (MPI), as well as different approaches to combine distribution with multithreading.

Aug 3rd 2026

4 Weeks

Programming Java MPI