Building ETL and Data Pipelines with Bash, Airflow and Kafka (edX)

Offered by IBM

This course provides you with practical skills to build and manage data pipelines and Extract, Transform, Load (ETL) processes using shell scripts, Airflow and Kafka. Well-designed and automated data pipelines and ETL processes are the foundation of a successful Business Intelligence platform. Defining your data workflows, pipelines and processes early in the platform design ensures the right raw data is collected, transformed and loaded into desired storage layers and available for processing and analysis as and when required.

This course is designed to give you the knowledge and skills that Data Engineers and Data Warehousing specialists need to create and manage ETL, ELT, and data pipeline processes.
Upon completing this course, you'll have a solid understanding of Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) processes. You'll practice extracting data, transforming it, and loading the transformed data into a staging area; create an ETL data pipeline using Bash shell scripting; build a batch ETL workflow using Apache Airflow; and build a streaming data pipeline using Apache Kafka.
You'll gain hands-on experience through practice labs throughout the course and work on a real-world-inspired project, building data pipelines with several technologies, that you can add to your portfolio to demonstrate your ability to perform as a Data Engineer.
This course requires prior experience working with datasets, SQL, relational databases, and Bash shell scripts.

What you'll learn

  • Describe and differentiate between Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) processes.
  • Define data pipeline components, processes, tools, and technologies.
  • Create batch ETL processes using Apache Airflow and streaming data pipelines using Apache Kafka.
  • Demonstrate understanding of how shell-scripting is used to implement an ETL pipeline.

Related Courses

Distributed Programming in Java (Coursera)
Rice University

This course teaches learners (industry professionals and students) the fundamental concepts of Distributed Programming in the context of Java 8. Distributed programming enables developers to use multiple nodes in a data center to increase throughput and/or reduce latency of selected applications. By the end of this course, you will learn how to use popular distributed programming frameworks for Java programs, including Hadoop, Spark, Sockets, Remote Method Invocation (RMI), Multicast Sockets, Kafka, Message Passing Interface (MPI), as well as different approaches to combine distribution with multithreading.

Jun 8th 2026
4 Weeks
Introduction to SQL (edX)
IBM

Learn how to use and apply the powerful language of SQL to better communicate with and extract data from databases, a must for anyone working in Data Engineering, Data Analytics or Data Science. Much of the world's data lives in databases. SQL (Structured Query Language) is a powerful programming language used for communicating with and manipulating data in databases.

Self-Paced
Customising your models with TensorFlow 2 (Coursera)
Imperial College London

Welcome to this course on Customising your models with TensorFlow 2! In this course you will deepen your knowledge and skills with TensorFlow, in order to develop fully customised deep learning models and workflows for any application. You will use lower level APIs in TensorFlow to develop complex model architectures, fully customised layers, and a flexible data workflow. You will also expand your knowledge of the TensorFlow APIs to include sequence models.

Jun 8th 2026
5-12 Weeks
Machine Learning Operations 2 (MLOps2-GCP): Data Pipeline Automation & Optimization using Google Cloud Platform (GCP) (edX)
Statistics.comX, Statistics.com

Most data science projects fail. There are various reasons why, but one of the primary reasons is the challenge of deployment. One piece of the deployment puzzle is understanding how to automate your pipeline’s functions and continuously optimize its performance, which is why we developed this course, MLOps2: Data Pipeline Automation & Optimization using Google Cloud Platform (GCP).

Self-Paced
Machine Learning Operations 2 (MLOps2-AWS): Data Pipeline Automation & Optimization using Amazon Web Services (AWS) (edX)
Statistics.comX, Statistics.com

Most data science projects fail. There are various reasons why, but one of the primary reasons is the challenge of deployment. One piece of the deployment puzzle is understanding how to automate your pipeline’s functions and continuously optimize its performance, which is why we developed this course, MLOps2: Data Pipeline Automation & Optimization using Amazon Web Services (AWS).

Self-Paced
Hands-on Introduction to Linux Commands and Shell Scripting (Coursera)
IBM

This mini-course provides a practical introduction to commonly used Linux / UNIX shell commands and teaches you the basics of Bash shell scripting to automate a variety of tasks. The course includes both video-based lectures and hands-on labs to practice and apply what you learn. You will have no-charge access to a virtual Linux server through your web browser, so you don't need to download or install anything to perform the labs.

Jun 8th 2026
1 Week
Python Basics for Data Science (edX)
IBM

This Python course provides a beginner-friendly introduction to Python for Data Science. Practice through lab exercises, and you'll be ready to create your first Python scripts on your own! Kickstart your learning of Python for data science, as well as programming in general with this introduction to Python course. This beginner-friendly Python course will quickly take you from zero to programming in Python in a matter of hours and give you a taste of how to start working with data in Python.

Self-Paced
Healthcare Data Models (Coursera)
University of California, Davis

Career prospects are bright for those qualified to work in healthcare data analytics. Perhaps you work in data analytics, but are considering a move into healthcare where your work can improve people’s quality of life. If so, this course gives you a glimpse into why this work matters, what you’d be doing in this role, and what takes place on the Path to Value. On that path, data is gathered from patients at the point of care, moves into data warehouses to be prepared for analysis, then moves along the data pipeline to be transformed into valuable insights that can save lives, reduce costs, and improve healthcare, making it more accessible and affordable.

Jun 8th 2026
4 Weeks
Relational Database Basics (edX)
IBM

This course is an introduction to the world of relational databases. You will explore the fundamental concepts of relational databases and Relational Database Management Systems (RDBMS), learn about relational database design, and understand how to transform source data into tables with clearly defined relationships.

Self-Paced