Coursera

Machine Learning: Clustering & Retrieval (Coursera)

Offered by University of Washington

Case Studies: Finding Similar Documents. A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? Moreover, what if there are millions of other documents? Each time you want to retrieve a new document, do you need to search through all other documents? How do you group similar documents together? How do you discover new, emerging topics that the documents cover?

In this third case study, finding similar documents, you will examine similarity-based algorithms for retrieval. In this course, you will also examine structured representations for describing the documents in the corpus, including clustering and mixed membership models, such as latent Dirichlet allocation (LDA). You will implement expectation maximization (EM) to learn the document clusterings, and see how to scale the methods using MapReduce.
By the end of this course, you will be able to:
-Create a document retrieval system using k-nearest neighbors.
-Identify various similarity metrics for text data.
-Reduce computations in k-nearest neighbor search by using KD-trees.
-Produce approximate nearest neighbors using locality sensitive hashing.
-Compare and contrast supervised and unsupervised learning tasks.
-Cluster documents by topic using k-means.
-Describe how to parallelize k-means using MapReduce.
-Examine probabilistic clustering approaches using mixture models.
-Fit a mixture of Gaussians model using expectation maximization (EM).
-Perform mixed membership modeling using latent Dirichlet allocation (LDA).
-Describe the steps of a Gibbs sampler and how to use its output to draw inferences.
-Compare and contrast initialization techniques for non-convex optimization objectives.
-Implement these techniques in Python.
Course 4 of 4 in the Machine Learning Specialization.

Syllabus

WEEK 1
Welcome
Clustering and retrieval are some of the most high-impact machine learning tools out there. Retrieval is used in almost every application and device we interact with, such as providing a set of products related to one a shopper is currently considering, or a list of people you might want to connect with on a social media platform. Clustering can be used to aid retrieval, but is a more broadly useful tool for automatically discovering structure in data, like uncovering groups of similar patients. This introduction to the course provides you with an overview of the topics we will cover and the background knowledge and resources we assume you have.

WEEK 2
Nearest Neighbor Search
We start the course by considering a retrieval task of fetching a document similar to one someone is currently reading. We cast this problem as one of nearest neighbor search, which is a concept we have seen in the Foundations and Regression courses. However, here, you will take a deep dive into two critical components of the algorithms: the data representation and metric for measuring similarity between pairs of datapoints. You will examine the computational burden of the naive nearest neighbor search algorithm, and instead implement scalable alternatives using KD-trees for handling large datasets and locality sensitive hashing (LSH) for providing approximate nearest neighbors, even in high-dimensional spaces. You will explore all of these ideas on a Wikipedia dataset, comparing and contrasting the impact of the various choices you can make on the nearest neighbor results produced.
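As a rough illustration of the naive search described above, the following sketch builds tf-idf vectors and scans every document for the closest one under cosine distance. The function names (tf_idf, cosine_distance, nearest_neighbors) and the four toy documents are assumptions made for this example; they are not taken from the course materials.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one sparse tf-idf vector (dict: word -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({w: c * math.log(n_docs / df[w]) for w, c in tf.items()})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbors(query_index, vectors, k=1):
    """Brute-force search: compare the query against every other document."""
    distances = [
        (cosine_distance(vectors[query_index], v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return sorted(distances)[:k]


# Hypothetical toy corpus: two sports articles and two finance articles.
docs = [
    "the team won the football match",
    "the football team lost the match",
    "stock markets fell sharply today",
    "investors sold stock as markets fell",
]
vectors = tf_idf(docs)
neighbors = nearest_neighbors(0, vectors, k=1)
print(neighbors)
```

Each query costs one distance computation per document in the corpus, which is the burden that KD-trees and locality sensitive hashing are meant to reduce.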

WEEK 3
Clustering with k-means
In clustering, our goal is to group the datapoints in our dataset into disjoint sets. Motivated by our document analysis case study, you will use clustering to discover thematic groups of articles by "topic". These topics are not provided in this unsupervised learning task; rather, the idea is to output such cluster labels that can be post-facto associated with known topics like "Science", "World News", etc. Even without such post-facto labels, you will examine how the clustering output can provide insights into the relationships between datapoints in the dataset. The first clustering algorithm you will implement is k-means, which is the most widely used clustering algorithm out there. To scale up k-means, you will learn about the general MapReduce framework for parallelizing and distributing computations, and then how the iterates of k-means can utilize this framework. You will show that k-means can provide an interpretable grouping of Wikipedia articles when appropriately tuned.
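To make the link between k-means and MapReduce more concrete, here is a minimal single-machine sketch in which each iteration is written as a map step (assign a point to its closest center) and a reduce step (average the points of one cluster). The names map_step, reduce_step, and kmeans, and the six 2-D points, are illustrative assumptions; a real MapReduce job would run the two steps on distributed workers.

```python
from collections import defaultdict


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def map_step(point, centers):
    """Map: emit (index of the closest center, (point, 1))."""
    closest = min(range(len(centers)),
                  key=lambda j: squared_distance(point, centers[j]))
    return closest, (point, 1)


def reduce_step(values):
    """Reduce: sum the points and counts of one cluster, return the mean."""
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for point, n in values:
        for d in range(dim):
            total[d] += point[d]
        count += n
    return tuple(t / count for t in total)


def kmeans(points, centers, max_iters=100):
    centers = list(centers)
    for _ in range(max_iters):
        # Group the mapped values by key (the "shuffle" phase).
        groups = defaultdict(list)
        for point in points:
            key, value = map_step(point, centers)
            groups[key].append(value)
        # An empty cluster keeps its previous center.
        new_centers = [
            reduce_step(groups[j]) if j in groups else centers[j]
            for j in range(len(centers))
        ]
        if new_centers == centers:  # assignments no longer change the means
            break
        centers = new_centers
    assignments = [map_step(p, centers)[0] for p in points]
    return centers, assignments


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, assignments = kmeans(points, centers=[points[0], points[3]])
print(centers, assignments)
```

The result depends on the initial centers, which is why initialization techniques matter for this non-convex objective.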

WEEK 4
Mixture Models
In k-means, observations are each hard-assigned to a single cluster, and these assignments are based just on the cluster centers, rather than also incorporating shape information. In our second module on clustering, you will perform probabilistic model-based clustering that (1) provides a more descriptive notion of a "cluster" and (2) accounts for uncertainty in assignments of datapoints to clusters via "soft assignments". You will explore and implement a broadly useful algorithm called expectation maximization (EM) for inferring these soft assignments, as well as the model parameters. To gain intuition, you will first consider a visually appealing image clustering task. You will then cluster Wikipedia articles, handling the high dimensionality of the tf-idf document representation considered.
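The soft assignments mentioned above can be sketched with EM for a mixture of Gaussians in one dimension. This is a simplified illustration rather than the course's implementation: the function name em_gmm_1d, the variance floor, the starting parameters, and the eight data values are all assumptions chosen for the example.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iters=50):
    """EM for a 1-D Gaussian mixture with fixed number of components."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    resp = []
    for _ in range(n_iters):
        # E-step: responsibility of each component for each datapoint
        # (the "soft assignment").
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate parameters from the soft counts.
        for j in range(k):
            soft_count = sum(r[j] for r in resp)
            weights[j] = soft_count / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / soft_count
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / soft_count
            variances[j] = max(var, 1e-6)  # assumed floor to avoid collapse
    # resp holds the responsibilities from the final E-step.
    return means, variances, weights, resp


# Hypothetical data drawn by hand around 1 and 5.
data = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.2]
means, variances, weights, resp = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print(means, variances, weights)
```

Unlike k-means, each datapoint contributes to every component in proportion to its responsibility, and the fitted variances carry the shape information that cluster centers alone do not.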

WEEK 5
Mixed Membership Modeling via Latent Dirichlet Allocation
The clustering model inherently assumes that data divide into disjoint sets, e.g., documents by topic. But often our data objects are better described via memberships in a collection of sets, e.g., multiple topics. In our fourth module, you will explore latent Dirichlet allocation (LDA) as an example of such a mixed membership model particularly useful in document analysis. You will interpret the output of LDA and explore various ways the output can be used, such as a set of learned document features. The mixed membership modeling ideas you learn about through LDA for document analysis carry over to many other interesting models and applications, like social network models where people have multiple affiliations. Throughout this module, we introduce aspects of Bayesian modeling and a Bayesian inference algorithm called Gibbs sampling. You will be able to implement a Gibbs sampler for LDA by the end of the module.
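For readers who want a feel for what a Gibbs sampler for LDA involves, the sketch below uses the collapsed variant, in which each word's topic is resampled given all other assignments. Whether the course uses this exact variant is an assumption, as are the function name lda_gibbs, the hyperparameter values, and the toy documents (lists of integer word ids).

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # words per topic in each doc
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                              # total words per topic
    assignments = []

    # Random initialization of the topic of every word occurrence.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                probs = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=probs)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Hypothetical corpus: word ids 0-2 and 3-5 stand for two different themes.
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2, vocab_size=6)
print(doc_topic)
```

The rows of doc_topic give, for each document, how many of its words are currently assigned to each topic; normalizing such counts (ideally averaged over several samples) yields the mixed membership proportions that can serve as document features.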

WEEK 6
Hierarchical Clustering & Closing Remarks
In the conclusion of the course, we will recap what we have covered. This represents both techniques specific to clustering and retrieval, as well as foundational machine learning concepts that are more broadly useful. We provide a quick tour into an alternative clustering approach called hierarchical clustering, which you will experiment with on the Wikipedia dataset. Following this exploration, we discuss how clustering-type ideas can be applied in other areas like segmenting time series. We then briefly outline some important clustering and retrieval ideas that we did not cover in this course. We conclude with an overview of what's in store for you in the rest of the specialization.
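Hierarchical clustering comes in several forms, and the description above does not say which one the course uses. As one possible illustration, the sketch below implements agglomerative clustering with single linkage: every point starts in its own cluster and the two closest clusters are merged repeatedly. The function name single_linkage and the six 2-D points are assumptions for this example, and math.dist requires Python 3.8 or later.

```python
import math


def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # history of (cluster_a, cluster_b, distance)

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        # Ties are broken by the lowest cluster indices.
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merges.append((clusters[a], clusters[b], dist))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


# Hypothetical data: a group of three, a group of two, and an outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
clusters, merges = single_linkage(points, n_clusters=3)
print(clusters)
```

The recorded merge history is what distinguishes this from flat clustering: cutting it at different depths gives groupings at different levels of granularity.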

Related Courses

Coursera

Karlsruhe Institute of Technology - KIT

Machine Translation (Coursera)

Data Science

Welcome to the CLICS-Machine Translation MOOC. This MOOC explains the basic principles of machine translation. Machine translation is the task of translating from one natural language to another natural language. Therefore, these algorithms can help people communicate in different languages. Such algorithms are used in common applications, from Google Translate to apps on your mobile device.

Aug 3rd 2026

5-12 Weeks

Machine Learning Translation Data Science

Coursera

Sungkyunkwan University - SKKU

Machine Learning for Smart Beta (Coursera)

Economics & Finance Business

In this 4-week course, you will learn about Smart Beta products. Smart beta products have the characteristics of both passive investment (having predetermined rules) and active investment (allowing for factor investment). We will walk through the creation mechanisms behind different smart beta products and recreate some of them using R programming.

Jul 27th 2026

4 Weeks

Machine Learning Regression Classification

Coursera

University of Illinois at Urbana-Champaign

Machine Learning for Accounting with Python (Coursera)

Data Science

This course, Machine Learning for Accounting with Python, introduces machine learning algorithms (models) and their applications in accounting problems. It covers classification, regression, clustering, text analysis, and time series analysis. It also discusses model evaluation and model optimization. This course provides an entry point for students to apply appropriate machine learning models to business-related datasets with Python to solve various problems.

Jul 27th 2026

5-12 Weeks

Algorithms Machine Learning Clustering

Coursera

Illinois Tech

Cloud: Platform as a Service - Bachelor's (Coursera)

CS: Information & Technology

This course is aimed at preparing individuals to gain the knowledge, skills, and abilities needed to manage Platform as a Service (PaaS) in the Cloud. Students will learn to deploy, operate, and maintain cloud platforms for storing, processing, and transferring information with architecture design principles and a structured approach. Students will also learn the shared responsibility model and cloud security best practices to secure PaaS platforms for application-hosting environments.

Aug 3rd 2026

5-12 Weeks

Cloud Machine Learning PaaS

Coursera

Arm

Getting Started with Machine Learning at the Edge on Arm (Coursera)

Computer Science

The age of machine learning has arrived! Arm technology is powering a new generation of connected devices with sophisticated sensors that can collect a vast range of environmental, spatial and audio/visual data. Typically this data is processed in the cloud using advanced machine learning tools that are enabling new applications reshaping the way we work, travel, live and play.

Aug 3rd 2026

5-12 Weeks

Machine Learning Computer Vision IoT

Coursera

Edureka

Gen AI in Cybersecurity (Coursera)

Security & Networking

Embark on a transformative journey into the realm of Gen AI in Cybersecurity with our comprehensive course. Dive deep into the intricacies of harnessing artificial intelligence to secure digital landscapes, from foundational principles to cutting-edge concepts.

Jul 27th 2026

1 Week

Machine Learning Data Analysis Cybersecurity

Coursera

Politecnico di Milano, EIT Digital

Data Science for Business Innovation (Coursera)

Statistics & Data Analysis Data Science

The course is a compendium of the must-have expertise in data science for executives and middle management to foster data-driven innovation. It consists of introductory lectures spanning big data, machine learning, data valorization and communication. Topics cover the essential concepts and intuitions on data needs, data analysis, machine learning methods, respective pros and cons, and practical applicability issues.

Jul 27th 2026

4 Weeks

NoSQL Machine Learning Big Data

Coursera

Google Cloud

Preparing for the Google Cloud Professional Data Engineer Exam em Português Brasileiro (Coursera)

CS: Information & Technology Computer Science

Why take the course: "The best way to prepare for the exam is to be competent in the skills required of the job." This course uses a top-down approach. It identifies the skills you already have and presents new information and areas to broaden your knowledge. Use this course to create your own personalized preparation plan. It will help you identify what you know and what you need to study further, as well as develop and practice the skills required for the competencies of the role.

Aug 3rd 2026

1 Week

Machine Learning Data Engineer Google Cloud

Coursera

IIT Roorkee

Supply Chain Analytics (Coursera)

Business

Welcome to Supply Chain Analytics! In this course you will learn about advanced decision problems in Supply Chain Management and the application of optimisation formulations and their solutions to address them. The course has been designed to help you advance your career as business analysts, supply chain managers, and other similar roles by learning in-demand skills to increase efficiency, drive organisational growth, and make a positive business impact. The course also offers a good starting point to those with purely academic and research interests.

Jul 27th 2026

5-12 Weeks

Machine Learning Forecasting Supply Chain

Coursera

Edge Impulse

Computer Vision with Embedded Machine Learning (Coursera)

Data Science

Computer vision (CV) is a fascinating field of study that attempts to automate the process of assigning meaning to digital images or videos. In other words, we are helping computers see and understand the world around us! A number of machine learning (ML) algorithms and techniques can be used to accomplish CV tasks, and as ML becomes faster and more efficient, we can deploy these techniques to embedded systems.

Aug 3rd 2026

3 Weeks

Machine Learning Computer Vision Object Detection

Coursera

University of California, San Diego

Code Free Data Science (Coursera)

Data Science

The Code Free Data Science class is designed for learners seeking to gain or expand their knowledge in the area of Data Science. Participants will receive the basic training in effective predictive analytic approaches accompanying the growing discipline of Data Science without any programming requirements. Machine Learning methods will be presented by utilizing the KNIME Analytics Platform to discover patterns and relationships in data.

Aug 3rd 2026

4 Weeks

Machine Learning Big Data Data Science

Coursera

Google Cloud

Preparing for the Google Cloud Professional Data Engineer Exam (Coursera)

CS: Information & Technology Computer Science

From the course: "The best way to prepare for the exam is to be competent in the skills required of the job." This course uses a top-down approach to recognize knowledge and skills already known, and to surface information and skill areas for additional preparation. You can use this course to help create your own custom preparation plan. It helps you distinguish what you know from what you don't know. And it helps you develop and practice skills required of practitioners who perform this job.

Jul 27th 2026

5-12 Weeks

Machine Learning Data Engineer Google Cloud