Sep 27, 2019 · newsletter

Automating Weak Supervision

What is weak supervision?

We recently explored Snorkel, a weak supervision framework for learning when there are limited high-quality labels (see blog post and notebook). To use Snorkel, subject matter experts first write labeling functions to programmatically create labels. Very often these labeling functions attempt to capture heuristics. The labels are then fed into a generative model. The job of the generative model is to estimate the accuracy of the labeling functions while automatically taking into account the pairwise correlation between these functions and labeling propensity (how often a function actually creates a label). Once the generative model is trained, it can be used to estimate the true label for each candidate. The generative model outputs probabilistic labels - numbers between 0 and 1, representing the probability of a positive class. These probabilistic labels can be used to train any end model with a noise-aware loss.
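As a rough illustration of the pieces described above, here is a minimal sketch in plain Python. It does not use the Snorkel API; the toy spam task, the two labeling functions, and the helper names (`label_matrix`, `labeling_propensity`, `noise_aware_loss`) are all hypothetical. The loss shown is one common way to make a loss noise-aware, the expected cross-entropy under a probabilistic label, and is an assumption rather than something the post specifies.

```python
import math

ABSTAIN, NEGATIVE, POSITIVE = 0, -1, 1

# Hypothetical labeling functions for a toy spam task (not from the post).
def lf_contains_free(text):
    return POSITIVE if "free" in text.lower() else ABSTAIN

def lf_short_message(text):
    return NEGATIVE if len(text.split()) <= 3 else ABSTAIN

def label_matrix(texts, lfs):
    """One row per candidate, one column per labeling function."""
    return [[lf(t) for lf in lfs] for t in texts]

def labeling_propensity(matrix, j):
    """Fraction of candidates for which labeling function j does not abstain."""
    votes = [row[j] for row in matrix]
    return sum(v != ABSTAIN for v in votes) / len(votes)

def noise_aware_loss(p_true, p_pred, eps=1e-12):
    """Expected cross-entropy when the label is positive with probability p_true."""
    p_pred = min(max(p_pred, eps), 1 - eps)
    return -(p_true * math.log(p_pred) + (1 - p_true) * math.log(1 - p_pred))

texts = [
    "Free prize inside, claim now",
    "see you soon",
    "Lunch at noon tomorrow with the team",
    "FREE entry",
]
matrix = label_matrix(texts, [lf_contains_free, lf_short_message])
print(matrix)
print(labeling_propensity(matrix, 0), labeling_propensity(matrix, 1))
```

In a real pipeline the label matrix would go to the generative model, and the probabilistic labels it produces would play the role of `p_true` when training the end model.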

Writing these labeling functions is not always straightforward; it can be time-consuming and expensive. The idea behind Snuba (PDF) is to create a system to “automatically generate heuristics using a small labeled dataset to assign training labels to a large unlabeled dataset.” The labels generated by all these heuristics then feed into a weak supervision framework.

Automatically generating heuristics

Doing this step automatically requires replacing the human reasoning that drives heuristic development. The authors take their cue from how humans generate heuristics in order to automate this process. From their observations, subject matter experts often fiddle with the threshold for each heuristic until it makes correct classifications. Radiologists, for example, try to figure out a threshold for each heuristic that uses a geometric property of a tumor in order to determine if it is malignant. In addition, subject matter experts tend to develop a single heuristic to assign accurate labels to a subset of the unlabeled data; covering the entire set of unlabeled data requires multiple heuristics. Lastly, humans stop generating heuristics when they have exhausted their domain knowledge.

Inner workings of Snuba

The proposed system works as follows, and requires a small set of labeled data to begin. The labeled data is first transformed into primitives (or features). For tumor images, this might mean numerical features such as the area or perimeter of the tumor. For text data, this might be one-hot vectors for the bag-of-words representation. Once we have the primitives, Snuba iteratively generates heuristics on a subset of the input data. Each iteration results in a new heuristic specialized to the subset of data that did not receive high-confidence labels from the existing set of heuristics. In addition, the system knows when to stop. All of this is accomplished using a three-part architecture: synthesizer, pruner, and verifier.
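To make the idea of text primitives concrete, here is a minimal sketch that turns documents into binary bag-of-words vectors. The function names and the two example documents are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
def build_vocabulary(docs):
    """Map each distinct lowercase token to a column index (sorted for stability)."""
    tokens = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(tokens)}

def to_primitives(doc, vocab):
    """Binary vector: 1 if the vocabulary word occurs in the document, else 0."""
    vec = [0] * len(vocab)
    for w in doc.lower().split():
        if w in vocab:
            vec[vocab[w]] = 1
    return vec

docs = ["tumor is large", "tumor is small"]
vocab = build_vocabulary(docs)
print(vocab)
print([to_primitives(d, vocab) for d in docs])
```

Each column of the resulting vectors is one primitive that a generated heuristic could later use.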

Components of Snuba: synthesizer, pruner and verifier (image credit)


“The synthesizer takes as input the labeled dataset, or a subset of the labeled dataset after the first iteration, and outputs a candidate set of heuristics.” Each heuristic is actually a classification model - a decision stump, a logistic regressor, or a k-nearest neighbor classifier. These models take in primitives (the feature representation of the original data point) and assign probabilistic labels to the data points. For binary classification, these are probabilities that the input primitive is a 1 (positive label) or a -1 (negative label).
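A minimal sketch of what a synthesizer could look like, assuming the simplest case: one decision stump per single primitive, with the probability of a positive label taken to be the fraction of positive training points on each side of the split. The function names and toy data are hypothetical, and this is not the authors' implementation.

```python
def fit_stump(values, labels):
    """Fit a one-feature decision stump; labels are +1 or -1.

    Returns (threshold, p_left, p_right), where p_left and p_right are the
    fractions of positive labels on each side of the threshold.
    """
    best = None
    cuts = sorted(set(values))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        pos_left = sum(y == 1 for y in left)
        pos_right = sum(y == 1 for y in right)
        # Training points classified correctly if each side predicts its majority.
        correct = (max(pos_left, len(left) - pos_left)
                   + max(pos_right, len(right) - pos_right))
        if best is None or correct > best[0]:
            best = (correct, threshold,
                    pos_left / len(left), pos_right / len(right))
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]

def stump_proba(stump, value):
    """Probability of the positive label for a single primitive value."""
    threshold, p_left, p_right = stump
    return p_left if value <= threshold else p_right

def synthesize(primitives, labels):
    """Return one candidate heuristic (a stump) per primitive column."""
    n_features = len(primitives[0])
    return [fit_stump([row[j] for row in primitives], labels)
            for j in range(n_features)]

primitives = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
labels = [-1, -1, 1, 1]
candidates = synthesize(primitives, labels)
print(candidates[0])
print(stump_proba(candidates[0], 3.2))
```

Here the first primitive separates the two classes cleanly, so its stump is a strong candidate; the second is noisier and would rank lower in a later selection step.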

Models for creating heuristics (image credit)

These probabilistic labels need to be turned into an actual label (since that’s what a human tries to do with heuristics). A straightforward approach is to use probability = 0.5 as a threshold: any probability less than 0.5 is considered a negative label, and any probability above 0.5 is considered a positive label. Snuba builds in a threshold beta around 0.5, so anything greater than 0.5 + beta is a positive label, and anything less than 0.5 - beta is a negative label. All other values result in an “abstained” label. The system tries to find the beta that maximizes the F1 score on the labeled dataset. It does so by iterating through equally spaced values of beta (between 0 and 0.5), calculating the F1 score the heuristic achieves, and selecting the beta that maximizes the F1 score. In doing so, Snuba is using the heuristic's performance on the small labeled dataset as a proxy for its performance on the large unlabeled dataset.
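The threshold search can be sketched as follows. Two details are assumptions on our part: abstained positives count against recall, and ties are broken in favor of the smallest beta (which abstains least). The function names, the grid size, and the toy probabilities are hypothetical.

```python
def apply_threshold(prob, beta):
    """Turn a probability into +1, -1, or 0 (abstain) using a margin beta."""
    if prob > 0.5 + beta:
        return 1
    if prob < 0.5 - beta:
        return -1
    return 0

def f1_score(pred, truth):
    """F1 for the positive class; abstains on true positives lower recall."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))
    if tp == 0:
        return 0.0
    precision = tp / sum(p == 1 for p in pred)
    recall = tp / sum(t == 1 for t in truth)
    return 2 * precision * recall / (precision + recall)

def select_beta(probs, truth, n_steps=10):
    """Scan equally spaced beta values in [0, 0.5) and keep the best F1."""
    best_beta, best_f1 = 0.0, -1.0
    for i in range(n_steps):
        beta = 0.5 * i / n_steps
        pred = [apply_threshold(p, beta) for p in probs]
        score = f1_score(pred, truth)
        if score > best_f1:  # strict: ties keep the smaller beta
            best_beta, best_f1 = beta, score
    return best_beta, best_f1

# Toy heuristic output on a small labeled set.
probs = [0.95, 0.85, 0.55, 0.58, 0.30, 0.10]
truth = [1, 1, -1, -1, -1, -1]
beta, f1 = select_beta(probs, truth, n_steps=5)
print(beta, f1)
```

In this toy example a plain 0.5 cutoff mislabels the two borderline negatives as positive, while a small margin lets the heuristic abstain on them instead.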


The pruner takes multiple candidate heuristics from the synthesizer and selects one to add to the existing set of heuristics. The goal is to select heuristics that label data points which have never received a label from other heuristics. At the same time, the selected heuristics should perform well when applied to the labeled dataset. To do this, the pruner uses a weighted average of Jaccard distance and F1 score to select the highest ranking heuristic from the candidate set.
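A minimal sketch of this ranking, assuming each candidate is summarized by its F1 score on the labeled set and by the set of data point indices it labels (does not abstain on). The equal weighting, the function names, and the toy candidates are assumptions for illustration.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; larger means the two sets overlap less."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

def prune(candidates, already_labeled, weight=0.5):
    """Pick the candidate with the best weighted mix of diversity and F1.

    candidates: list of (name, f1, covered) tuples, where covered is the set
    of indices the candidate labels.
    already_labeled: set of indices labeled by the existing heuristics.
    """
    def score(candidate):
        _, f1, covered = candidate
        diversity = jaccard_distance(covered, already_labeled)
        return weight * diversity + (1 - weight) * f1
    return max(candidates, key=score)

already_labeled = {0, 1, 2}
candidates = [
    ("h_a", 0.90, {0, 1, 2}),  # accurate, but covers nothing new
    ("h_b", 0.80, {3, 4}),     # slightly less accurate, entirely new coverage
    ("h_c", 0.85, {2, 3}),     # partial overlap
]
print(prune(candidates, already_labeled)[0])
```

With equal weights, the candidate that covers only new points wins even though its F1 score is the lowest of the three, which is the trade-off the pruner is meant to make.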


The verifier takes care of the stopping condition. It uses the label aggregator (the generative model) to produce a single, probabilistic training label for each data point in the unlabeled dataset. It also identifies data points in the labeled dataset that receive low-confidence labels (probability close to 0.5). The verifier passes this subset to the synthesizer with the assumption that similar data points in the unlabeled dataset would have also received low-confidence labels. The stopping condition is met “if i) a statistical measure suggests the generative model in the synthesizer is not learning the accuracies of the heuristics properly, or ii) there are no low confidence data points in the small, labeled dataset.” The statistical measure uses the small, labeled dataset to indirectly determine whether the generated heuristics are worse than random for the unlabeled dataset.
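The feedback step can be sketched like this, assuming the label aggregator has already produced one probability per point in the labeled dataset. Only stopping condition (ii) is shown; the statistical measure in condition (i) is not described in enough detail here to sketch, so it is left out. The margin of 0.1 and the function names are hypothetical.

```python
def low_confidence_indices(probs, margin=0.1):
    """Indices whose aggregated probability lies within `margin` of 0.5."""
    return [i for i, p in enumerate(probs) if abs(p - 0.5) <= margin]

def verify(probs, margin=0.1):
    """Return (stop, subset): stop is True when no low-confidence points remain.

    subset holds the indices that would be passed back to the synthesizer
    for the next iteration.
    """
    subset = low_confidence_indices(probs, margin)
    return len(subset) == 0, subset

# Hypothetical aggregated probabilities for five labeled data points.
stop, subset = verify([0.95, 0.52, 0.10, 0.45, 0.70])
print(stop, subset)

# Once every point is labeled confidently, the loop would end.
stop_later, subset_later = verify([0.95, 0.91, 0.10, 0.05, 0.88])
print(stop_later, subset_later)
```

In the first call, the two points near 0.5 are the ones the next round of heuristics would be specialized to.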

Does it work?

The authors show that training labels from Snuba outperform labels from semi-supervised learning and from user-developed heuristics in terms of end model performance for tasks across various domains. These tasks include image classification and text and multi-modal classification.

In some ways Snuba reminds us of active learning - the iterative nature, the need for a stopping condition and the labeled dataset requirement. Active learning relies on the initial small labeled dataset to build a learner (or a model). A selection strategy then picks out data points that are difficult for the model and requests labels for them. The labeled data points (labeled by humans) are added back to the small labeled dataset and the process repeats. The learner gets better as a result. Snuba relies on the initial small labeled dataset to create some heuristics, and continues to use the same small labeled dataset to add more heuristics while evaluating diversity using the unlabeled dataset. Both need a stopping condition and Snuba’s stopping condition is better defined. We think Snuba seems promising, but wonder about the effect of generalizing from a small, labeled dataset to a large, unlabeled dataset.

Cloudera Fast Forward is an applied machine learning research group.