Sep 27, 2019 · newsletter

Automating Weak Supervision

What is weak supervision?

We recently explored Snorkel, a weak supervision framework for learning when there are limited high-quality labels (see blog post and notebook). To use Snorkel, subject matter experts first write labeling functions to programmatically create labels. Very often these labeling functions attempt to capture heuristics. The labels are then fed into a generative model. The job of the generative model is to estimate the accuracy of the labeling functions while automatically taking into account the pairwise correlation between these functions and labeling propensity (how often a function actually creates a label). Once the generative model is trained, it can be used to estimate the true label for each candidate. The generative model outputs probabilistic labels - numbers between 0 and 1, representing the probability of a positive class. These probabilistic labels can be used to train any end model with a noise-aware loss.
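To make the vocabulary concrete, here is a minimal sketch of labeling functions, the label matrix they produce, and labeling propensity (coverage). It does not use the Snorkel API. The three heuristics, the example texts, and the `vote_probability` helper are all hypothetical, and the vote share is only a naive stand-in for the generative model, which additionally estimates accuracies and correlations.

```python
# Label convention assumed for this sketch: 1 = positive, -1 = negative, 0 = abstain.
ABSTAIN, NEGATIVE, POSITIVE = 0, -1, 1


def lf_contains_refund(text):
    """Hypothetical heuristic: messages mentioning a refund are complaints."""
    return POSITIVE if "refund" in text.lower() else ABSTAIN


def lf_contains_thanks(text):
    """Hypothetical heuristic: messages that say thanks are not complaints."""
    return NEGATIVE if "thank" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    """Hypothetical heuristic: two or more exclamation marks signal a complaint."""
    return POSITIVE if text.count("!") >= 2 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_refund, lf_contains_thanks, lf_many_exclamations]


def label_matrix(texts, lfs=LABELING_FUNCTIONS):
    """One row per data point, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]


def coverage(matrix):
    """Labeling propensity: fraction of data points each function labels."""
    n = len(matrix)
    return [sum(1 for row in matrix if row[j] != ABSTAIN) / n
            for j in range(len(matrix[0]))]


def vote_probability(row):
    """Naive stand-in for the generative model: share of positive votes."""
    votes = [v for v in row if v != ABSTAIN]
    if not votes:
        return 0.5  # no function fired, so no evidence either way
    return sum(1 for v in votes if v == POSITIVE) / len(votes)


texts = [
    "I want a refund now!!",
    "Thanks, that fixed it.",
    "Where is my order?",
    "Thank you, but I still need a refund.",
]
L = label_matrix(texts)
probs = [vote_probability(row) for row in L]
```

The third and fourth messages both end up at 0.5, one because no function fired and one because two functions disagree. The generative model can resolve the second case better than a plain vote once it has estimated how accurate each function is.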

Writing these labeling functions is not always straightforward; it can be time-consuming and expensive. The idea behind Snuba (PDF) is to create a system to “automatically generate heuristics using a small labeled dataset to assign training labels to a large unlabeled dataset.” The labels generated by all these heuristics then feed into a weak supervision framework.

Automatically generating heuristics

Doing this step automatically requires replacing the human reasoning that drives heuristic development. To automate the process, the authors take their cue from how humans generate heuristics. They observe that subject matter experts often fiddle with the threshold for each heuristic until it classifies correctly. Radiologists, for example, try to figure out a threshold for each heuristic that uses a geometric property of a tumor in order to determine if it is malignant. In addition, subject matter experts tend to develop a single heuristic to assign accurate labels to a subset of the unlabeled data; covering the entire set of unlabeled data requires multiple heuristics. Lastly, humans stop generating heuristics when they have exhausted their domain knowledge.

Inner workings of Snuba

The proposed system requires a small set of labeled data to begin, and works as follows. The labeled data is first transformed into primitives (or features). For tumor images, this might mean numerical features such as the area or perimeter of the tumor. For text data, this might be one-hot vectors for the bag-of-words representation. Once we have the primitives, Snuba iteratively generates heuristics on a subset of the input data. Each iteration results in a new heuristic specialized to the subset of data that did not receive high confidence labels from the existing set of heuristics. In addition, the system knows when to stop. All of this is accomplished using a three-part architecture: synthesizer, pruner, and verifier.

Components of Snuba: synthesizer, pruner and verifier (image credit)


“The synthesizer takes as input the labeled dataset, or a subset of the labeled dataset after the first iteration, and outputs a candidate set of heuristics.” Each heuristic is actually a classification model - a decision stump, a logistic regressor, or a k-nearest neighbor classifier. These models take in primitives (the feature representation of the original data point) and assign probabilistic labels to the data points. For binary classification, these are probabilities that the input primitive is a 1 (positive label) or a -1 (negative label).

Models for creating heuristics (image credit)
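As an illustration of a heuristic as a small model, the sketch below fits a decision stump on a single primitive and uses the fraction of positive training labels on each side of the split as the probabilistic label. This is a simplified stand-in, not Snuba's implementation: the function names, the tumor measurements, and the choice of one stump per primitive are assumptions made for the example.

```python
def fit_stump(values, labels):
    """Fit a one-primitive decision stump; labels are 1 or -1.

    Returns (threshold, p_left, p_right), where p_left and p_right are the
    fractions of positive training labels on each side of the threshold.
    """
    best = None
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        # Training errors if each side predicts its own majority class.
        errors = (min(left.count(1), left.count(-1))
                  + min(right.count(1), right.count(-1)))
        if best is None or errors < best[0]:
            best = (errors, threshold,
                    left.count(1) / len(left), right.count(1) / len(right))
    if best is None:
        raise ValueError("need at least two distinct primitive values")
    return best[1:]


def stump_probability(stump, value):
    """Probability of the positive class for one primitive value."""
    threshold, p_left, p_right = stump
    return p_left if value <= threshold else p_right


# Hypothetical primitives for eight labeled tumors (1 = malignant, -1 = benign).
primitives = {
    "area": [1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5, 5.0],
    "perimeter": [4.0, 6.0, 5.0, 7.0, 5.5, 8.0, 9.0, 7.5],
}
labels = [-1, -1, -1, 1, -1, 1, 1, 1]

# Candidate set: one stump per primitive.
candidates = {name: fit_stump(column, labels)
              for name, column in primitives.items()}
```

Here the `area` stump splits at 2.5 and is only 80% confident on the larger tumors, while the `perimeter` stump separates the toy data perfectly at 6.5. Both would be handed on as candidates.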

These probabilistic labels need to be turned into an actual label (since that’s what a human tries to do with heuristics). A straightforward approach is to use probability = 0.5 as a threshold. Any probability less than 0.5 is considered a negative label, and any probability above 0.5 is considered a positive label. Snuba builds in a threshold beta around 0.5, so anything greater than 0.5 + beta is a positive label, and anything less than 0.5 - beta is a negative label. All other values result in an “abstained” label. The system tries to find the beta that maximizes the F1 score on the labeled dataset. It does so by iterating through equally spaced values of beta (between 0 and 0.5), calculating the F1 score the heuristic achieves, and selecting the beta that maximizes the F1 score. In doing so, Snuba is using the heuristic performance on the small labeled dataset as a proxy for the heuristic performance on the large unlabeled dataset.
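The beta search described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: F1 is computed for the positive class with abstentions on true positives counted as misses, the grid has `steps` equally spaced values starting at 0, and ties go to the smallest beta. The post does not specify these details, and the probabilities and labels are made up.

```python
def apply_threshold(prob, beta):
    """Turn a probability into 1, -1, or 0 (abstain) using margin beta."""
    if prob > 0.5 + beta:
        return 1
    if prob < 0.5 - beta:
        return -1
    return 0


def f1_score(predicted, truth):
    """F1 for the positive class; an abstain on a true positive is a miss."""
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == -1)
    fn = sum(1 for p, t in zip(predicted, truth) if p != 1 and t == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_beta(probs, truth, steps=5):
    """Scan equally spaced betas in [0, 0.5) and keep the one with the best F1."""
    betas = [0.5 * i / steps for i in range(steps)]
    scored = [(f1_score([apply_threshold(p, b) for p in probs], truth), b)
              for b in betas]
    # max() returns the first maximum, so ties go to the smallest beta.
    f1, beta = max(scored, key=lambda pair: pair[0])
    return beta, f1


# Hypothetical heuristic outputs on a small labeled set.
probs = [0.95, 0.88, 0.72, 0.63, 0.56, 0.44, 0.27, 0.12]
truth = [1, 1, 1, -1, -1, 1, -1, -1]
beta, f1 = best_beta(probs, truth)
```

On this toy data the search settles on beta = 0.2. A smaller beta lets the two false positives at 0.63 and 0.56 through, while a larger one abstains on true positives and loses recall.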


The pruner takes multiple candidate heuristics from the synthesizer and selects one to add to the existing set of heuristics. The goal is to select heuristics that label data points which have never received a label from other heuristics. At the same time, the selected heuristics should perform well when applied to the labeled dataset. To do this, the pruner uses a weighted average of Jaccard distance and F1 score to select the highest ranking heuristic from the candidate set.
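A rough sketch of this ranking step follows. It assumes the Jaccard distance is taken between the set of data points a candidate labels and the set already labeled by the existing heuristics, and that the two terms are weighted equally. The `weight` value, the candidate names, and the numbers are illustrative rather than taken from the paper.

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|; higher means less overlap."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def rank_candidates(candidates, already_labeled, weight=0.5):
    """Score each candidate by a weighted average of diversity and F1.

    candidates maps a name to (f1_on_labeled_set, indices_it_labels).
    """
    return {
        name: weight * jaccard_distance(labeled, already_labeled)
        + (1 - weight) * f1
        for name, (f1, labeled) in candidates.items()
    }


# Data points that existing heuristics have already labeled (hypothetical).
already_labeled = {0, 1, 2, 3}

candidates = {
    "stump_area": (0.90, {0, 1, 2, 3, 4}),       # accurate but redundant
    "stump_perimeter": (0.80, {4, 5, 6, 7}),     # labels only new points
    "knn_texture": (0.60, {2, 3, 4, 5, 6}),      # some overlap, weaker F1
}

scores = rank_candidates(candidates, already_labeled)
best = max(scores, key=scores.get)
```

The most accurate candidate, `stump_area`, loses to `stump_perimeter` because it mostly relabels points that are already covered, which is the trade-off the weighted average is meant to capture.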


The verifier takes care of the stopping condition. It uses the label aggregator (the generative model) to produce a single, probabilistic training label for each datapoint in the unlabeled dataset. It also identifies data points in the labeled dataset that receive low confidence labels (probability being close to 0.5). The verifier passes this subset to the synthesizer with the assumption that similar data points in the unlabeled dataset would have also received low confidence labels. The stopping condition is met “if i) a statistical measure suggests the generative model in the synthesizer is not learning the accuracies of the heuristics properly, or ii) there are no low confidence data points in the small, labeled dataset.” The statistical measure uses the small, labeled dataset to indirectly determine whether the generated heuristics are worse than random for the unlabeled dataset.
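The low-confidence selection and the second stopping condition can be sketched as below. The `margin` of 0.1 around 0.5 and the function names are assumptions for illustration, and the statistical check in condition (i) is left out because the post does not describe it in enough detail to reproduce.

```python
def low_confidence_indices(probs, margin=0.1):
    """Indices of labeled points whose aggregated probability is near 0.5."""
    return [i for i, p in enumerate(probs) if abs(p - 0.5) <= margin]


def verifier_step(labeled_points, probs, margin=0.1):
    """Return (stop, subset) for the next synthesizer iteration.

    stop is True when no labeled point is low confidence (condition ii);
    subset holds the labeled points passed back to the synthesizer.
    """
    indices = low_confidence_indices(probs, margin)
    subset = [labeled_points[i] for i in indices]
    return len(subset) == 0, subset


# Hypothetical aggregated probabilities for five labeled data points.
labeled_points = ["a", "b", "c", "d", "e"]
probs = [0.97, 0.52, 0.08, 0.45, 0.81]

stop, subset = verifier_step(labeled_points, probs)
```

With these numbers the loop continues, and the next round of heuristics is synthesized from points `b` and `d` only.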

Does it work?

The authors show that training labels from Snuba outperform labels from semi-supervised learning and from user-developed heuristics in terms of end model performance for tasks across various domains. These tasks include image classification and text and multi-modal classification.

In some ways Snuba reminds us of active learning - the iterative nature, the need for a stopping condition and the labeled dataset requirement. Active learning relies on the initial small labeled dataset to build a learner (or a model). A selection strategy then picks out data points that are difficult for the model and requests labels for them. The labeled data points (labeled by humans) are added back to the small labeled dataset and the process repeats. The learner gets better as a result. Snuba relies on the initial small labeled dataset to create some heuristics, and continues to use the same small labeled dataset to add more heuristics while evaluating diversity using the unlabeled dataset. Both need a stopping condition and Snuba’s stopping condition is better defined. We think Snuba seems promising, but wonder about the effect of generalizing from a small, labeled dataset to a large, unlabeled dataset.






Cloudera Fast Forward Labs

Making the recently possible useful.

Cloudera Fast Forward Labs is an applied machine learning research group. Our mission is to empower enterprise data science practitioners to apply emergent academic research to production machine learning use cases in practical and socially responsible ways, while also driving innovation through the Cloudera ecosystem. Our team brings thoughtful, creative, and diverse perspectives to deeply researched work. In this way, we strive to help organizations make the most of their ML investment as well as educate and inspire the broader machine learning and data science community.


©2022 Cloudera, Inc. All rights reserved.