Updates from Cloudera Fast Forward on new research, prototypes, and exciting developments

Welcome to the December edition of Cloudera Fast Forward’s monthly newsletter. We have a bumper pack of releases for the holiday season: a new research release, the open sourcing of three previous reports, and, as usual, our team’s recommended reading for the month.

New research release!

Few-Shot Text Classification

BERT and Word2Vec discuss text classification

Text classification is a ubiquitous capability with a wealth of use cases including sentiment analysis, topic assignment, document identification, article recommendation, and more. But collecting enough annotated examples to train traditional classifiers can be quite costly. Instead, we take a look at a classic technique that can be used to perform text classification with few or even zero training examples! We’re talking about text embeddings, of course. New advances have significantly increased the quality of document embeddings and in our newest writing on Few Shot Text Classification this cycle we cover

how to use them for topic classification,
best practices for using them,
and potential limitations.

Follow the links in the report to find code snippets so you can try it for yourself, and build your own demo so you can see the method in action!

Federated Learning open source

The Federated Learning report cover

Two years ago we wrote a research report about Federated Learning. We’re pleased to make the report available to everyone, for free. You can read it online here: Federated Learning.

In the time since, it has only grown in relevance. Numerous startups have cropped up (and some disappeared by acquisition) with Federated Learning as their core technology. Google continues to promote the technology, including for non-machine learning use cases, as in Federated Analytics: Collaborative Data Science without Data Collection. This year saw (what we believe to be) the first conferences with a heavy focus on federated learning, The Federated Learning Conference and the Open Mined Privacy Conference, as well as dedicated workshops at high profile machine learning conferences like ICML and NeurIPS.

OpenMined continues to build a strong community around private machine learning, creating courses and open source tools to lower the barrier-to-entry to federated learning and related privacy enhancing techniques. Alongside those, TensorFlow Federated, IBM’s federated learning library and flower.dev are extending the tooling ecosystem.

Federated Learning is no panacea. In a privacy setting, decentralized data simply presents a different attack surface to centralized data. Not all applications require or benefit from federation. However, it is an important tool in the private machine learning toolkit.

Deep Learning for Image Analysis

To accompany last month’s research on Semantic Image Search (checkout the associated blog post Representation Learning 101 for Software Engineers), we’re opening up some more previous reports:

Deep Learning for Image Analysis is an oldie, having been released back in 2015, but still provides an introduction for the uninitiated.
Our more recent release on the same, Deep Learning for Image Analysis: 2019 edition, substantially expands the first report and covers some practical considerations, like trading off accuracy and latency, and interpreting model predictions. July 2019 is a long time ago in computer vision research, and while the benchmarks may have improved, the underlying concepts discussed are still relevant.

Recommended reading

Our research engineers share their favorite reads of the month.

Beyond Accuracy: Behavioral Testing of NLP Models with CheckList As a ML practitioner, one of the primary ways of evaluating a model’s performance is through measuring it’s accuracy on a held-out test dataset. While this is a useful indicator, time and again it has been recognized that such an approach does not necessarily mean that the model will generalize well in a production environment. Further, such a performance indicator provides little insight into where the model fails, and how one could fix it. So the question is how could one check whether a model really works? This is where one could borrow from software engineering practices. The authors (also the authors of the popular model explainability tool - LIME) propose CheckList, a task and model-agnostic methodology for testing NLP models inspired by principles of behavioral testing in software engineering. CheckList guides users in what to test by suggesting a list of linguistic capabilities. For example, testing vocabulary of the model or how it deals with named entities or negation. Further, it tests for potential capability failures by introducing different test types, such as prediction invariance in presence of certain perturbations. In addition, it provides users with tooling to generate hundreds of test cases easily using templates, lexicons, general purpose perturbations, visualizations and others. The authors also illustrate it’s utility by highlighting significant problems for some SOTA models and reveal bugs in commercial systems developed by large software companies. More importantly, it can help come up with a standardized way of evaluating NLP models moving beyond just accuracy on held-out datasets! — Nisha
You Can’t Escape Hyperparameters and Latent Variables: Machine Learning as a Software Engineering Enterprise In his keynote presentation at NeurIPS 2020, Charles Isbell delivers a video montage conveying a stark reality of applied machine learning today - the field of ML is too narrowly focused on solutions without a broader consideration for their impact on the real human world. By sharing perspectives from multiple researchers, Charles Isbell makes the argument that the community needs to adopt systematic approaches rooted in software engineering, programming languages, ethics, and diversity to design and build more robust, holistic machine learning systems. — Andrew
Towards Clarifying the Theory of the Deconfounder Back when we wrote our report Causality for Machine Learning, I spent some time with the so-called “deconfounder” model of Wang and Blei. The original paper, The Blessings of Multiple Causes landed with a splash, generating much excitement in the machine learning community and skepticism among causal inference researchers. In a fantastic example of science in action, numerous subsequent papers have clarified the theory and use of the proposed technique. The essence of the original paper is that by constructing a latent factor model of multiple causes, under some assumptions, we can identify and estimate causal effects under weaker assumptions than usual. It turns out that there are additional assumptions in the new method too, but it does provide a new tool for causal inference in some scenarios. I remain convinced that causality is a necessary framework for machine learning in industry, and carries many benefits even when we care only about prediction. — Chris
AlphaFold: a solution to a 50-year-old grand challenge in biology This interactive article tells what seems like a fairy-tale story of how AI saves the day in the world of biology. Only it’s not a fairy tale! Fifty years ago a Nobel Prize-winning chemist predicted that a protein’s amino acid sequence should fully determine the shape of its structure. Knowing a protein’s shape and structure can provide insights into new diseases, allowing for better drug development and treatments. Unfortunately, predicting protein shapes is an astronomically difficult challenge. Known as the “protein folding problem,” one estimate is that any given protein could have 10^300 different possible configurations. Searching by brute force for the right structure would take longer than the age of the Universe! Until now. DeepMind’s AlphaFold has recently been recognized as a solution for the protein folding problem and demonstrates how AI can have a deeply meaningful impact on the life sciences and scientific discovery. — Melanie
Exhalation - Ted Chiang Exhalation is the second collection of short fiction from the author of Story of Your Life (adapted into the hit sci-fi film Arrival). The stories are the best kind of sci-fi, being both cerebral and grounded in a profound appreciation of the human experience. — Chris
How to Do Nothing - Jenny Odell This book from Bay Area-based installation and internet artist Odell was recommended to me several times during the year of its publishing (2019). It wasn’t until partway through 2020 (a year in which it’s felt at times like we are all are being forced to do nothing) that I finally picked it up. I’m glad I did. Odell offers an antidote to doomscrolling that is both permission to step back from engaging with contemporary concerns and a call to action to pay attention to ourselves and each other in new ways. I’m enjoying this one slowly, with a long walk and a few days between each chapter. — Juno