Welcome to the June edition of Cloudera Fast Forward’s monthly newsletter. This month, alongside our regular recommended reading, we have two exciting research announcements!
Here in the Fast Forward lab, we’re always asking ourselves a lot of questions. Now we’re asking BERT a lot of questions too! Our current research focus is question answering systems. In place of a report with all our learnings at the end of our research, we’re inviting you to follow along as we explore building a question answering system using modern neural architectures. We just released our third blog in the series, and you can check out each of them below:
Intro to Automated Question Answering
This introductory post discusses what QA is and isn’t, where this technology is being employed, and what techniques are used to accomplish this natural language task.
Building a QA System with BERT on Wikipedia
Follow along with this post to build a working Information Retrieval-based QA system, with BERT as the document reader and Wikipedia’s search engine as the document retriever. This is a fun toy model that that hints at potential real-world use cases.
Evaluating QA: Metrics, Predictions, and the Null Response
In this post, we look at how to assess the quality of a BERT-like model for Question Answering. We cover what metrics are used to quantify quality, how to evaluate your model using the Hugging Face framework, and the importance of the “null response” – questions that don’t have answers – for both improved performance and more realistic QA output.
Our latest research report — Causality for Machine Learning — is live, and the webinar is available on demand!
Causality is an emerging area of focus in data science practice, especially when we want to make decisions based on our models. Causality provides a framework for understanding which statistical relationships are true, and which only appear to be true in some circumstances. Our report provides guidance on when and how we need to think about causality.
Even when a problem does not require causal reasoning, we can greatly improve the robustness and generalizability of our machine learning models by taking some lessons from causality. The report outlines techniques that enable machine learning models to perform well across diverse unseen environments, including those that they were not trained on. This is applicable to any machine learning problem where we would like our models to perform well across diverse environments. In particular there are applications in natural language processing and computer vision, which we demonstrate in the accompanying prototype, Scene.
The frontier of simulation based inference Deep learning gets a lot of attention, but it is not the only area of machine learning where progress has been made. Performing inference over intractable likelihoods is widely applicable, and especially where a sophisticated simulator already exists. Examples discussed focus mostly on physical phenomena, from particle colliders, to the structure of the Universe, to epidemics. Personally, I suspect that useful simulators could be constructed for many applied and industrial use cases. This paper outlines recent developments that are enabling statistical inference over these highly complex simulators. - Chris
The Neural Hype and Comparisons Against Weak Baselines This paper illustrates how the use of weak baselines can artificially inflate the “value” of neural methods in the information retrieval domain. In a particularly interesting example, the authors demonstrate how a well designed query expansion approach (BM25 + RM3) provide comparable performance to neural retrieval methods at a fraction of the cost and complexity. - Victor
A Survey of Methods for Model Compression in NLP Transformer-based language models have dramatically advanced the performance of several NLP capabilities like machine translation, summarization, and question answering. However, these models are huge and computationally complex, and implementing them in practice remains challenging with traditional computational constraints. This post summarizes several avenues of research that aim to reduce the size of these beastly models without sacrificing the quality of their output — from software efficiency tricks to techniques like knowledge distillation — making them smaller, lighter, faster and, ultimately, more accessible for practical applications. - Melanie
Shortcut Learning in Deep Neural Networks While the era of deep learning has brought about significant progress and advancement in the field of machine intelligence, a general focus on shattering traditional performance benchmarks has overshadowed the due diligence needed to truly understand model limitations. This paper discusses shortcut learning — the idea that deep learning models often learn superficial rules to perform well on a given benchmark and consequently fail to transfer that level of performance to real-world scenarios. The authors present a set of recommendations for proper model interpretation and performance benchmarking. - Andrew
StereoSet: Measuring stereotypical bias in pretrained language models This paper releases a thoughtfully crowdsourced dataset along with an evaluation metric (ICAT) to measure biases in four domains: race, gender, religion and professions. The authors tested various pretrained models including BERT, RoBERTA, XLNet, GPT2. They show, not surprisingly, that these models exhibit strong stereotypical biases from both a within-sentence and intra-sentence perspective. Dataset and leaderboard can be found at https://stereoset.mit.edu/ - Shioulin
Designing and evaluating metrics Our ability to capture data and measure outcomes enables us to improve our products, decisions, user experience and such. And while this is increasingly applicable in the data science world, it’s been how scientific pursuits have been long accomplished. This blogpost by Sean Taylor discusses five main properties to keep in mind when designing a metric: cost, simplicity, faithfulness (which is often invalidated due to sampling bias or by measuring the wrong thing), precision and causal proximity (that measures how closely your product or decision changes affect the metric). Further, he discusses how metric design is an iterative process that requires input from multiple stakeholders and like code involves testing, re-evaluation, tweaking and could be eventually replaced. - Nisha