Updates from Cloudera Fast Forward on new research, prototypes, and exciting developments

March 2022

Welcome to the March edition of the Cloudera Fast Forward Labs newsletter. This month we scratch the surface of one of the most cutting-edge natural language processing capabilities and share some of our favorite reads.

New Research!

In January’s newsletter we shared a sneak peek of our next research endeavor, and this month we’re excited to share an update.

An Introduction to Text Style Transfer

As humans, we understand that language is situational: not only is the context of written or spoken words important, but the style of text itself depends heavily on the user, environment, and scenario. As such, style is a vital consideration in making NLP more user-centric. For this reason, the task of text style transfer aims to automatically control the style attributes of a given piece of text while preserving its content.

Text style transfer is an exciting task that’s gaining attention in the NLP research community due to its challenging nature and numerous real-world applications. We offer this introduction to the topic as Part 1 of a multi-part blog series where we will be experimenting with text style transfer by training and evaluating our own models. Along the way we plan to discuss our approach, implementation details, challenges, and learnings – stay tuned!

Fast Forward Live!

Check out replays of livestreams covering some of our research from last year.

Deep Learning for Automatic Offline Signature Verification

Session-based Recommender Systems

Few-Shot Text Classification

Representation Learning for Software Engineers

Recommended reading

CFFLers share their favorite reads of the month.

Controversies around organisations’ use of machine learning applications have resulted in significant backlash driven by affected groups, as well as the introduction of a raft of legislation to force greater transparency and accountability. This has forced many organisations to, rightly, consider avenues for assessing the socio-technical impact of their applications, not just the performance metrics of the underlying models. This isn’t a minor undertaking. For one thing, it requires organisations to consider impacted groups that aren’t the primary target users of the product. For another, it requires them to work out ways of mitigating the negative impact as well as offering avenues of redress. Often, these considerations aren’t accommodated within the development and operational processes of many organisations. Algorithmic audits and AI bias bounties are two approaches that have emerged to fill this gap. This month I have been reading two reports exploring these approaches: Bug Bounties for Algorithmic Bias? by Algorithmic Justice League (AJL) and Algorithmic Accountability for the Public Sector by the Ada Lovelace Institute. — Ade

Data Distribution Shifts and Monitoring

Chip Huyen is at it again with another fantastic piece focused on ML system failure types, model and data drift, and observability. The article explains that most production ML failures are not specific to ML but stem from general software system issues: “Because of the importance of traditional software engineering skills in deploying ML systems, the majority of ML engineering is engineering, not ML”.

Even though they account for just a small portion of all production failures, ML-specific failures can be more dangerous than non-ML failures because they’re harder to detect and correct. Chip goes on to provide a comprehensive and detailed review of production ML failure modes that I wish I had available when researching and writing our report on Inferring Concept Drift Without Labeled Data. This is a must-read for anyone interested in or responsible for managing models in production. — Andrew

Lessons Learned on Language Model Safety and Misuse

When it first emerged from the academic world in 2020, GPT-3, then the largest language model ever trained, was kept behind closed doors for fear that its uncannily human-sounding generated text could cause irreparable harms. Due to these safety concerns, the model’s developers at OpenAI provided only limited access to the model’s outputs through an API. During this beta period, the OpenAI team has continuously evaluated their approach to deployment, as well as the model and its usage. This post summarizes some of the lessons their team has learned around safely deploying large language models and mitigating their misuse (spoiler alert: there’s no silver bullet). They also recently ended the waitlist for access to the OpenAI API (which includes GPT-3 and Codex), allowing more researchers and developers access to these technologies. OpenAI has a track record of being thoughtful about AI stewardship, so this development will surely yield a new round of lessons to learn in responsible deployment. — Melanie