Updates from Cloudera Fast Forward on new research, prototypes, and exciting developments

Welcome to the January edition of Cloudera Fast Forward’s monthly newsletter.

Top ML research developments of 2020

In lieu of new research of our own this month, we asked each of our research engineers to take a look back at their top research developments of 2020.

The prevalence of MLOps

Andrew

It has long been known that the world of machine learning lacks the rigor and discipline of traditional software engineering practices. In 2020, as more organizations matured in ML capability from experimentation towards integrated production systems, the need for such discipline became increasingly apparent. Scaling ML effectively is difficult and requires deliberate capacity for model and data lifecycle management, versioning and iteration, governance, release integration, monitoring, and testing.

Consequently, MLOps - an engineering culture and practice that aims to unify ML system development and ML system operations - has taken a strong hold within the ML community. MLOps advocates for automation and monitoring at all steps of system construction, and while a slew of tools and technologies have emerged to help satiate the needs of production ML, attention must also be paid to the formation of ML roles, teams, and processes.

In the year ahead, I hope to see the practice of MLOps shift from a nice-to-have afterthought towards an upfront requirement for ML projects, as well as the standardization and alignment of tools and tasks involved.

Language models are bigger than ever — but are they better?

Melanie

NLP models are just getting ridiculous.

The past twelve months have seen records repeatedly shattered: the size of language models continues to grow exponentially and seemingly without bound. Last February, Microsoft announced what was, at the time, the largest language model ever trained. Named Turing-NLG, it consists of 17 billion parameters. But the fanfare was overshadowed by the unveiling in May of GPT-3, OpenAI’s eye-popping 175 BILLION parameter behemoth. This model could seemingly do it all: writing poems, op-eds, and even working code! But just last week, Google raised the stakes again with the Switch Transformer, describing a language model trained with ONE TRILLION parameters.

While these models exhibit jaw-droppingly impressive capabilities, this arms race of size harbors a disquieting number of growing ethical concerns:

training these models costs MILLIONS of dollars, ensuring that only the most wealthy tech companies can create them, calling into question the “democratization” of NLP
training requires massive amounts of electricity from typically non-renewable sources, leading some to question their worth from a standpoint of climate change
and perhaps most concerning of all, language models are rife with biases against marginalized groups in often subtle and pernicious ways

At Cloudera Fast Forward, we’ve always prioritized the ethics of machine learning, and while this year has seen the rise of the Mega-Models, my hope is that in 2021 the NLP community begins to more rigorously address the ethical concerns around training and using large language models. While such efforts have begun, one key to solidifying change would be an agreement within the community to prioritize new metrics and standards for the quality of language models, rather than merely competing for leaderboard accuracy.

Bridging the gap between research and production

Nisha

While machine learning continues to make strides in what can be accomplished with state-of-the-art research approaches, it has become increasingly evident that the industry faces tremendous challenges when it comes to deploying these applications. These challenges include managing data pipelines, building end-to-end platforms for developing and deploying ML applications, model monitoring, building chips optimized for ML algorithms and inference, and even more fundamental issues surrounding ethics and fairness.

In 2021 and the coming years, I hope academia and the ML community as a whole can rise to the occasion and focus on these interesting questions, especially when it comes to standardizing tools and frameworks and thus democratizing access to ML in a true sense. That said, we have already started to see multiple venues from which a practitioner can learn, including Stanford’s MLSys Seminar Series and the ICML workshop Challenges in Deploying and Monitoring ML Systems.

Advances in self-supervised learning

Victor

Many real-world problems are characterized by the availability of large datasets but few labels. Self-supervised learning approaches, such as contrastive learning, self-training, and generative modeling, provide practical pathways for learning from these data sources while avoiding the costs associated with labelling. Two of the more interesting applications of self-supervised learning I saw last year are CLIP, a model that learns visual concepts from natural language supervision, and SwAV, a model that learns visual features by contrasting cluster assignments. Both methods introduce new formulations of contrastive learning (applicable to multiple problem domains) and achieve performance on par with fully supervised learning models! For more information, see the self-supervised learning section of our recent blog post, Representation Learning 101 for Software Engineers.

The causal revolution continues

Chris

Causal inference has been on the periphery of the collective machine learning research agenda for some time, with some high profile debates in the respective causal and ML research communities (see, for instance, Towards Clarifying the Theory of the Deconfounder). While no single event last year marked a breakout success of causal reasoning in machine learning, it continues to gather attention, with Causal Learning featuring as the Breiman Lecture at NeurIPS, alongside dedicated workshops at several major conferences.

My favourite related work of the year was Underspecification Presents Challenges for Credibility in Modern Machine Learning. It points out a very practical problem: that highly parametrized models are very vulnerable to the kind of distributional changes that occur when moving between training and production deployment. The authors identify model underspecification as the source, and, while only barely talking about causality, offer a gateway into causal reasoning as a route to robustness (to my mind, the paper clearly illustrates the need for both).

At some point in the past, I viewed causal reasoning as academic and impractical. In 2020, it became abundantly clear to me that causal reasoning carries enormous implications for real-world, deployed machine learning systems. As businesses continue to adopt ML-enabled systems, the need for understanding beyond correlation will only grow.

Much of the introductory causal learning literature is written for people with a background in statistical inference, rather than predictive machine learning systems. If you’re intrigued by causality, but don’t know where to start, I humbly suggest our report Causality for Machine Learning, which provides an on-boarding suitable for data scientists, machine learning engineers and technology leaders.

A causal question: which came first, the chicken or the egg?

That’s all from us this month. Thanks for reading!