Welcome to the April Cloudera Fast Forward Labs newsletter, covering our latest research, livestream, and recommended reading.
After a short break to work on Applied ML Prototypes, we're just now wrapping up our first research cycle of 2021, and we're excited to share the results soon. Here's a preview:
Recommendation systems have become a cornerstone of modern life, from online retail to music and video streaming. In this report, we dig into session-based recommendations, a subset of recommendation systems that uses a user's browsing activity within a session to generate recommendations. Specifically, we explore how to treat this as a natural language problem and demonstrate how the now-classic Word2Vec algorithm can be retooled for the task.
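The sessions-as-sentences framing can be sketched with a toy example: each browsing session plays the role of a sentence, and each item the role of a word. The snippet below is illustrative only; the report retools Word2Vec itself, while this dependency-free stand-in ranks related items by within-session co-occurrence counts (the same co-browsing signal Word2Vec-style embeddings learn from), and all session data is made up.

```python
from collections import defaultdict

# Toy browsing sessions: each session is a "sentence", each item ID a "word".
sessions = [
    ["shoes", "socks", "laces"],
    ["shoes", "laces", "polish"],
    ["guitar", "strings", "picks"],
    ["guitar", "picks", "capo"],
]

def cooccurrence(sessions):
    """Count how often each pair of items appears in the same session."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for i, a in enumerate(session):
            for b in session[i + 1:]:
                counts[a][b] += 1
                counts[b][a] += 1
    return counts

def recommend(item, counts, k=2):
    """Return the k items most often co-browsed with `item`."""
    ranked = sorted(counts[item].items(), key=lambda kv: -kv[1])
    return [other for other, _ in ranked[:k]]

counts = cooccurrence(sessions)
print(recommend("shoes", counts))  # top items co-browsed with "shoes"
```

Swapping the counting step for a trained Word2Vec model (treating the session lists as its training corpus) yields dense item embeddings, and nearest neighbors in that embedding space become the recommendations.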
Keep your eyes on our social channels (Twitter here) for the report's release in mid-May.
Yesterday, Melanie, Chris, and Andrew hosted our second livestream, Fast Forward Live: Few-Shot Text Classification. For those who missed out, the replay is available at that link, no sign-up necessary.
Text classification can be used for sentiment analysis, topic assignment, document identification, article recommendation, and more. While dozens of techniques now exist for this fundamental task, many of them require massive amounts of labeled data in order to be useful. Collecting annotations for your use case is typically one of the most costly parts of any machine learning application. In the livestream, we showed how latent text embeddings enable classification with few (or even zero) training examples, and demonstrated the approach in a live demo.
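The core idea behind latent-embedding classification can be sketched in a few lines: embed both the documents and the label names into a shared vector space, then assign each document the label whose embedding is nearest, with no training examples required. The report relies on pretrained sentence embeddings; the stand-in below uses toy bag-of-words vectors and made-up labels so the sketch stays dependency-free.

```python
import math
from collections import Counter

def embed(text):
    """A toy 'latent' embedding: a bag-of-words frequency vector.
    (A real system would use a pretrained sentence encoder instead.)"""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def classify(document, labels):
    """Zero-shot classification: pick the label nearest the document."""
    doc_vec = embed(document)
    return max(labels, key=lambda label: cosine(doc_vec, embed(label)))

labels = ["sports news", "cooking recipes"]
print(classify("latest sports scores and match news", labels))  # → sports news
```

Because the label names themselves act as the "training data", adding a new class is as simple as adding a new label string; a handful of labeled examples, when available, can be folded in to refine the label embeddings.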
For more, check out our report: Few-Shot Text Classification.
You can also still catch a replay of our first livestream: Representation Learning for Software Engineers, where Victor and Andrew discussed the benefits of good representations, and how to learn them.
A few of our research engineers recommend some reading for this month:
10 Things You Need to Know About BERT and the Transformer Architecture that are Reshaping the AI Landscape Diving into a new area of research can be daunting, as it requires significant time to understand not only state-of-the-art approaches, but also the contextual history that has shaped the current methods. This is especially true of modern NLP concepts like BERT and the Transformer architecture. Luckily, this comprehensive blog post by neptune.ai does a fantastic job of summarizing the most important ideas in NLP today, including where this technology came from, how it was developed, and why it’s become ubiquitous. A great starting point for anyone curious about NLP. - Andrew
Proposal for a Regulation on a European approach for Artificial Intelligence The European Union has launched proposed regulations for AI systems (AI being broadly defined along use case, rather than technological, lines). This is a big deal, as it is the strongest AI regulation yet proposed. While it will be years before it becomes law, the proposal is strong: high-risk AI applications will be heavily regulated (guidance on what constitutes “high risk” is contained within). There is much to digest and discuss, with arguments appearing in both directions: on the one hand, there are major loopholes, and on the other, where the regulation is strong, it may temper innovation. However, the clear thrust of the proposal is that AI systems must respect fundamental rights, a direction I strongly endorse. - Chris
Hidden Technical Debt in ML Systems
As researchers, we are often focused on improving performance through new algorithmic approaches, and in the process we may entirely ignore the cost that comes with increased system complexity. The Hidden Technical Debt in ML Systems paper (though not recent) serves as a good reminder of the questions one should ask to justify an investment of time and resources in such pursuits.
The paper seeks to uncover the trade-offs that must be considered in practice, and focuses on system-level interactions and interfaces as an area where ML technical debt may rapidly accrue.
While technical debt has always been discussed from an engineering perspective, it is equally relevant for researchers: ML systems carry all the maintenance problems of traditional code, plus an additional set of data- and ML-specific issues. The paper suggests a few useful questions to consider.
- Nisha