Apr 28, 2017 · interview

Eli Bressert on Data-Driven Processes at Netflix

Despite their best intentions, companies often struggle to develop processes that provide data-driven decisions through partnership across the business. Netflix stands out as a company that excels at deeply integrating its data science teams into all aspects of the company. We spoke with Eli Bressert, Manager, Data Engineering and Analytics at Netflix, to learn more about how they create the culture and processes to support and sustain that integration.

Tell us about your background. What did you study?

I started out studying philosophy as an undergrad in Utrecht, with a desire to understand the basic roots of human knowledge. Taking courses in the history and philosophy of science, I discovered Karl Popper (known for his theory of falsifiability in scientific discourse) and Thomas Kuhn (known for his work on scientific revolutions), who inspired me to shift into science. The transition was tough: math programs are strong in the Netherlands, so I had to play catch-up and teach myself partial differential equations. That catch-up turned into a great benefit, because I fell in love with maths as well (I double majored in astro and maths). It was worth it, and I ended up focusing on astrophysics. My thirst to get my hands dirty observing the cosmos led me to Hawaii, which houses 13 large telescopes on Maunakea, one of the highest mountains in the world! It was crazy up there. There were nights with blizzards of 10 to 14 feet of snow, and we had to brave the elements to put coolant in the telescopes. Thinking back on it always reminds me of the scene in The Empire Strikes Back where Luke Skywalker falls over sideways in the blizzard. At any rate, working with telescopes is intense: it’s being on the edge of physical human capability.

An X-ray image of a pulsar Eli made while working at the Chandra Space Telescope.

Astrophysics is a common gateway to data science: two members of the Fast Forward Labs research team followed the same path. How did you transition from astrophysics to data science?

After Hawaii, my wife and I moved to Boston, where I worked on making images from the Chandra Space Telescope for popular news outlets like CNN or BBC News. This involved learning color theory to transform spectra we cannot physically perceive (gamma ray, X-ray, or infrared) into aesthetically pleasing palettes to make space images interpretable. I quickly realized that I could automate most of my work with Python, so my path to Python enlightenment began. In the span of six years I wrote two well-known astronomy packages with my astro-partner-in-crime (Tom Robitaille), my PhD thesis, and the O’Reilly book about SciPy and NumPy. That final experience was the gateway drug that made me consider startups. The first startup that I worked with as an advisor was authorea.com, an online service to collaborate in LaTeX and simplify the complexity of writing mathematical manuscripts. I loved the energy of the startup world, and joined the Insight Data Science program to give me a nudge into the professional world.

We work closely with Insight, and have mentored fellows on deep learning projects. What aspect of the experience was most valuable for you?

I had already spent years developing technical data science skills, so I considered Insight to be a means to accelerate my network and find a great job. Jake Klamka and his team are highly connected, and I wanted broad exposure to companies across the Bay Area. Moreover, the Insight program has an added benefit: the alumni cohorts usually become good friends and help each other grow collectively as their careers progress. I had the chance to learn from experienced practitioners like Monica Rogati and Hilary Mason and ended up leading research for Stitch Fix (see a previous blog interview with Stitch Fix Chief Algorithms Officer Eric Colson).

Many consider Netflix to be the poster child for a data-driven company. Is that fact or fiction?

It’s fact: Netflix is incredibly data driven. Many people are shocked when they join the company because we don’t make any product changes without first analyzing data. We have multiple departments that use data in different ways. The platform team focuses on infrastructure and tools. Data Engineering & Analytics focuses on data engineering, ETL, and analytics, and offers self-service tools to business teams so they can make decisions autonomously. The Science and Algorithms team focuses on predictive modeling, algorithm research/prototyping, and experimentation (A/B testing). The Product Analytics team has developed an in-house A/B testing platform, Ignite, to get the right insights in front of the right people and provide easy-to-consume visualizations so users can view outputs on a daily basis. I’m just mentioning a subset of the teams involved in the data space, so there’s a lot more!

Slide reading “Maximizing value of data product” with three items: self-service tools, analytics, a/b testing

Key components for maximizing the value of a data product. From Eli’s talk Data Over Matter.

What are a few examples of data-driven product decisions?

One example is customer retention. We track behaviors of customers on our site and test to see if different onboarding experiences, like creating profiles and suggesting movies users may enjoy, are correlated with longer-term retention. Before going through with these onboarding changes, we performed A/B tests and made sure that what we changed would have an impact. A simple and illustrative example is the color of a button on the sign-up page. We might start with a hypothesis, drawn from color theory, that a green button will lead to more signups than a red button. If we identify a correlation to support this hypothesis, we’ll A/B test the various button colors, but won’t commit to the rollout until we feel there’s a causal relationship. This is a joint effort between DEA (Data Engineering & Analytics), engineers, PMs, the Science & Algorithms team, and more.
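To make the button-color example concrete, here is a minimal sketch of how such an A/B comparison could be evaluated. It assumes a standard two-proportion z-test, which is one common choice and not necessarily what Netflix or Ignite uses; the function name and the signup counts are made up for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare two conversion rates; return (z, two-sided p-value).

    conv_a / n_a: signups and visitors for variant A (e.g. red button)
    conv_b / n_b: signups and visitors for variant B (e.g. green button)
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, not real data: 10,000 visitors per variant
z, p = two_proportion_z_test(conv_a=1000, n_a=10_000, conv_b=1100, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference in signup rates is unlikely to be chance alone. Because visitors are randomly assigned to variants, a result like this supports a causal reading in a way that an observed correlation does not.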

Data science projects often fall short because they don’t deliver as much value as hoped for to the business. How do you make sure your data efforts stay relevant at Netflix?

Data teams at Netflix are organized to make sure we’re very tightly aligned with the company at large. This keeps us efficient (so we don’t get lost on dead-end projects) and helps us hone judgment on where to focus our attention and what kinds of questions to ask about our data. Many data science and engineering friends at other companies have told me that business people view them as holding purely technical, rather than strategic, roles. Netflix is such a successful data company precisely because data has been built into the culture and operations from day one, and leadership exposes the data team to the big-picture, strategic questions. We then orient our work to explore questions that are relevant to Netflix in the near- and long-term future, running some lightweight experiments and even deep analyses to draft memos for the leadership team communicating our early conclusions. These memos can be immensely impactful as a vehicle to get leadership on board to make changes suggested by our analyses. It’s all about scaling multiple levels of communication: memos to tell the high-level, big-picture story, a dashboard to provide additional insights, and the ability for users to dig deeper into the analytics if they like to probe and question the results.

Slide reading “Netflix Culture: Freedom and Responsibility”

Part of the Netflix culture is making sure people feel invested in their work and clear in their responsibilities. From Eli’s talk Data Over Matter.

Do you ever see tensions between different stakeholders on data teams? If so, how do you resolve them?

I’ve seen some confusion arise when there’s redundancy between two teams or ambiguity regarding who is responsible for what. I think it’s really important to give teams a north star, a goal they are aiming towards, as opposed to micromanaging their every move directly. To attract strong talent, and keep passion alive, it’s important that people feel they really own their work, and are working towards a rewarding goal. We filter for passion in our interview process and only bring on folks who are completely enthusiastic and passionate about the specific role they’re taking on, be that data science, analytics, or engineering. That said, we also hire for diversity, with some people bringing 15 years of experience in data engineering, and others just out of a PhD program. Again, the key is to make sure everyone has their north star, and then let their specific talents emerge and grow.

What emerging machine learning capabilities are most exciting to you?

I’m curious to see how research in Gaussian processes (which Ryan Adams does a good job introducing on Talking Machines) develops. They used to be quite hard to scale, but I’ve seen a few papers recently that promise to change that. They’re extremely interpretable models that can be used in many different ways. In deep learning, I’m most excited by game-playing algorithms like AlphaGo or Libratus given how they replicate probabilistic and strategic thinking. There are so many problems that we couldn’t resolve deterministically that are being unlocked by new probabilistic techniques.

You’re an avid reader, averaging a book per week. Any recent reads you’d recommend to others?

I loved Liu Cixin’s The Dark Forest, the sequel to The Three-Body Problem. There’s some pretty hard physics in there, and I find Cixin’s environmentalist critique extremely intriguing: Is the dark forest theory really the reality that we’re facing in our Universe? Last but not least, the third book in the series (Death’s End) has one of the best descriptions of 4D space ever written in a sci-fi novel. A must read by all means!

For more on data science at Netflix be sure to check out Eli’s talk Data Over Matter - Innovating the Next Generation of Data Products at Netflix.

Cloudera Fast Forward Labs

Making the recently possible useful.

Cloudera Fast Forward Labs is an applied machine learning research group. Our mission is to empower enterprise data science practitioners to apply emergent academic research to production machine learning use cases in practical and socially responsible ways, while also driving innovation through the Cloudera ecosystem. Our team brings thoughtful, creative, and diverse perspectives to deeply researched work. In this way, we strive to help organizations make the most of their ML investment as well as educate and inspire the broader machine learning and data science community.
