May 25, 2016 · interview

Human-Machine Algorithms: Interview with Eric Colson

Therefore render unto Caesar the things that are Caesar’s, and unto God the things that are God’s.
– Matthew, 22:21

We tend to think that recommender systems are old hat. ECommerce platforms like Amazon have been using techniques like collaborative filtering for years to help shoppers navigate vast catalogues by inferring consumer taste from past behavior. And yet, we’ve all experienced the limitations of these approaches (that time you bought a toilet bowl plunger to subsequently be flooded by recommendations for strange bathroom accessories). This may be a nuisance for consumers, but it doesn’t jeopardize eCommerce business models: only 35% of Amazon’s sales, for example, are driven from recommendations.

But what would happen if the stakes were higher? What kind of algorithmic creativity would it take to build a company with revenue based entirely on recommendations?

This is the challenge faced by Eric Colson’s team at Stitch Fix. Stitch Fix is an online “personal styling service” that selects clothing apparel for customers (they also have a solid technical blog). Instead of recommending items for shoppers to choose from, Stitch Fix goes a step further and simply ships items its algorithms and stylists choose to customers. The key to making this work, according to Colson, is to optimize the division of labor between human and machine.

Colson spoke about the value of human-machine collaborations at the March Data Driven NYC Meetup. We interviewed him to learn more about the approach.

Let’s start with your background. How did you become a data scientist?

My father was a physicist, and physics features the most exotic math out there.  Apparently – although this was too early for me to remember – he used to tape handwritten exotic math symbols inside my crib when I was a baby. I suppose this early exposure shaped my future obsession with microeconomics, and particularly graphs. Graphs are so pure, so elegant: I love how the smooth lines clearly show the theoretical relationship between things like price and demand (even if it’s always more complex in practice). I was lucky, as I entered the job market just when it was becoming viable to do price things like elasticity in the tech industry. You could record transactions and, with a bit of statistics, actually capture elasticity curves. Once in business, I had to learn how to get the data required to build these graphs. That meant learning the engineering behind data warehouses and ETL. It meant learning more sophisticated stats. At Yahoo, it eventually meant teasing relationships out of big data to optimize insights for human decision making. And, finally, at Netflix, it meant building systems for machine consumption - that is, automated decision-making or ‘algorithms’. You could even say I became a data scientist as a natural evolution of my sole pursuit: to optimize microeconomic graphs.

How did you choose your title? Chief Algorithms Officer isn’t something you see every day.

I’m not the only one, trust me. Udi Manber was the former algorithms officer at Amazon. I came to Stitch Fix to assume an officer role, and played with different words. I consider the title to play a recruiting function to attract the right talent. Chief Analytics Officer sounded too much like BI. Chief Data Officer sounded like a CIO, connoting data storage. I did consider Chief Data Scientist, and I love the word, but think there’s still confusion about what it means – it’s hard to capture a role that combines statistics, engineering, and business problems, and it’s getting more complicated now that artificial intelligence is used to describe machine learning algorithms. At any rate, I chose the title because the essence of what we do is to work with algorithms.

What inspired you to leave Netflix for Stitch Fix?

Katrina Lake, Stitch Fix’s Founder and CEO, engaged me in early 2012 as an advisor. I had no intentions of joining at the time, but was drawn to the work because it sought to mix algorithmic decision making with human decision-making in a way I thought merited exploration. At Netflix, I started to appreciate the limitations of a purely algorithmic approach for complex business tasks with an algorithm we built to predict the popularity of our content. This was critical when we shifted from the DVD model to streaming. With DVDs, you just buy and manage the inventory, and you can always buy more if needed; with streaming, you license content from studios, and pay high-dollar fixed costs over a set time period. That means, if no one watches a movie, Netflix still pays. It was therefore important to pick movies and shows many people would watch to cover costs. We gathered data from every source we could to build these models. They usually did well but there were anomalies where non-mainstream films with average ratings performed way better than expected. But if we just looked at the website we immediately knew why: the box shots would have sexy women in provocative positions, or even branding with subliminal effects (like the yellow National Geographic covers). Humans see this immediately, but it was not a feature we thought to engineer at the time (deep learning was not what it is today). So we built a little app where Netflix employees could indicate whether a movie had a sexually appealing image on it – we added human input to our models where human judgment performs better than algorithms. This, of course, is key to fashion image data and at Stitch Fix the entire company’s success was predicated on the algorithms working well. After six months of advising Stitch Fix, I decided to join full time.

How does human-machine collaboration work at Stitch Fix?

The machine learning happens first, and we combine all sorts of algorithms for different sub tasks, be they neural nets, collaborative filtering, mixed effects models, naïve Bayes, etc. to do a first pass at recommending styles for individual customers. Machines are far more efficient than humans and we leverage them for the rote calculations in our process. We leave the other types of activities - like synthesizing ambient information, improvising, fostering a relationship with the customer, applying empathy - to humans. The final step is logistics to manage delivery. It’s a division of labor modeled after Daniel Kahneman’s two systems of thinking in Thinking, Fast and Slow. The machines take the calculations and probabilities; the humans take the intuition. But there are overlapping tasks they share.

For example?

When it comes to fashion, customers have a hard time saying what they like. They’d rather show what they like. The vast majority of customers pin examples of styles the like on Pinterest that we can process in two ways. Machines can vectorize the pinned image and compare it to our inventory using convolutional neural nets (AlexNet) to find items that are similar, with similarity measured as Euclidean distance. Once we have this short list of items, we pass this to humans who process ambient information, recognize if the recommended items are too similar to the pinned image, or even judge whether the images are more aspirational than literal and modify suggestions accordingly.  Take something like a leopard print dress. Machines are very pedantic: they can distinguish leopard print from cheetah print, but don’t have the social sense to know that a woman who likes leopard print would very likely also like cheetah print.

It can take a long time and a lot of data to train a convolutional neural net. Does your team have any creative techniques to speed up the process?

We’ve experimented with generative adversarial networks for our image processing. You use one “discriminative” neural net to break down images and find the elements of style, and then a generative net to put the building blocks back together to reproduce the images. It’s pretty quick to train another machine to then distinguish between real and machine-generated images.

Visualization of samples from a generative adversarial net in a 2014 article by Goodfellow et al.

What about improving the performance of the human stylists?

Humans are classic overfitters. As Kahneman explains, we don’t naturally think statistically, but tend to build narratives around the details most prominent in our consciousness. Say a stylist receives a tall customer, suggests some item she thinks would be good for tall people, and receives positive feedback from the customer. She’ll then reinforce the association between tallness and the product feature from this one win, often overlooking the hundreds of other instances where this didn’t work. Statistics are therefore a wonderful (and necessary) support to mitigate associative fallacies. The other challenge with our system is that there’s often a long delay between a stylist’s decision to pick a certain item and feedback from the customer on whether they like it (or it’s the right size, fit, etc.). Humans work best with instantaneous, unambiguous feedback: they forget why they did something by the time we have input from customers. So we built a parallel, “labs” version of the interface to give stylists realtime feedback. We use historical shipment data to simulate how a customer would react to give this quick feedback.  

What percent of your recommendations are informed by collaborative filtering?

It’s true that matrix factorization is the most intuitive place to start for product recommendations, but they tend to fall short in the fashion industry because trends change so fast. The stuff we have now will be gone next month, so we have really sparse matrices.  Most clients are getting boxes only on a monthly - or even less frequent - basis. This means we only get feedback on about 5 items per month per client on average.  As such, it’s better to use other features to inform our recommendations, and our techniques run the gambit from neural networks to mixed effects models. That said, there are customers who get more items more frequently and where we have enough information to compare them with others. In those cases, collaborative filtering works.

How do you organize your team to support this diversity of algorithms and workflows?

Five groups report into me, and they are designed for autonomy. They have as few people as possible to work with and have to build their own pipelines: they do the math, learn the algorithms, communicate with the business, manage projects. I think this is better because it’s faster (there’s no coordination cost so you can iterate faster), attracts top talent, and supports long-term systems as data scientists understand things deeply and can make gradual tweaks to code. Three teams – merchandise algorithms, styling algorithms, and client algorithms – align with current business functions. The Merchandise Algorithms team works with merchants on procurement. The Styling Algorithms team works with stylists - it’s the recommendation workflow. The Client Algorithms team works with marketing on acquisition and retention: are clients happy? What’s the right action to take with them? Next, the data platform team is responsible for the infrastructure required to make the different algorithms run. We don’t have data engineers writing ETL on purpose. The final team is the labs team, which handles experimental stuff that hasn’t yet made it into the product (they are exploring the use of NLP and computer vision algorithms). Labs also manages our tech branding, like the Multithreaded blog.

What’s next for Stitch Fix?

One exciting piece of news is that we now have five proprietary, data-driven clothing lines. The items are like Frankenstyles, combining attributes (sleeve length, flair, etc…) that work well for a particular demographic but may not yet coexist in a single piece. We refine the items based on feedback, tweaking until it works. These will be shipped soon!

Colson’s talk at FirstMark’s Data Driven NYC.

Read more

May 27, 2016 · announcement
May 23, 2016 · guest post

Latest posts

Sep 22, 2021 · post

Automatic Summarization from TextRank to Transformers

by Melanie Beck · Automatic summarization is a task in which a machine distills a large amount of data into a subset (the summary) that retains the most relevant and important information from the whole. While traditionally applied to text, automatic summarization can include other formats such as images or audio. In this article we’ll cover the main approaches to automatic text summarization, talk about what makes for a good summary, and introduce Summarize. – a summarization prototype we built that showcases several automatic summarization techniques. more
Sep 21, 2021 · post

Extractive Summarization with Sentence-BERT

by Victor Dibia · In extractive summarization, the task is to identify a subset of text (e.g., sentences) from a document that can then be assembled into a summary. Overall, we can treat extractive summarization as a recommendation problem. That is, given a query, recommend a set of sentences that are relevant. The query here is the document, relevance is a measure of whether a given sentence belongs in the document summary. How we go about obtaining this measure of relevance varies (a common dilemma for any recommendation system). more
Sep 20, 2021 · post

How (and when) to enable early stopping for Gensim's Word2Vec

by Melanie Beck · The Gensim library is a staple of the NLP stack. While it primarily focuses on topic modeling and similarity for documents, it also supports several word embedding algorithms, including what is likely the best-known implementation of Word2Vec. Word embedding models like Word2Vec use unlabeled data to learn vector representations for each token in a corpus. These embeddings can then be used as features in myriad downstream tasks such as classification, clustering, or recommendation systems. more
Jul 7, 2021 · post

Exploring Multi-Objective Hyperparameter Optimization

By Chris and Melanie. The machine learning life cycle is more than data + model = API. We know there is a wealth of subtlety and finesse involved in data cleaning and feature engineering. In the same vein, there is more to model-building than feeding data in and reading off a prediction. ML model building requires thoughtfulness both in terms of which metric to optimize for a given problem, and how best to optimize your model for that metric! more
Jun 9, 2021 ·

Deep Metric Learning for Signature Verification

By Victor and Andrew. TLDR; This post provides an overview of metric learning loss functions (constrastive, triplet, quadruplet, and group loss), and results from applying contrastive and triplet loss to the task of signature verification. A complete list of the posts in this series is outlined below: Pretrained Models as Baselines for Signature Verification -- Part 1: Deep Learning for Automatic Offline Signature Verification: An Introduction Part 2: Pretrained Models as Baselines for Signature Verification Part 3: Deep Metric Learning for Signature Verification In our previous blog post, we discussed how pretrained models can serve as strong baselines for the task of signature verification. more
May 27, 2021 · post

Pretrained Models as a Strong Baseline for Automatic Signature Verification

By Victor and Andrew. Figure 1. Baseline approach for automatic signature verification using pretrained models TLDR; This post describes how pretrained image classification models can be used as strong baselines for the task of signature verification. The full list of posts in the series is outlined below: Pretrained Models as Baselines for Signature Verification -- Part 1: Deep Learning for Automatic Offline Signature Verification: An Introduction Part 2: Pretrained Models as Baselines for Signature Verification Part 3: Deep Metric Learning for Signature Verification As discussed in our introductory blog post, offline signature verification is a biometric verification task that aims to discriminate between genuine and forged samples of handwritten signatures. more

Popular posts

Oct 30, 2019 · newsletter
Exciting Applications of Graph Neural Networks
Nov 14, 2018 · post
Federated learning: distributed machine learning with data locality and privacy
Apr 10, 2018 · post
PyTorch for Recommenders 101
Oct 4, 2017 · post
First Look: Using Three.js for 2D Data Visualization
Aug 22, 2016 · whitepaper
Under the Hood of the Variational Autoencoder (in Prose and Code)
Feb 24, 2016 · post
"Hello world" in Keras (or, Scikit-learn versus Keras)


In-depth guides to specific machine learning capabilities


Machine learning prototypes and interactive notebooks


A usable library for question answering on large datasets.

Explain BERT for Question Answering Models

Tensorflow 2.0 notebook to explain and visualize a HuggingFace BERT for Question Answering model.

NLP for Question Answering

Ongoing posts and code documenting the process of building a question answering model.

Interpretability Revisited: SHAP and LIME

Explore how to use LIME and SHAP for interpretability.


Cloudera Fast Forward is an applied machine learning reseach group.
Cloudera   Blog   Twitter