Therefore render unto Caesar the things that are Caesar’s, and unto God the things that are God’s.
– Matthew, 22:21
We tend to think that recommender systems are old hat. ECommerce platforms like Amazon have been using techniques like collaborative filtering for years to help shoppers navigate vast catalogues by inferring consumer taste from past behavior. And yet, we’ve all experienced the limitations of these approaches (that time you bought a toilet bowl plunger to subsequently be flooded by recommendations for strange bathroom accessories). This may be a nuisance for consumers, but it doesn’t jeopardize eCommerce business models: only 35% of Amazon’s sales, for example, are driven from recommendations.
But what would happen if the stakes were higher? What kind of algorithmic creativity would it take to build a company with revenue based entirely on recommendations?
This is the challenge faced by Eric Colson’s team at Stitch Fix. Stitch Fix is an online “personal styling service” that selects clothing apparel for customers (they also have a solid technical blog). Instead of recommending items for shoppers to choose from, Stitch Fix goes a step further and simply ships items its algorithms and stylists choose to customers. The key to making this work, according to Colson, is to optimize the division of labor between human and machine.
Let’s start with your background. How did you become a data scientist?
My father was a physicist, and physics features the most exotic math out there. Apparently – although this was too early for me to remember – he used to tape handwritten exotic math symbols inside my crib when I was a baby. I suppose this early exposure shaped my future obsession with microeconomics, and particularly graphs. Graphs are so pure, so elegant: I love how the smooth lines clearly show the theoretical relationship between things like price and demand (even if it’s always more complex in practice). I was lucky, as I entered the job market just when it was becoming viable to do price things like elasticity in the tech industry. You could record transactions and, with a bit of statistics, actually capture elasticity curves. Once in business, I had to learn how to get the data required to build these graphs. That meant learning the engineering behind data warehouses and ETL. It meant learning more sophisticated stats. At Yahoo, it eventually meant teasing relationships out of big data to optimize insights for human decision making. And, finally, at Netflix, it meant building systems for machine consumption - that is, automated decision-making or ‘algorithms’. You could even say I became a data scientist as a natural evolution of my sole pursuit: to optimize microeconomic graphs.
How did you choose your title? Chief Algorithms Officer isn’t something you see every day.
I’m not the only one, trust me. Udi Manber was the former algorithms officer at Amazon. I came to Stitch Fix to assume an officer role, and played with different words. I consider the title to play a recruiting function to attract the right talent. Chief Analytics Officer sounded too much like BI. Chief Data Officer sounded like a CIO, connoting data storage. I did consider Chief Data Scientist, and I love the word, but think there’s still confusion about what it means – it’s hard to capture a role that combines statistics, engineering, and business problems, and it’s getting more complicated now that artificial intelligence is used to describe machine learning algorithms. At any rate, I chose the title because the essence of what we do is to work with algorithms.
What inspired you to leave Netflix for Stitch Fix?
Katrina Lake, Stitch Fix’s Founder and CEO, engaged me in early 2012 as an advisor. I had no intentions of joining at the time, but was drawn to the work because it sought to mix algorithmic decision making with human decision-making in a way I thought merited exploration. At Netflix, I started to appreciate the limitations of a purely algorithmic approach for complex business tasks with an algorithm we built to predict the popularity of our content. This was critical when we shifted from the DVD model to streaming. With DVDs, you just buy and manage the inventory, and you can always buy more if needed; with streaming, you license content from studios, and pay high-dollar fixed costs over a set time period. That means, if no one watches a movie, Netflix still pays. It was therefore important to pick movies and shows many people would watch to cover costs. We gathered data from every source we could to build these models. They usually did well but there were anomalies where non-mainstream films with average ratings performed way better than expected. But if we just looked at the website we immediately knew why: the box shots would have sexy women in provocative positions, or even branding with subliminal effects (like the yellow National Geographic covers). Humans see this immediately, but it was not a feature we thought to engineer at the time (deep learning was not what it is today). So we built a little app where Netflix employees could indicate whether a movie had a sexually appealing image on it – we added human input to our models where human judgment performs better than algorithms. This, of course, is key to fashion image data and at Stitch Fix the entire company’s success was predicated on the algorithms working well. After six months of advising Stitch Fix, I decided to join full time.
How does human-machine collaboration work at Stitch Fix?
The machine learning happens first, and we combine all sorts of algorithms for different sub tasks, be they neural nets, collaborative filtering, mixed effects models, naïve Bayes, etc. to do a first pass at recommending styles for individual customers. Machines are far more efficient than humans and we leverage them for the rote calculations in our process. We leave the other types of activities - like synthesizing ambient information, improvising, fostering a relationship with the customer, applying empathy - to humans. The final step is logistics to manage delivery. It’s a division of labor modeled after Daniel Kahneman’s two systems of thinking in Thinking, Fast and Slow. The machines take the calculations and probabilities; the humans take the intuition. But there are overlapping tasks they share.
When it comes to fashion, customers have a hard time saying what they like. They’d rather show what they like. The vast majority of customers pin examples of styles the like on Pinterest that we can process in two ways. Machines can vectorize the pinned image and compare it to our inventory using convolutional neural nets (AlexNet) to find items that are similar, with similarity measured as Euclidean distance. Once we have this short list of items, we pass this to humans who process ambient information, recognize if the recommended items are too similar to the pinned image, or even judge whether the images are more aspirational than literal and modify suggestions accordingly. Take something like a leopard print dress. Machines are very pedantic: they can distinguish leopard print from cheetah print, but don’t have the social sense to know that a woman who likes leopard print would very likely also like cheetah print.
It can take a long time and a lot of data to train a convolutional neural net. Does your team have any creative techniques to speed up the process?
We’ve experimented with generative adversarial networks for our image processing. You use one “discriminative” neural net to break down images and find the elements of style, and then a generative net to put the building blocks back together to reproduce the images. It’s pretty quick to train another machine to then distinguish between real and machine-generated images.
Visualization of samples from a generative adversarial net in a 2014 article by Goodfellow et al.
What about improving the performance of the human stylists?
Humans are classic overfitters. As Kahneman explains, we don’t naturally think statistically, but tend to build narratives around the details most prominent in our consciousness. Say a stylist receives a tall customer, suggests some item she thinks would be good for tall people, and receives positive feedback from the customer. She’ll then reinforce the association between tallness and the product feature from this one win, often overlooking the hundreds of other instances where this didn’t work. Statistics are therefore a wonderful (and necessary) support to mitigate associative fallacies. The other challenge with our system is that there’s often a long delay between a stylist’s decision to pick a certain item and feedback from the customer on whether they like it (or it’s the right size, fit, etc.). Humans work best with instantaneous, unambiguous feedback: they forget why they did something by the time we have input from customers. So we built a parallel, “labs” version of the interface to give stylists realtime feedback. We use historical shipment data to simulate how a customer would react to give this quick feedback.
What percent of your recommendations are informed by collaborative filtering?
It’s true that matrix factorization is the most intuitive place to start for product recommendations, but they tend to fall short in the fashion industry because trends change so fast. The stuff we have now will be gone next month, so we have really sparse matrices. Most clients are getting boxes only on a monthly - or even less frequent - basis. This means we only get feedback on about 5 items per month per client on average. As such, it’s better to use other features to inform our recommendations, and our techniques run the gambit from neural networks to mixed effects models. That said, there are customers who get more items more frequently and where we have enough information to compare them with others. In those cases, collaborative filtering works.
How do you organize your team to support this diversity of algorithms and workflows?
Five groups report into me, and they are designed for autonomy. They have as few people as possible to work with and have to build their own pipelines: they do the math, learn the algorithms, communicate with the business, manage projects. I think this is better because it’s faster (there’s no coordination cost so you can iterate faster), attracts top talent, and supports long-term systems as data scientists understand things deeply and can make gradual tweaks to code. Three teams – merchandise algorithms, styling algorithms, and client algorithms – align with current business functions. The Merchandise Algorithms team works with merchants on procurement. The Styling Algorithms team works with stylists - it’s the recommendation workflow. The Client Algorithms team works with marketing on acquisition and retention: are clients happy? What’s the right action to take with them? Next, the data platform team is responsible for the infrastructure required to make the different algorithms run. We don’t have data engineers writing ETL on purpose. The final team is the labs team, which handles experimental stuff that hasn’t yet made it into the product (they are exploring the use of NLP and computer vision algorithms). Labs also manages our tech branding, like the Multithreaded blog.
What’s next for Stitch Fix?
One exciting piece of news is that we now have five proprietary, data-driven clothing lines. The items are like Frankenstyles, combining attributes (sleeve length, flair, etc…) that work well for a particular demographic but may not yet coexist in a single piece. We refine the items based on feedback, tweaking until it works. These will be shipped soon!
Colson’s talk at FirstMark’s Data Driven NYC.
More from the Blog
May 23 2016
by — We’re excited for tomorrow’s online discussion about automatic text summarization! You can register here. During the event, Mohamed AlTantawy, CTO of Agolo, will explain the tech behind their summarization tool. In this blog post, he describes how his team evaluates the quality of summaries produced by their system. Join us tomorrow to learn more and ask questions about the approach! Evalua...
May 27 2016
This week, we had the pleasure of hosting an online talk with the team from Agolo, an NYC-based startup with a great product that automatically summarizes articles (and can even generate new titles for a group of summarized articles). During the event, we discussed: why summarization has suddenly become interestinghow different industries are applying the technologyhow algorithms work (LDA, wo...
May 15 2017
by with — Twitter users can retweet, like or reply to a tweet. If a tweet from a prominent account gets more replies than retweets then that’s usually Not Good. Take, for example, this recent tweet from United Airlines. It got three times more replies than retweets. Those replies are not filled with praise, and that ratio was a sign to United’s PR team that they were in trouble, in case it wasn’t alread...