We’ve got another action-packed newsletter for October! Highlights this month include the re-release of a classic CFFL research report, an example-heavy tutorial on Dask for distributed ML, and our picks for the best reads of the month.
Cloudera Fast Forward Labs will be at ODSC West near San Fransisco on November 1st-3rd, 2022! If you’ll be in the Bay Area, don’t miss Andrew and Melanie who will be presenting our recent research on Neutralizing Subjectivity Bias with HuggingFace Transformers. And be sure to stop by the Cloudera booth to say hello to the whole team!
If you’ve been following along with our newsletter, you may recall the blog series we presented earlier this year that explored in depth the natural language processing task of Text Style Transfer. We’re pleased as punch to share that we’ve repackaged those blogs into a one-stop-shop Research Report, updated with new figures and diagrams that make it even easier and more convenient to discover everything you want to know about this fascinating NLP capability!
Next up — check out this classic FFL report that we’ve released from The Vault! At first blush, the content here may seem to be geared more toward Computer Science than Data Science; however, as data get bigger and bigger, the line between these two disciplines tends to blur. Originally written in 2015, this report was ahead of its time as use cases and technologies for streaming applications in data science and machine learning have exploded in recent years. But the algorithms discussed in this report have stood the test of time and are still in use today!
The amount of time, memory, and computation required to do even simple operations on very large data sets can be extraordinary. In this report, we explore probabilistic methods for realtime stream analysis. These methods allow for quick and efficient calculations that approximate the results that form a batch analysis. In particular, this report examines the use cases, data structures, and algorithms for handling streaming data.
A probabilistic approach to algorithms accepts that simply adding continuously more computational power to a system is not realistic or scalable, eventually yielding diminishing returns. Instead, we must find a way to make our algorithms better. In many cases, this requires one to rephrase the questions we ask of our data and accept approximations. In exchange for this compromise, we can answer incredibly complex questions while using only a fraction of the resources we would have otherwise needed.
We teamed up again with our good buddy, Jake Bengston (Data Scientist turned Marketing Guru) to develop a quick tutorial on how to use Dask on CML in memory-constrained computing environments.
Tell us if this sounds familiar: You’ve found an awesome data set that you think will allow you to train a machine learning model that will accomplish your project’s goals; the only problem is the data is too big to fit in your compute environment. In the day and age of “big data,” most might think this issue is trivial, but like anything in the world of data science, things are hardly ever as straightforward as they seem. In this tutorial, you’ll learn how to distribute your ML workload in Cloudera Machine Learning when your data is too big or your workload is too complex to run on a single machine.
Check out replays of livestreams covering some of our research from last year.
CFFLers share their favorite reads of the month.
The authors in this study interview 19 machine learning engineers across a variety of industries and consolidate common practices associated with successful ML experimentation, deployment, and maintenance. In doing so, they also surface shared pain points that hint at the overall state of production ML and may influence future tool & product design. Some key takeaways include:
While this is a dense paper, I found it extremely helpful in gaining perspective in how nascent production ML actually is today — an insight that is non-obvious given the deluge of algorithmic advances cropping up every day in the ML research world. – Andrew
I’m really interested in the use of synthetic data for computer vision problems. 3D graphics systems already achieve impressively realistic results for natural scenes, and are constantly improving. The above paper demonstrates training a segmentation model on realistic renderings of everyday objects (in this case, food) that exhibit improved performance on real world datasets after fine-tuning, compared to a model trained on real word data alone. Given that data acquisition and annotation is often prohibitively difficult and expensive, I’m really excited to see techniques like this being investigated. — Mike
This month Jay Alammar provides the next installment of his “The Illustrated…” series, this time on the fascinating text-to-image capabilities of the Stable Diffusion model that is now openly accessible. If you’re curious how models like OpenAI’s DALL-E, Google’s Imagen, or Stable Diffusion work under the hood but don’t know where to start, then this is a must-read. Jay provides a solid foundation of the crucial concepts inherent to these models and helps the reader build a beginning intuition through clear explanations and great visualizations. — Melanie