Blog

Nov 21, 2019 · newsletter

Privacy-Preserving Machine Learning: A Primer

I have a bit of a cybersecurity background, so a year or two ago I started paying attention to how often data breaches happen - and I noticed something depressing: they happen literally 3+ times each day on average in the United States alone! (For a particularly sobering review check out this blog post.) All that information, your information, is out there, floating through the ether of the internet, along with much of my information: email addresses, old passwords, and phone numbers.

Governments, countries, and individuals - with good reason - are taking notice and making changes. The GDPR was groundbreaking in its comprehension, but hardly alone in its goal to shape policy surrounding individual data rights. More policies will pass as the notions of individual privacy and data rights are explored, defined, and refined.

So what does this mean for machine learning?

Machine learning algorithms are known to be data-hungry, with more sophisticated ML models requiring ever more data in order to learn their objectives (caveat: your ML model may be learning some unintentional things, too!). With the need for so much data, what happens to the privacy of data as it is processed through an ML workflow? Can data stay private? If so: how, and at what cost? This article is by no means comprehensive or exhaustive. It is meant to give the reader an introduction to privacy-preserving techniques developed specifically for ML environments. In future articles, I’ll dive deeper into each of these security topics.

But before we can understand security and privacy solutions, we have to understand some of the threats to data privacy imposed in machine learning workflows.

Threats to Data Privacy from Machine Learning

While there are certainly more threat vectors to ML workflows than we’ll cover here, for now we’ll focus on those that specifically adversely affect the privacy of the underlying data used to train an ML model. That means we won’t cover other types of attacks (like adversarial attacks or data poisoning) as these pertain more to the security of the model itself, rather than to the underlying data on which the model was trained.

There are three general steps when developing a machine learning application: extracting features from raw data, training a model on the extracted features, and publishing the trained model in order to gain new insights (wouldn’t be much good if the trained model sat unused on a dusty server somewhere!). Privacy concerns lurk at each step in the process. Here we’ll cover three of the most common attacks on data privacy in machine learning.

whattttt Threats inherent within the machine learning development cycle. Source: Privacy-Preserving Machine Learning: Threats and Solutions

Reconstruction Attacks

A reconstruction attack is when an adversary gains access to the feature vectors that were used to train the ML model and uses them to reconstruct the raw private data (i.e., your geographical location, age, sex, or email address). For this attack to work, the adversary must have white-box access to the ML model; that is, the internal state of the system can be easily understood or ascertained. Consider, for example, that some ML models (SVMs, KNNs) store the feature vectors as part of the model itself. If an attacker were to gain access to the model they would also have access to the feature vectors, and could very likely reconstruct the raw private data. Yikes!

Model Inversion Attacks

A model inversion attack is one step removed from the above. In this attack, the adversary only has access to the results produced from the trained ML model, not the feature vectors. However, many ML models report not only an “answer” (label, prediction, etc.) but a confidence score (probability, SVM decision value, etc.). For example, an image classifier might categorize an image as containing a panda with a confidence of .98. This additional piece of information - the confidence - opens the door for a model inversion attack. The adversary can now feed many examples into your trained model, record the output label and confidence, and use these to glean what feature vectors the original model was trained on in order to produce those results.

Membership Inference Attacks

This is a particularly pernicious attack. Even if you keep the original data and the feature vectors completely under lock and key (e.g., encrypted), and only report predictions without confidence scores, you’re still not 100% safe from a membership inference attack. In this attack, the adversary has a dataset and seeks to learn which of the samples might have been used to train the target ML model by comparing its output with the label associated with that sample. This type of attack is illustrated in the dashed box in the figure above. Successful attempts allow the attacker to learn which individual records were used to train the model, and thus identify an individual’s private data.

Anonymization is a privacy technique that sanitizes data by removing all personally identifying information before releasing the dataset for public use. Unfortunately, this technique also suffers from membership inference attacks as was demonstrated in the now-infamous Netflix case. The attack is the same as in the ML case: in that instance, researchers used the open-source internet movie database (IMBD) to infer the identities of the anonymized Netflix users.

Privacy-Preserving Solutions

So what are responsible machine learning enthusiasts and practitioners to do? Give up and go home? Obviously not. It’s time to present some solutions! While some of these ideas aren’t particularly new (for example, differential privacy has been around for at least 13 years), developing practical ML applications with these privacy techniques is still a nascent area of research. And while there is not yet a one-stop-shop for ML privacy on the market, this is definitely a space to follow as developments come quickly.

Secret sharing is a method of distributing a secret without exposing it to anyone else. At its most basic, this scheme works by assigning a “share” of the secret to each party. Only by pooling their shares together can they reveal the secret. This idea has been developed into a framework of protocols that make secure multi-party computation (SMPC) possible. In this paradigm, the “secret” is an entity’s private data which is typically secured cryptographically. SMPC then allows multiple entities (banks, hospitals, individuals) to pool their encrypted data together for training an ML model without exposing their data to other members of the party. This type of protection strongly protects against reconstruction attacks.

source

Differential Privacy

To thwart membership inference attacks, differential privacy (DP) was born. DP provides mathematical privacy guarantees and works by adding a bit of noise to the input data in order to mask whether or not a specific individual participated in the database. Specifically, the output will be statistically identical (within a very small tolerance) whether or not an individual’s data was present in the computation. However, the system has limits: every time the database is questioned (output is computed), a bit more privacy is lost. If the adversary is allowed to ask infinitely many questions of the database they could eventually successfully execute a membership inference attack. Thus, setting the “privacy budget” - or how much total information leakage is allowed - is crucial for safeguarding the data.

Homomorphic Encryption

Standard encryption techniques effectively make data completely useless. You can’t glean any information from it; you can’t compute on it; you can’t learn anything from it. That’s the point! It’s private, unless you know the encryption key. But wouldn’t it be great if there was a type of encryption that could keep data private but still allow basic computation without access to the secret key? Turns out, this exists and it’s called homomorphic encryption (HE). This type of encryption allows very simplistic computations to be performed on the encrypted data (addition, multiplication). When combined into more complex series of computations, simple ML models can be built and trained. There are drawbacks, of course - HE is computationally expensive.

Hardware-based Approaches

The methods discussed so far have all relied on algorithms and software to achieve data privacy, but that’s not the only approach. Security for code, data, and access can also be achieved through hardware such as Intel’s SGX processor. Essentially, its a standard CPU with beefed-up security instructions that allow developers to define private regions of memory, called enclaves, in the computer. These memory regions are internally encrypted and, as such, resist modification, reading, or tampering from external users or any processes existing outside the enclave.

Closing Thoughts

There’s a big world of privacy-preserving machine learning out there and this article just scratches the surface. The important take-away is that data privacy is a big deal - and getting bigger each year - as more and more high profile data breaches occur. It’s important to make thoughtful choices about how your ML applications will put privacy into practice. Different ML workflows will require different privacy paradigms, so thinking about this before jumping in is critical. The big questions you should ask of your ML application:

Who holds the data and what level of security is appropriate?
Who computes on the data (trains the ML model) and should they have access to the data?
Who hosts the trained ML model and does it require additional security to prevent inadvertent information leakage?

The answers to these questions will hopefully point you in the direction of one (or more!) of the privacy-preserving solutions presented above.

Here’s to a more secure world!

Newer

Dec 24, 2019 · newsletter

Squote - semantic quote search

Older

Oct 30, 2019 · newsletter

Fuzzy People

Latest posts

Nov 15, 2022 · newsletter

CFFL November Newsletter

November 2022 Perhaps November conjures thoughts of holiday feasts and festivities, but for us, it’s the perfect time to chew the fat about machine learning! Make room on your plate for a peek behind the scenes into our current research on harnessing synthetic image generation to improve classification tasks. And, as usual, we reflect on our favorite reads of the month. New Research! In the first half of this year, we focused on natural language processing with our Text Style Transfer blog series.

Nov 14, 2022 · post

Implementing CycleGAN

by Michael Gallaspy · Introduction This post documents the first part of a research effort to quantify the impact of synthetic data augmentation in training a deep learning model for detecting manufacturing defects on steel surfaces. We chose to generate synthetic data using CycleGAN,1 an architecture involving several networks that jointly learn a mapping between two image domains from unpaired examples (I’ll elaborate below). Research from recent years has demonstrated improvement on tasks like defect detection2 and image segmentation3 by augmenting real image data sets with synthetic data, since deep learning algorithms require massive amounts of data, and data collection can easily become a bottleneck.

Oct 20, 2022 · newsletter

CFFL October Newsletter

October 2022 We’ve got another action-packed newsletter for October! Highlights this month include the re-release of a classic CFFL research report, an example-heavy tutorial on Dask for distributed ML, and our picks for the best reads of the month. Open Data Science Conference Cloudera Fast Forward Labs will be at ODSC West near San Fransisco on November 1st-3rd, 2022! If you’ll be in the Bay Area, don’t miss Andrew and Melanie who will be presenting our recent research on Neutralizing Subjectivity Bias with HuggingFace Transformers.

Sep 21, 2022 · newsletter

CFFL September Newsletter

September 2022 Welcome to the September edition of the Cloudera Fast Forward Labs newsletter. This month we’re talking about ethics and we have all kinds of goodies to share including the final installment of our Text Style Transfer series and a couple of offerings from our newest research engineer. Throw in some choice must-reads and an ASR demo, and you’ve got yourself an action-packed newsletter! New Research! Ethical Considerations When Designing an NLG System In the final post of our blog series on Text Style Transfer, we discuss some ethical considerations when working with natural language generation systems, and describe the design of our prototype application: Exploring Intelligent Writing Assistance.

Sep 8, 2022 · post

Thought experiment: Human-centric machine learning for comic book creation

by Michael Gallaspy · This post has a companion piece: Ethics Sheet for AI-assisted Comic Book Art Generation I want to make a comic book. Actually, I want to make tools for making comic books. See, the problem is, I can’t draw too good. I mean, I’m working on it. Check out these self portraits drawn 6 months apart: Left: “Sad Face”. February 2022. Right: “Eyyyy”. August 2022. But I have a long way to go until my illustrations would be considered professional quality, notwithstanding the time it would take me to develop the many other skills needed for making comic books.

Aug 18, 2022 · newsletter

CFFL August Newsletter

August 2022 Welcome to the August edition of the Cloudera Fast Forward Labs newsletter. This month we’re thrilled to introduce a new member of the FFL team, share TWO new applied machine learning prototypes we’ve built, and, as always, offer up some intriguing reads. New Research Engineer! If you’re a regular reader of our newsletter, you likely noticed that we’ve been searching for new research engineers to join the Cloudera Fast Forward Labs team.

Reports

In-depth guides to specific machine learning capabilities

FF24

Text Style Transfer

The NLP task of text style transfer (TST) aims to automatically control the style attributes of a piece of text while preserving the content, which is an important consideration for making NLP more user-centric. In this report, we explore text style transfer through an applied use case — neutralizing subjectivity bias in free text. Along the way, we describe our sequence-to-sequence modeling approach leveraging HuggingFace Transformers, and present a set of custom, reference-free evaluation metrics for quantifying model performance. Finally, we conclude with a discussion of ethics centered around our prototype: Exploring Intelligent Writing Assistance.

Read the report →

FF22

Inferring Concept Drift Without Labeled Data

Concept drift occurs when the statistical properties of a target domain change overtime causing model performance to degrade. Drift detection is generally achieved by monitoring a performance metric of interest and triggering a retraining pipeline when that metric falls below some designated threshold. However, this approach assumes ample labeled data is available at prediction time - an unrealistic constraint for many production systems. In this report, we explore various approaches for dealing with concept drift when labeled data is not readily accessible.

Read the report →

FF19

Session-based Recommender Systems

Being able to recommend an item of interest to a user (based on their past preferences) is a highly relevant problem in practice. A key trend over the past few years has been session-based recommendation algorithms that provide recommendations solely based on a user’s interactions in an ongoing session, and which do not require the existence of user profiles or their entire historical preferences. This report explores a simple, yet powerful, NLP-based approach (word2vec) to recommend a next item to a user. While NLP-based approaches are generally employed for linguistic tasks, here we exploit them to learn the structure induced by a user’s behavior or an item’s nature.

Read the report →

FF18

Few-Shot Text Classification

Text classification can be used for sentiment analysis, topic assignment, document identification, article recommendation, and more. While dozens of techniques now exist for this fundamental task, many of them require massive amounts of labeled data in order to be useful. Collecting annotations for your use case is typically one of the most costly parts of any machine learning application. In this report, we explore how latent text embeddings can be used with few (or even zero) training examples and provide insights into best practices for implementing this method.

Read the report →

Prototypes

Machine learning prototypes and interactive notebooks

Notebook

ASR with Whisper

Explore the capabilities of OpenAI's Whisper for automatic speech recognition by creating your own voice recordings!

https://colab.research.google.com/github/fastforwardlabs/whisper-openai/blob/master/WhisperDemo.ipynb

Library

NeuralQA

A usable library for question answering on large datasets.

https://neuralqa.fastforwardlabs.com

Notebook

Explain BERT for Question Answering Models

Tensorflow 2.0 notebook to explain and visualize a HuggingFace BERT for Question Answering model.

https://colab.research.google.com/drive/1tTiOgJ7xvy3sjfiFC9OozbjAX1ho8WN9?usp=sharing

Notebooks

NLP for Question Answering

Ongoing posts and code documenting the process of building a question answering model.

https://qa.fastforwardlabs.com

Cloudera Fast Forward Labs

Making the recently possible useful.

Cloudera Fast Forward Labs is an applied machine learning research group. Our mission is to empower enterprise data science practitioners to apply emergent academic research to production machine learning use cases in practical and socially responsible ways, while also driving innovation through the Cloudera ecosystem. Our team brings thoughtful, creative, and diverse perspectives to deeply researched work. In this way, we strive to help organizations make the most of their ML investment as well as educate and inspire the broader machine learning and data science community.

Cloudera Blog Twitter

Nov 21, 2019 · newsletter

Privacy-Preserving Machine Learning: A Primer

Threats to Data Privacy from Machine Learning

Reconstruction Attacks

Model Inversion Attacks

Membership Inference Attacks

Privacy-Preserving Solutions

Secret Sharing and Secure Multi-Party Computation

Differential Privacy

Homomorphic Encryption

Hardware-based Approaches

Closing Thoughts

Read more

Dec 24, 2019 · newsletter

Oct 30, 2019 · newsletter

Latest posts

Nov 15, 2022 · newsletter

CFFL November Newsletter

Nov 14, 2022 · post

Implementing CycleGAN

Oct 20, 2022 · newsletter

CFFL October Newsletter

Sep 21, 2022 · newsletter

CFFL September Newsletter

Sep 8, 2022 · post

Thought experiment: Human-centric machine learning for comic book creation

Aug 18, 2022 · newsletter

CFFL August Newsletter

Popular posts

Oct 30, 2019 · newsletter

Nov 14, 2018 · post

Apr 10, 2018 · post

Oct 4, 2017 · post

Aug 22, 2016 · whitepaper

Feb 24, 2016 · post

Reports

FF24

FF22

FF19

FF18

Prototypes

Notebook

Library

Notebook

Notebooks

Cloudera Fast Forward Labs