
Jun 25, 2017 · post

Fingerprinting documents with steganography

Steganography is the practice of hiding messages anywhere they’re not expected. In a well-executed piece of steganography, anyone who is not the intended recipient can look at the message and not realize it’s there at all. In a recent headline-making story, The Intercept inadvertently outed their source by publishing a document with an embedded steganographic message that allowed the NSA to identify the person who printed it.

These days, information is often hidden in digital media like images and audio files, where flipping a few bits doesn’t change the file to the human eye (or ear). Before computers came along, though, there were plenty of messages creatively hidden in art, furniture, etc. There’s speculation that women in the U.S. used to hide messages in their quilt work as a way to help escaped slaves find friendly homes. Neal Stephenson riffs on this theme in his Quicksilver Trilogy by having Eliza embed a binary code in her cross-stitching to smuggle information out of the court of Louis XIV.

Hiding messages in text has always been especially challenging. There’s not much room to make changes without fundamentally altering the meaning of the original document, which in turn makes it obvious that something is amiss. If someone other than the intended recipient of the information realizes that there’s a message present at all, the steganography has, in some sense, failed.

What problem are we trying to solve?

In this post, I’ll talk about fingerprinting documents using text-based steganography. The problem we’re trying to solve is as follows. We have a sensitive document that must be distributed to some number of readers. Let’s say, for example, that Grandpa has decided to share his famous cookie recipe with each of his grandchildren. But it’s super important to him that the recipe stays in the family! So they’re not allowed to share it with anyone else. If Grandpa finds pieces of his cookie recipe online later, he wants to know which grandchild broke the family trust.

To address this problem, he assigns each of his grandchildren an ID, which is just a string of zeros and ones. Before he gives out the recipe, he identifies a number of ‘branchpoints’ in the text. These are places where he can make a change without altering the grandchild’s experience of the recipe, or alerting them that something is amiss. One such branchpoint might be spelling out the numbers in the recipe - “ten” instead of “10”. Another might be using imperial units instead of metric. This type of method is called a canary trap.

For each grandchild, he goes through the branchpoints one at a time. If the grandchild’s ID has a zero at some position, he does not make a change at the corresponding branchpoint. If it is a one, he makes the change.

Now, by looking at which changes were made in the leaked cookie recipe, he should be able to identify which grandchild was the source of the leak.
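Grandpa’s procedure is simple enough to sketch in a few lines of code. This is an illustrative toy, not the actual scheme: the branchpoint pairs and recipe text here are made up for the example.

```python
# Each branchpoint is an (original, variant) pair; a reader's
# binary ID decides which form appears in their copy.
BRANCHPOINTS = [
    ("10", "ten"),                        # spell out the number
    ("350 degrees F", "175 degrees C"),   # imperial vs. metric
    ("1 cup", "one cup"),                 # spell out the quantity
]

def fingerprint(text, reader_id):
    """Apply the i-th change iff bit i of reader_id is '1'."""
    for bit, (original, variant) in zip(reader_id, BRANCHPOINTS):
        if bit == "1":
            text = text.replace(original, variant, 1)
    return text

recipe = "Bake at 350 degrees F for 10 minutes with 1 cup of sugar."
print(fingerprint(recipe, "101"))
# Bake at 350 degrees F for ten minutes with one cup of sugar.
```

Each grandchild’s copy is then a unique, readable variant of the same recipe, and the set of changes present in a leak reads back as their ID.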

How does he find all the branchpoints he can use to effectively fingerprint the recipe?

Before we can answer that question, we’ll have to take a slight detour into the world of character encoding.

Digital character encoding

Computers think in binary, so when they save any symbol you might consider to be text, what they’re actually saving is some string of zeros and ones. The map that converts between binary and symbols is called a character encoding.

For a long time, the dominant character encoding was ASCII, which can only encode 128 characters. These include upper- and lowercase English letters, numbers, and some punctuation.

A couple of decades ago, some folks got together and decided this wasn’t good enough, not least because people who don’t speak English should be able to use computers. They developed a specification called Unicode that now includes over 120,000 different characters and has the capacity to expand to over one million.
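The difference is easy to see from Python, which can render the same characters under either standard:

```python
# ASCII is a 7-bit encoding; UTF-8 (the dominant Unicode encoding)
# leaves ASCII text byte-identical and uses extra bytes for
# higher code points.
print("A".encode("utf-8"))        # b'A', one byte, same as ASCII
print("\u0391".encode("utf-8"))   # Greek capital Alpha: b'\xce\x91'
print("\u200b".encode("utf-8"))   # zero-width space: b'\xe2\x80\x8b'

# A non-ASCII character simply cannot be encoded as ASCII:
try:
    "\u0391".encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")
```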

Fortunately for us, there’s more room for hiding information these days than there used to be. We’ll see how we can take advantage of all those extra characters to find branchpoints in any document.

Identifying branchpoints

Some Unicode characters are more obviously useful than others. Take, for instance, the zero-width space. It has some semantic significance - it tells whatever is rendering the text that it’s okay to put a line break somewhere, even if there’s no other whitespace character. For example, it will sometimes be used after a slash - it’s okay to start a new line after a slash, but if you don’t, there shouldn’t be a visible space.

So what happens if you put one of those zero-width spaces right in front of a normal, everyday space? Absolutely nothing. It conveys no extra information, and doesn’t visibly change the text document at all. In fact, there’s a zero-width space in front of every space in this paragraph. Bet you couldn’t tell.

This means we can already treat every normal single space as a branchpoint, where we can choose whether or not to place a zero-width space in front of it. Depending on how much information you’re trying to encode, this may or may not be a good idea.
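As a sketch of what that branchpoint looks like in practice, here is a toy encoder and decoder that hides one bit per ordinary space (the function names are our own, not part of any library):

```python
ZWSP = "\u200b"  # zero-width space

def encode_zwsp(text, bits):
    """Hide one bit per ordinary space: a '1' bit places a
    zero-width space immediately before that space."""
    out, i = [], 0
    for ch in text:
        if ch == " " and i < len(bits):
            if bits[i] == "1":
                out.append(ZWSP)
            i += 1
        out.append(ch)
    return "".join(out)

def decode_zwsp(text):
    """Read the bit at each space back out."""
    bits = []
    for j, ch in enumerate(text):
        if ch == " ":
            bits.append("1" if j > 0 and text[j - 1] == ZWSP else "0")
    return "".join(bits)

hidden = encode_zwsp("the quick brown fox", "101")
print(decode_zwsp(hidden))                       # 101
print(len(hidden) - len("the quick brown fox"))  # 2 invisible extra characters
```

The encoded string renders identically to the original, but carries one recoverable bit for every space.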

There are a number of other non-displaying characters that we could use in a similar way, but let’s move on to characters we can actually see.

When you have 120,000 characters, some of them are bound to look the same. Here’s an English character A, and here’s a Greek character Α. See the difference?

Similar characters like these, called ‘confusables’, are recognized as being dangerous enough that all modern browsers offer some protection against letting you visit spoofed URLs. Think you’re going to www.yahoo.com (all English characters)? Well, you may end up at ԝԝԝ.𝐲𝖺𝗵օօ.сօⅿ (no English characters) if you’re not careful.

Here’s a great Unicode resource for identifying confusables.

Used judiciously, there are plenty of confusables that are, well, suitably confusing. Here are a few rules of thumb: simpler letters are more easily confused. For example, generally l-shaped things look more like each other than g-shaped things. Standalone, one-letter words are harder to spot because they are separated from their neighbors by spaces, and so you don’t automatically visually juxtapose them with other characters. And, finally, how convincing your confusables are will depend to some degree on the font. Some typefaces may magnify the differences between confusables, while others will render confusables as more similar to each other. Ultimately, you don’t want to change your readers’ experience of the text in any way, so it’s good to be careful with these.
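A toy confusable-based branchpoint might look like the following. The mapping table is a tiny, hand-picked illustration; real lists of confusables are far larger.

```python
# Each entry maps an ASCII letter to a visually similar Unicode
# code point (a "confusable").
CONFUSABLES = {
    "A": "\u0391",  # Greek capital Alpha
    "o": "\u043e",  # Cyrillic small o
    "e": "\u0435",  # Cyrillic small e
}

def swap_confusable(text, target, bit):
    """Branchpoint: replace the first occurrence of `target`
    with its confusable iff bit == '1'."""
    if bit == "1" and target in CONFUSABLES:
        return text.replace(target, CONFUSABLES[target], 1)
    return text

marked = swap_confusable("Bake one cookie", "o", "1")
print(marked)                        # looks like "Bake one cookie"
print(marked == "Bake one cookie")   # False: the first 'o' is Cyrillic
```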

But using funny characters in Unicode is sometimes dangerous. In particular, if an unintended recipient of the message copies the text into an ASCII-only editor, it won’t know what to make of those crazy Unicode characters and they’ll probably just show up as ????????, which is a pretty good hint to the interloper that something strange is going on.

In the ASCII-only world, your options are much more limited. In general, though, any time you make a stylistic decision that could go either way, you can consider that to be a branchpoint. For example, do you use single quotes or double quotes? Do you spell out numbers, or do you use the numeric representations? If you want to be consistent throughout your document, each of these decisions will only get you one bit of hidden information. Because you have fewer options, you’ll have to get more creative.

For example, we put five branchpoints in the following to produce a 5-bit message:

  • Ralphie set his secret decoder ring to “B” and “twelve” to decode the message. It said, “Be sure to drink your Ovaltine”. (00000)
  • Ralphie set his secret decoder ring to ‘B’ and ‘twelve’ to decode the message. It said, “Be sure to drink your Ovaltine”. (10000)
  • Ralphie set his secret decoder ring to “B” and “12” to decode the message. It said, “Be sure to drink your Ovaltine”. (01000)
  • Ralphie set his secret decoder ring to “B” and “twelve” to decode the message. It said “Be sure to drink your Ovaltine”. (00100)
  • Ralphie set his secret decoder ring to “B” and “twelve” to decode the message. It said, ‘Be sure to drink your Ovaltine’. (00010)
  • Ralphie set his secret decoder ring to “B” and “twelve” to decode the message. It said, “be sure to drink your Ovaltine”. (00001)
  • Ralphie set his secret decoder ring to ‘B’ and ‘12’ to decode the message. It said ‘be sure to drink your Ovaltine’. (11111)
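Decoding runs the same branchpoint list in reverse: check which form of each branchpoint appears in the leaked copy. Here is a sketch, using a hypothetical branchpoint list for a Ralphie-style sentence (straight quotes, purely for illustration):

```python
# Hypothetical branchpoints, as (unchanged, changed) substring pairs.
BRANCHPOINTS = [
    ('"B"', "'B'"),     # quote style around the ring settings
    ("twelve", "12"),   # spelled-out vs. numeric
    ("said,", "said"),  # comma after "said"
]

def read_bits(leaked):
    """Recover one bit per branchpoint from a leaked copy."""
    bits = []
    for original, variant in BRANCHPOINTS:
        if original in leaked:
            bits.append("0")
        elif variant in leaked:
            bits.append("1")
        else:
            bits.append("?")  # snippet didn't cover this branchpoint
    return "".join(bits)

leaked = ("Ralphie set his secret decoder ring to 'B' and 'twelve' "
          'to decode the message. It said, "Be sure to drink your Ovaltine".')
print(read_bits(leaked))  # 100
```

Note that a partial leak still yields partial information: branchpoints that fall outside the leaked snippet simply come back as unknown.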

Introducing: Steganos

In order to play around with these concepts, we created a tool called steganos. Steganos is packaged with a small library of branchpoints (pull requests for new branchpoints are welcome!) and has the ability to: calculate the number of encodable bits, encode/decode bits into text, and do a partial recovery of bits from text snippets. All this is possible by tracking the original unadulterated text as well as which branchpoints were available to steganos when the message was encoded.

As an example, using the current version of steganos, we can encode 1756 bits into this text. If we are using this for user identification and expect to always see leaks of the full document, that means we can track 2^1756 ≈ 10^529 users (i.e., vastly more than the number of people who have ever existed).

import steganos

message = '101'
original_text = '"Wow!" they said.\n\t"This tool is really #1"'

capacity = steganos.bit_capacity(original_text) # == 10
encoded_text = steganos.encode(message, original_text)

recovered_bits = steganos.decode_full_text(encoded_text, original_text,
                                           message_bits=3)
# recovered_bits == '101'

partial_text = encoded_text[:8]  # only use 18% of the text
recovered_bits = steganos.decode_partial_text(partial_text, original_text,
                                              message_bits=3)
# recovered_bits == '1?1'

As an example, below is the opening to Star Wars with and without a message hidden inside of it. Do you know which is the original?

It​ is a period of civil​ war. Rebel​ spaceships, striking​ from​ a hidden base, have​ won their first​ victory​ against the evil​ Galactic​ Empire‏‎.
D⁠uring the battle, Rebel spies​ managed to​ steal​ secret plans​ to the​ E⁠mpire’s ultimate​ weapon, the D⁠EA⁠TH ST⁠A⁠R, an​ armored space station with enough​ power​ to destroy an​ entire planet‏‎.
Pursued​ by the​ Empire’s sinister​ agents, P⁠rincess L⁠eia​ races​ home aboard​ her starship, custodian of the​ stolen​ plans that can save her​ people and​ restore​ freedom​ to the galaxy‏‎...‏‎.
It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire.
During the battle, Rebel spies managed to steal secret plans to the Empire’s ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet.
Pursued by the Empire’s sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy....

Conclusion

Here we’ve seen a number of tricks we can use to fingerprint each individual copy of a document, without changing the reader’s experience or alerting them that they have a uniquely identifiable copy. There are a few practical considerations you’ll have to address if you go down this route - like how you identify the user from partial documents, or how you systematically mark pieces of text that cannot be changed without breaking the document (e.g. URLs) - but these are mostly logistical issues.

Fingerprinting documents in this way can be a powerful tool in finding out who breached a confidentiality agreement. On the flip side, it can also be used to track people’s behavior in ways they haven’t agreed to, which is something to be cautious of. There’s a little too much of that going on on the internet as it is.

Do you have ideas for other cool branchpoints? Let us know!

Noam and Micha

Thanks to Manny for his great edits!

PS: If you want to make sure you aren’t being tracked this way, simply make sure you only copy the ASCII-transliterated version of the text! In some systems, this is done by selecting the “Copy as Plain Text” option.
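For the curious, a rough version of that ASCII fallback is easy to write yourself. This is our own sketch, not what any particular system does:

```python
import unicodedata

def to_plain_ascii(text):
    """Strip invisible characters and fold what look-alikes it can
    back to ASCII. NFKD decomposition handles accents and some
    compatibility characters; anything that still won't fit in
    ASCII (e.g. Cyrillic confusables) is simply dropped."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

tracked = "B\u200be sure to drink your Ovaltine"  # hidden zero-width space
print(to_plain_ascii(tracked))  # Be sure to drink your Ovaltine
```

Dropping unconvertible characters is blunt, but even that helps: a confusable that silently vanishes from the pasted text is at least no longer tracking you.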
