We’re excited to introduce the latest report and prototype from our machine intelligence R&D group! In this iteration, we explore summarization, or neural network techniques for making unstructured text data computable.
Making language computable has been a goal of computer science research for decades. Historically, it has been a challenge merely to collect and store data. But it’s now so cheap to store data that we often have the opposite problem: once we have data, how should we analyze it to find meaning and insights?
Many organizations have made good headway processing structured, transactional data for Business Intelligence, but few have extended analytics to extract insights from the emails, news articles, reports, legal documents, and other troves of written documents that make up the lifeblood of their business.
But we’re beginning to gain the ability to do remarkable things with unstructured text. Businesses that adopt this technology will see significant advantages. They will find important information faster. They will expand the horizons of how and what they read, gleaning actionable insights from document corpuses too large for humans to process.
Our work addresses multi- and single-document summarization, illustrating the best technical approaches with two prototypes. Our multi-document prototype uses Latent Dirichlet Allocation to map topics and collect key points of view across thousands of Amazon product reviews.
Our single-document prototype, Brief, uses skip-thoughts and recurrent neural networks to extract the sentences that best represent the key ideas in a longer document. You can see how Brief scores and highlights an article’s most interesting sentences in our public preview.
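To make the idea of extractive summarization concrete, here is a minimal Python sketch. It is not how Brief works: Brief uses skip-thoughts and recurrent neural networks, whereas this sketch swaps in a much simpler word-frequency score as a stand-in. The function names, the tiny stopword list, and the naive sentence splitter are all assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not taken from the report).
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
    "it", "that", "for", "on", "with", "as", "this", "was", "be",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep only non-stopword tokens."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def score_sentences(sentences):
    """Score each sentence by the mean document-wide frequency of its words."""
    freq = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for s in sentences:
        words = tokenize(s)
        scores.append(sum(freq[w] for w in words) / len(words) if words else 0.0)
    return scores


def summarize(text, k=2):
    """Return the k highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    sample = "Cats chase mice. Cats sleep all day. Dogs bark. Cats and mice are enemies."
    for sentence in summarize(sample, k=2):
        print(sentence)
```

The overall shape (split into sentences, score each one, keep the top few in reading order) is common to extractive summarizers. A neural approach would replace `score_sentences` with a learned model over sentence representations rather than raw word counts.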
Our report records the lessons we learned building our prototypes, teaching readers:
- how different algorithms represent unstructured text quantitatively
- why recent breakthroughs in deep learning allow us to model meaning
- how to build a summarization system and which pitfalls to avoid
- who the key summarization vendors are and what they offer
- where natural language processing will go in the near future
We’re excited to help our clients identify opportunities to use these capabilities in their businesses, be that to facilitate research on investments, find documents relevant for a legal matter, manage email overload after vacation, or automatically generate tweets. What else do you imagine?
Please write to us at firstname.lastname@example.org if you’d like to learn more about our research subscriptions and advising services.