Yesterday, Google released new TensorFlow model code for text summarization, specifically for generating news headlines on the Annotated English Gigaword dataset. We’re excited to see others working on summarization, as we did in our last report: our ability to “digest large amounts of information in a compressed form” will only become more important as unstructured information grows.
The TensorFlow release uses sequence-to-sequence learning to train models that write headlines for news articles. Interestingly, the models output abstractive, not extractive, summaries. Extractive summarization involves weighting the words or sentences in a document according to some metric, and then selecting those with high scores as proxies for the document's important content. Abstractive summarization looks more like a human-written summary: the model takes in a document and restates its main points in its own words. It’s a hard problem to solve.
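To make the extractive approach concrete, here is a minimal sketch in Python. It is not the TensorFlow release, the NAMAS model, or the method behind Brief; it assumes average word frequency as the scoring metric, a tiny hand-picked stopword list, and a naive punctuation-based sentence splitter, all of which are illustrative choices.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "it", "for", "on"}


def tokenize(text):
    """Lowercase the text and keep only non-stopword tokens."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))  # document-wide word frequencies

    def score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        # The metric (an assumption): mean document frequency of the words.
        return sum(freq[w] for w in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return [sentences[i] for i in chosen]


doc = ("The model writes headlines. Headlines summarize news articles. "
       "The weather was nice.")
print(extractive_summary(doc, num_sentences=2))
```

Every sentence in the output is copied verbatim from the input, which is what distinguishes this family of methods from abstractive models that generate new wording.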
Like the Facebook NAMAS model, the TensorFlow code works well on relatively short input (100 words for Facebook; the first few sentences of an article for Google), but struggles to achieve strong results on longer, more complicated text. We faced similar challenges when we built Brief (our summarization prototype) and opted for extractive summaries so that we could provide meaningful results on long-form articles like those in the New Yorker or n+1. We anticipate quick progress on abstractive summarization this year, given recent advances in recurrent neural nets and this new release.
If you’d like to learn more about summarization, contact us (email@example.com) to discuss our research report and prototype, or come hear Mike Williams’ talk at Strata on September 28!