Yesterday, Google released new TensorFlow model code for text summarization, specifically for generating news headlines on the Annotated English Gigaword dataset. We’re excited to see others working on summarization, as we did in our last report: our ability to “digest large amounts of information in a compressed form” will only become more important as unstructured information grows.
The TensorFlow release uses sequence-to-sequence learning to train models that write headlines for news articles. Interestingly, the models output abstractive - not extractive - summaries. Extractive summarization involves weighing words/sentences in a document according to some metric, and then selecting those words/sentences with high scores as proxies for the important content in a document. Abstractive summarization looks more like a human-written summary: inputting a document and outputting the points in one’s own words. It’s a hard problem to solve.
Like the Facebook NAMAS model, the TensorFlow code works well on relatively short input data (100 words for Facebook; the first few sentences of an article for Google), but struggles to achieve strong results on longer, more complicated text. We faced similar challenges when we built Brief (our summarization prototype) and decided to opt for extractive summaries to provide meaningful results on long-form articles like those in the New Yorker or the n+1. We anticipate quick progress on abstractive summarization this year, given progress with recurrent neural nets and this new release.
If you’d like to learn more about summarization, contact us (email@example.com) to discuss our research report & prototype or come hear Mike Williams’ talk at Strata September 28!
More from the Blog
Aug 24 2016
with — Building shadows as proxies for construction rates in Shanghai. Photos courtesy of Orbital Insight/Digital Globe. It’s no small feat to commercialize new technologies that arise from scientific and academic research. The useful is a small subset of the possible, and the features technology users (let alone corporate buyers) care about rarely align with the problems researchers want to solve...
Aug 26 2016
by — This is a guest post featuring a project Patrick Doupe, now a Senior Data Analyst at Icahn School of Medicine at Mount Sinai, completed as a fellow in the Insight Data Science program. In our partnership with Insight, we occassionally advise fellows on month-long projects and how to build a career in data science. Machines are getting better at identifying objects in images. These technologies...
Jul 6 2017
by — This article contains highlights from a series of three interactive video tutorials on probabilistic programming from scratch published on O’Reilly Safari (login required). If you’re interested in the business case for probabilistic programming the Fast Forward Labs report discusses it in detail, and compares modern industrial strength systems like Stan and PyMC3. Please get in touch if you’re...