Sequence labeling tasks assign a categorical label to each element of a sequence. In natural language processing, where a sequence is usually a sentence, examples of sequence labeling include named entity recognition (NER), part-of-speech (POS) tagging, and error detection. NER, as the name implies, tries to recognize names in a sentence and classify them into pre-defined categories such as Person and Organization. POS tagging assigns labels such as noun, verb, and adjective to each word, while error detection identifies grammatical errors in sentences. In many of these tasks, the relevant labels in the dataset are very sparse, so most of the words contribute very little to the training process. But why let the data go to waste?
A recent paper proposes using multitask learning to make better use of the available data. In addition to assigning a label to each token (or word, loosely), the authors propose a model that also predicts the surrounding words in the dataset. By adding the secondary unsupervised objective, “the model is required to learn more general patterns of semantic and syntactic composition, which can be reused in order to predict individual labels more accurately”.
For the sequence labeling neural network, the authors take one sentence as input and use a bidirectional Long Short-Term Memory (LSTM) network to assign a label to every token in the sentence. Each sentence is first tokenized, and the resulting tokens are mapped to a sequence of word embeddings before being fed into the LSTM. Two LSTM components, moving in opposite directions (forward and backward) through the sentence, construct context-dependent representations for every word. The hidden representations from both LSTMs are concatenated to obtain a context-specific representation for each word. This concatenated representation is passed through a feed-forward layer, allowing the model to learn features based on both context directions. To predict a label for each token, the authors use either a softmax or a conditional random field (CRF) output architecture. Softmax predicts each label independently. A CRF, on the other hand, models dependencies between successive labels by searching for the best label sequence as a whole.
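The difference between the two output architectures can be illustrated with a small decoding sketch. This is not the authors' implementation: the per-token scores, the label set, and the single transition penalty below are made-up values, and a real CRF would learn its transition scores during training. The sketch only shows why scoring the whole sequence can give a different answer than choosing each label on its own.

```python
# Hypothetical label set and per-token scores (e.g. log-probabilities from the
# feed-forward layer) for the tokens "John Smith spoke". All numbers are made up.
LABELS = ["O", "B-PER", "I-PER"]
emissions = [
    {"O": 0.5, "B-PER": 0.4, "I-PER": 0.1},    # John
    {"O": 0.2, "B-PER": 0.1, "I-PER": 0.7},    # Smith
    {"O": 0.9, "B-PER": 0.05, "I-PER": 0.05},  # spoke
]
# Assumed transition scores: strongly penalize I-PER directly after O.
# Pairs that are not listed score 0.0.
transitions = {("O", "I-PER"): -10.0}


def greedy_decode(emissions, labels):
    """Softmax-style decoding: pick the best label for each token independently."""
    return [max(labels, key=lambda lab: scores[lab]) for scores in emissions]


def viterbi_decode(emissions, transitions, labels):
    """CRF-style decoding: find the label sequence with the highest total score."""
    # best[i][lab] is the best score of any path that ends in `lab` at position i.
    best = [{lab: emissions[0][lab] for lab in labels}]
    backpointers = []
    for scores in emissions[1:]:
        prev = best[-1]
        current, pointers = {}, {}
        for lab in labels:
            # Best previous label, taking the transition score into account.
            p = max(labels, key=lambda q: prev[q] + transitions.get((q, lab), 0.0))
            current[lab] = prev[p] + transitions.get((p, lab), 0.0) + scores[lab]
            pointers[lab] = p
        best.append(current)
        backpointers.append(pointers)
    # Follow the backpointers from the best final label to recover the path.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


print(greedy_decode(emissions, LABELS))
print(viterbi_decode(emissions, transitions, LABELS))
```

With these scores, independent decoding yields `O, I-PER, O`, which starts an entity with an "inside" tag. Sequence-level decoding yields `B-PER, I-PER, O`, because the transition penalty makes the slightly lower-scoring `B-PER` the better choice for the first token.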
To predict the surrounding words, the authors cannot use the concatenated (forward and backward) representation, because it already contains information about both the previous word and the next word, which would make the prediction trivial. Instead, they use the separate hidden representations from before the concatenation. The hidden representation from the forward-moving LSTM is used to predict the next word; the hidden representation from the backward-moving LSTM is used to predict the previous word.
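As a rough sketch of what this secondary objective looks like in terms of data, the snippet below builds the prediction targets for each direction and combines the resulting losses with the labeling loss. The boundary markers `<s>` and `</s>`, the function names, and the weight `gamma` with its default value are assumptions made for illustration; the text above does not specify how the objectives are weighted.

```python
def lm_targets(tokens, bos="<s>", eos="</s>"):
    """Build the word-prediction targets for every position in a sentence.

    next_words[i] is what the forward LSTM state at position i should predict;
    prev_words[i] is what the backward LSTM state at position i should predict.
    The boundary markers are assumed placeholders for the sentence edges.
    """
    next_words = tokens[1:] + [eos]
    prev_words = [bos] + tokens[:-1]
    return next_words, prev_words


def combined_loss(label_loss, fwd_lm_loss, bwd_lm_loss, gamma=0.1):
    """Add the two language modeling losses to the main labeling loss.

    gamma is a hypothetical weight that keeps the secondary objective from
    dominating the labeling objective.
    """
    return label_loss + gamma * (fwd_lm_loss + bwd_lm_loss)


tokens = ["the", "cat", "sat"]
next_words, prev_words = lm_targets(tokens)
for token, nxt, prv in zip(tokens, next_words, prev_words):
    print(f"{token}: forward predicts {nxt!r}, backward predicts {prv!r}")
```

Every token supplies two extra training targets regardless of whether it carries an interesting label, which is how the otherwise "wasted" words end up contributing to training.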
The architecture was evaluated on a range of datasets, covering the tasks of error detection, named entity recognition, chunking, and POS tagging. Introducing the secondary task resulted in consistent performance improvements on every benchmark. The largest benefit was observed on the task of error detection, perhaps due to the very sparse and unbalanced label distribution in the dataset.