On Using Very Large Target Vocabulary for Neural Machine Translation

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, Yoshua Bengio
Université de Montréal (Yoshua Bengio is also a CIFAR Senior Fellow)

Abstract

Neural machine translation, a recently proposed approach to machine translation based purely on neural networks, has shown promising results compared to existing approaches such as phrase-based statistical machine translation. Despite its recent success, neural machine translation is limited in its handling of larger vocabularies, as both training complexity and decoding complexity increase proportionally to the number of target words. In this paper, we propose a method based on importance sampling that allows us to use a very large target vocabulary without increasing training complexity. We show that decoding can be done efficiently, even with a model having a very large target vocabulary, by selecting only a small subset of the whole target vocabulary. The models trained by the proposed approach are empirically found to match, and in some cases outperform, the baseline models with a small vocabulary as well as the LSTM-based neural machine translation models. Furthermore, when we use an ensemble of a few models with very large target vocabularies, we achieve performance comparable to the state of the art (measured by BLEU) on both the English→German and English→French translation tasks of WMT’14.

1 Introduction

Neural machine translation (NMT) is a recently introduced approach to machine translation (Kalchbrenner and Blunsom, 2013; Bahdanau et al., 2015; Sutskever et al., 2014). In neural machine translation, one builds a single neural network that reads a source sentence and generates its translation. The whole network is jointly trained to maximize the conditional probability of a correct translation given a source sentence, using a bilingual corpus. NMT models have been shown to perform as well as the most widely used conventional translation systems (Sutskever et al., 2014; Bahdanau et al., 2015).

Neural machine translation has a number of advantages over existing statistical machine translation systems, specifically the phrase-based system (Koehn et al., 2003). First, NMT requires a minimal set of domain knowledge: none of the models proposed by Sutskever et al. (2014), Bahdanau et al. (2015) or Kalchbrenner and Blunsom (2013) assume any linguistic property of the source or target sentences beyond their being sequences of words. Second, the whole system is jointly trained to maximize translation performance, unlike the existing phrase-based system, which consists of many separately trained features whose weights are then tuned jointly. Lastly, the memory footprint of an NMT model is often much smaller than that of the existing systems, which rely on maintaining large tables of phrase pairs.

Despite these advantages and promising results, NMT has one major limitation compared to the existing phrase-based approach: the number of target words must be limited. This is mainly because the complexity of both training and using an NMT model increases with the number of target words.
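To see why the complexity grows with the number of target words, consider the output layer of an NMT decoder. The sketch below (our illustration in numpy with made-up dimensions, not code from the paper) scores and normalizes over every word in the vocabulary at a single step:

    import numpy as np

    # Illustrative sizes only: hidden dimension d and target vocabulary size V.
    # Real systems use V from 30k (a shortlist) up to hundreds of thousands.
    d, V = 512, 50_000

    rng = np.random.default_rng(0)
    W = rng.standard_normal((V, d)) * 0.01  # output embeddings, one row per target word
    b = np.zeros(V)                         # output biases
    z = rng.standard_normal(d)              # decoder state at one time step

    # A single output-layer step: an O(V * d) matrix-vector product followed
    # by an O(V) normalization. Training must also backpropagate through all
    # V rows of W, so time and memory both grow linearly with V.
    scores = W @ z + b
    probs = np.exp(scores - scores.max())   # subtract the max for numerical stability
    probs /= probs.sum()

The importance-sampling method proposed in this paper targets precisely this normalization cost during training.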
A usual practice is to construct a target vocabulary of the K most frequent words (a so-called shortlist), where K is often in the range of 30k (Bahdanau et al., 2015) to 80k (Sutskever et al., 2014). Any word not included in this vocabulary is mapped to a special token representing an unknown word, [UNK]. This approach works well when there are only a few unknown words in the target sentence, but it has been observed that translation performance degrades rapidly as the number of unknown words increases (Cho et al., 2014a; Bahdanau et al., 2015).

In this paper, we propose an approximate training algorithm based on (biased) importance sampling that allows us to train an NMT model with a very large target vocabulary.

2 Neural Machine Translation

[…] The decoder generates a corresponding translation (y_1, ···, y_{T′}) based on the encoded sequence of hidden states h. At each time step t, the conditional probability of the next target word y_t is computed as

    p(y_t | y_{<t}, x) ∝ exp{ q(y_{t−1}, z_t, c_t) },    (2)

where y_{t−1} is the previously generated target word, z_t is the decoder's hidden state, and c_t is the context vector computed from the source hidden states by the attention mechanism.
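The following sketch illustrates, in numpy, how the probability in Eq. (2) can be normalized over a small candidate subset of the target vocabulary instead of over all V words; the function name, the flattened score input z, and the random candidate list are our own simplifications rather than the paper's exact formulation.

    import numpy as np

    def step_probabilities(W, b, z, candidates):
        # p(y_t | y_<t, x) as in Eq. (2), with the softmax normalization
        # restricted to a small candidate subset of the target vocabulary.
        #   W, b       : output embeddings (V x d) and biases (V,)
        #   z          : a vector summarizing y_{t-1}, z_t and c_t at step t
        #   candidates : indices of candidate target words, |candidates| << V
        scores = W[candidates] @ z + b[candidates]  # O(|candidates| * d), not O(V * d)
        scores -= scores.max()                      # for numerical stability
        probs = np.exp(scores)
        return probs / probs.sum()                  # normalize over the subset only

    # Toy usage: a 50k-word vocabulary scored over a 300-word candidate list.
    rng = np.random.default_rng(1)
    d, V = 512, 50_000
    W = rng.standard_normal((V, d)) * 0.01
    b = np.zeros(V)
    z = rng.standard_normal(d)
    candidates = rng.choice(V, size=300, replace=False)  # stand-in for a real candidate list
    p = step_probabilities(W, b, z, candidates)          # probabilities over the candidates

In the paper, the candidate subset for a source sentence is built from the most frequent target words together with likely translations of the source words; the random subset above merely stands in for such a list.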