On Using Very Large Target Vocabulary for Neural Machine Translation

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, Yoshua Bengio
Université de Montréal (Yoshua Bengio is also a CIFAR Senior Fellow)

Abstract

Neural machine translation, a recently proposed approach to machine translation based purely on neural networks, has shown promising results compared to existing approaches such as phrase-based statistical machine translation. Despite its recent success, neural machine translation is limited in its handling of larger vocabularies, as both training complexity and decoding complexity increase proportionally to the number of target words. In this paper, we propose a method based on importance sampling that allows us to use a very large target vocabulary without increasing training complexity. We show that decoding can be done efficiently, even with a model having a very large target vocabulary, by selecting only a small subset of the whole target vocabulary. The models trained by the proposed approach are empirically found to match, and in some cases outperform, the baseline models with a small vocabulary as well as the LSTM-based neural machine translation models. Furthermore, when we use an ensemble of a few models with very large target vocabularies, we achieve performance comparable to the state of the art (measured by BLEU) on both the English→German and English→French translation tasks of WMT’14.

1 Introduction

Neural machine translation (NMT) is a recently introduced approach to machine translation (Kalchbrenner and Blunsom, 2013; Bahdanau et al., 2015; Sutskever et al., 2014). In neural machine translation, one builds a single neural network that reads a source sentence and generates its translation. The whole network is jointly trained to maximize the conditional probability of a correct translation given a source sentence, using a bilingual corpus. NMT models have been shown to perform as well as the most widely used conventional translation systems (Sutskever et al., 2014; Bahdanau et al., 2015).

Neural machine translation has a number of advantages over existing statistical machine translation systems, specifically the phrase-based system (Koehn et al., 2003). First, NMT requires a minimal set of domain knowledge: none of the models proposed by Sutskever et al. (2014), Bahdanau et al. (2015) or Kalchbrenner and Blunsom (2013) assume any linguistic property of the source or target sentences beyond their being sequences of words. Second, the whole system is jointly trained to maximize translation performance, unlike the existing phrase-based system, which consists of many separately trained features whose weights are then tuned jointly. Lastly, the memory footprint of an NMT model is often much smaller than that of the existing systems, which rely on maintaining large tables of phrase pairs.

Despite these advantages and promising results, NMT has one major limitation compared to the existing phrase-based approach: the number of target words must be limited. This is mainly because the complexity of both training and using an NMT model increases with the number of target words.
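To see why the complexity grows with the number of target words, consider the output layer of an NMT decoder. The sketch below (our illustration in numpy with made-up dimensions, not code from the paper) scores and normalizes over every word in the vocabulary at a single step:

    import numpy as np

    # Illustrative sizes only: hidden dimension d and target vocabulary size V.
    # Real systems use V from 30k (a shortlist) up to hundreds of thousands.
    d, V = 512, 50_000

    rng = np.random.default_rng(0)
    W = rng.standard_normal((V, d)) * 0.01  # output embeddings, one row per target word
    b = np.zeros(V)                         # output biases
    z = rng.standard_normal(d)              # decoder state at one time step

    # A single output-layer step: an O(V * d) matrix-vector product followed
    # by an O(V) normalization. Training must also backpropagate through all
    # V rows of W, so time and memory both grow linearly with V.
    scores = W @ z + b
    probs = np.exp(scores - scores.max())   # subtract the max for numerical stability
    probs /= probs.sum()

The importance-sampling method proposed in this paper targets precisely this normalization cost during training.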
A usual practice is to construct a target vocabulary of the K most frequent words (a so-called shortlist), where K is often in the range of 30k (Bahdanau et al., 2015) to 80k (Sutskever et al., 2014). Any word not included in this vocabulary is mapped to a special token representing an unknown word, [UNK]. This approach works well when there are only a few unknown words in the target sentence, but it has been observed that translation performance degrades rapidly as the number of unknown words increases (Cho et al., 2014a; Bahdanau et al., 2015).

In this paper, we propose an approximate training algorithm based on (biased) importance sampling that allows us to train an NMT model with a very large target vocabulary.

2 Neural Machine Translation

[…] The decoder generates a corresponding translation (y_1, ···, y_{T′}) based on the encoded sequence of hidden states h. At each time step t, the conditional probability of the next target word y_t is computed as

    p(y_t | y_{<t}, x) ∝ exp{ q(y_{t−1}, z_t, c_t) },    (2)

where y_{t−1} is the previously generated target word, z_t is the decoder's hidden state, and c_t is the context vector computed from the source hidden states by the attention mechanism.
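The following sketch illustrates, in numpy, how the probability in Eq. (2) can be normalized over a small candidate subset of the target vocabulary instead of over all V words; the function name, the flattened score input z, and the random candidate list are our own simplifications rather than the paper's exact formulation.

    import numpy as np

    def step_probabilities(W, b, z, candidates):
        # p(y_t | y_<t, x) as in Eq. (2), with the softmax normalization
        # restricted to a small candidate subset of the target vocabulary.
        #   W, b       : output embeddings (V x d) and biases (V,)
        #   z          : a vector summarizing y_{t-1}, z_t and c_t at step t
        #   candidates : indices of candidate target words, |candidates| << V
        scores = W[candidates] @ z + b[candidates]  # O(|candidates| * d), not O(V * d)
        scores -= scores.max()                      # for numerical stability
        probs = np.exp(scores)
        return probs / probs.sum()                  # normalize over the subset only

    # Toy usage: a 50k-word vocabulary scored over a 300-word candidate list.
    rng = np.random.default_rng(1)
    d, V = 512, 50_000
    W = rng.standard_normal((V, d)) * 0.01
    b = np.zeros(V)
    z = rng.standard_normal(d)
    candidates = rng.choice(V, size=300, replace=False)  # stand-in for a real candidate list
    p = step_probabilities(W, b, z, candidates)          # probabilities over the candidates

In the paper, the candidate subset for a source sentence is built from the most frequent target words together with likely translations of the source words; the random subset above merely stands in for such a list.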