Domain Adaptation for Hindi-Telugu Machine Translation using Domain Specific Back Translation

Hema Ala, Vandan Mujadia, Dipti Misra Sharma
LTRC, IIIT Hyderabad
hema.ala@research.iiit.ac.in, vandan.mu@research.iiit.ac.in, dipti@iiit.ac.in

Abstract

In this paper, we present a novel approach for domain adaptation in Neural Machine Translation which aims to improve the translation quality over a new domain. Adapting to new domains is a highly challenging task for Neural Machine Translation on limited data, and it becomes even more difficult for technical domains such as Chemistry and Artificial Intelligence due to their specific terminology. We propose a Domain Specific Back Translation method which uses available monolingual data and generates synthetic data in a different way. This approach uses Out Of Domain words. The approach is very generic and can be applied to any language pair and any domain. We conduct our experiments on the Chemistry and Artificial Intelligence domains for Hindi and Telugu in both directions. It has been observed that the usage of synthetic data created by the proposed algorithm improves the BLEU scores significantly.

1 Introduction

Neural Machine Translation (NMT) systems have recently achieved a breakthrough in translation quality by learning an end-to-end system (Bahdanau et al., 2014; Sutskever et al., 2014). These systems perform well on the general domain on which they are trained, but they fail to produce good translations for a new domain the model is unaware of.

Adapting to a new domain is a highly challenging task for NMT systems, and it becomes even more challenging for technical domains like Chemistry and Artificial Intelligence, as they contain many domain-specific words. In a typical domain adaptation scenario like ours, we have a tremendous amount of general data on which we train an NMT model, which we can take as a baseline model; given new domain data, the challenge is to improve the translation quality for that domain using the small amount of available parallel domain data. We adopted two technical domains, namely Chemistry and Artificial Intelligence, for our Hindi -> Telugu and Telugu -> Hindi experiments.

The parallel data for the mentioned technical domains is very limited, hence we used back translation to create synthetic data. Instead of using synthetic data directly, which may contain a lot of noise, we used domain monolingual data to create synthetic data in a different way (see section 3.4), such that the translation of domain terms and the context around them is accurate.

2 Background & Motivation

As noted by Chu and Wang (2018), there are two important distinctions to make among domain adaptation methods for Machine Translation (MT). The first is based on data requirements: supervised adaptation relies on in-domain parallel data, while unsupervised adaptation has no such requirement. There is also a difference between model-based and data-based methods. Model-based methods make explicit modifications to the model architecture, such as jointly learning domain discrimination and translation (Britz et al., 2017), interpolation of language modeling and translation (Gulcehre et al., 2015; Domhan and Hieber, 2017), and domain control by adding tags and word features (Kobus et al., 2016). Zeng et al. (2019) proposed an iterative dual domain adaptation framework for NMT, which continuously and fully exploits the mutual complementarity between in-domain and out-of-domain corpora for translation knowledge transfer.
Apart from this, Freitag and Al-Onaizan (2016) proposed two approaches: one is to continue the training of the baseline (general) model only on the in-domain data, and the other is to ensemble the continued model with the baseline model at decoding time. Coming to the data-based methods for domain adaptation, it can be done in two ways: combining in-domain and out-of-domain parallel corpora for supervised adaptation (Luong et al., 2015), or generating pseudo-parallel corpora from in-domain monolingual data for unsupervised adaptation (Sennrich et al., 2015a; Currey et al., 2017).

Our approach follows a combination of both supervised and unsupervised approaches, where we first combine domain data (Chemistry and Artificial Intelligence) with general data and train a domain adaptation model. Then, as an unsupervised approach, we use the available domain monolingual data for back translation and use the result to create a domain adaptation model. Burlot and Yvon (2019) explained how monolingual data can be used effectively in MT systems. Inspired by Burlot and Yvon (2019), instead of just adding the domain parallel data, which is very small in amount, to the general data, we used the available domain monolingual data to generate synthetic parallel data.

Burlot and Yvon (2019) analyzed various ways to integrate monolingual data into an NMT framework, focusing on their impact on quality and domain adaptation. A simple way to use monolingual data in MT is to turn it into synthetic parallel data and let the training procedure run as usual (Bojar and Tamchyna, 2011), but this kind of synthetic data may contain considerable noise, which degrades performance on the domain data. Therefore, we present an approach which generates synthetic data in such a way that it is more reliable and improves the translation.

In the context of phrase-based statistical machine translation, Daumé III and Jagarlamudi (2011) noted that unseen (OOV) words account for a large portion of translation errors when switching to new domains; this problem still exists in NMT as well. Considering this issue, and inspired by Huck et al. (2019), we propose a novel approach called domain specific back translation which uses Out Of Domain (OOD) words to create synthetic data from monolingual data; it is discussed in detail in section 3.4. Huck et al. (2019) also created synthetic data using OOV words, but in a different way, whereas we use OOD words to create synthetic data.

3 Methodology

As discussed in section 2, there are many approaches for domain adaptation, mainly divided into model-based and data-based methods. Our approach falls under the data-based methods; we discuss this in detail in section 3.3. Though many domain adaptation works exist in MT, to the best of our knowledge there is no such work for Indian languages, especially one that considers technical domains like Chemistry and Artificial Intelligence. Hence there is a strong need to work on Indian languages, most of which are morphologically rich, and on these technical domains, to improve the translation of domain-specific text that contains many domain terms.

We conducted all our experiments for Hindi and Telugu in both directions for Chemistry and Artificial Intelligence. The language pair (Hindi-Telugu) considered in our experiments is morphologically rich; therefore, there are many postpositions, inflections, etc. In order to handle all these morphological inflections we used Byte Pair Encoding (BPE); a detailed explanation of BPE is given in section 3.2.

3.1 Neural Machine Translation

An NMT system attempts to find the conditional probability of the target sentence given the source sentence.
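The paper describes this objective only in words; as a point of reference, the standard factorization and training objective such a system optimizes can be written as follows (notation ours, not taken from the paper), where x is the source sentence, y = (y_1, ..., y_T) the target sentence, and D the parallel training corpus:

\[
P(y \mid x; \theta) \;=\; \prod_{t=1}^{T} P\bigl(y_t \mid y_{<t},\, x;\ \theta\bigr),
\qquad
\hat{\theta} \;=\; \arg\max_{\theta} \sum_{(x, y) \in \mathcal{D}} \log P(y \mid x; \theta).
\]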
There exist several techniques to parameterize these conditional probabilities. Kalchbrenner and Blunsom (2013) used a combination of a convolutional neural network and a recurrent neural network, Sutskever et al. (2014) used a deep Long Short-Term Memory (LSTM) model, Cho et al. (2014) used an architecture similar to the LSTM, and Bahdanau et al. (2014) used a more elaborate neural network architecture with an attention mechanism over the input sequence. However, all these approaches are based on RNNs and LSTMs, and because of the characteristics of RNNs they cannot be trained in parallel over a sequence, so model training time is often long. Addressing this issue, Vaswani et al. (2017) proposed the Transformer framework based on a self-attention mechanism. Following Vaswani et al. (2017), we used the Transformer architecture in all our experiments.

3.2 Byte Pair Encoding

BPE (Gage, 1994) is a data compression technique that substitutes the most frequent pair of bytes in a sequence with a byte that does not occur within that data. Using this, we can acquire a vocabulary of the desired size and can handle rare and unknown words as well (Sennrich et al., 2015b). As Telugu and Hindi are morphologically rich languages, Telugu in particular being more agglutinative, there is a need to handle postpositions, compound words, etc. BPE helps here by separating suffixes, prefixes, and compound words. NMT with BPE has yielded significant improvements in translation quality for low-resource, morphologically rich languages (Pinnis et al., 2017). We adopted the same for our experiments and obtained the best results with a vocabulary size of 30000.
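To make the subword segmentation concrete, the following is a minimal, self-contained sketch of the core BPE merge loop in the spirit of Sennrich et al. (2015b). It is our own toy illustration, not the paper's setup: the toy vocabulary, the number of merges, and the helper names are ours, and in practice one would use an existing BPE toolkit with the 30000-symbol vocabulary mentioned above.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair into one new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: words split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10  # a real setup would use far more merges (e.g. a 30000-symbol vocabulary)
for _ in range(num_merges):
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```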
3.3 Domain Adaptation

Domain adaptation aims to improve the translation performance of a model (trained on general data) on a new domain by leveraging the available domain parallel data. As discussed in section 2, there are multiple approaches, broadly divided into model-based and data-based; our approach falls under the data-based methods, where one combines the small amount of available domain parallel data with the general data. In this paper we show how the usage of domain-specific synthetic data improves the translation performance significantly. The main goal of this method is to use domain-specific synthetic parallel data, generated with the approach described in section 3.4, along with the small amount of domain parallel data.

3.4 Domain Specific Back Translation

In our experiments we followed the data-based approach: we combined domain data with general data and trained a new model as a domain adaptation model. Because the domain data is very small, we can use the available monolingual data to generate synthetic parallel data. Leveraging monolingual data has brought significant improvements in NMT (Domhan and Hieber, 2017; Burlot and Yvon, 2019; Bojar and Tamchyna, 2011; Gulcehre et al., 2015). Using back translation we can generate synthetic parallel data, but it might be very noisy, which will decrease the domain-specific translation performance. Hence we need an approach which extracts only useful sentences and creates synthetic data. Our approach addresses this by creating domain-specific back-translated data using the algorithm given in Algorithm 1. Domain-specific back translation tries to improve overall translation quality, particularly the translation of domain terms and the domain-specific context around them, implicitly.

Domains     #Sentences   #Tkns(te)   #Unique Tkns(te)   #Tkns(hi)   #Unique Tkns(hi)
General     431975       5021240     443052             7995403     123716
AI          5272         57051       11900              89392       5479
Chemistry   3600         72166       10166              97243       6792

Table 1: Parallel Data for Hindi - Telugu

Langs    #Sent    #Tkns     UTkns
Hindi    16345    175931    17405
Telugu   39583    339612    86942

Table 2: Monolingual Data

Domain-Lang        #Sentences   #Tkns
AI-Hindi           14014        438848
AI-Telugu          22241        285234
Chemistry-Hindi    28672        982700
Chemistry-Telugu   34322        425515

Table 3: Selected monolingual data for domain specific back translation

Algorithm 1: Generic Algorithm for Domain Specific Back Translation

Let us say L1 and L2 are a language pair (translation can be done in both directions, L1 -> L2 and L2 -> L1).
1. Training Corpus: take all available L1 - L2 data (except the domain data).
2. Train two NMT models (1. L1 -> L2 [L1-L2], 2. L2 -> L1 [L2-L1]).
3. for each domain do
   1. Take the L1 domain data and list all Out Of Domain words with respect to the L1 Training Corpus [say this is OOD-L1 for the given domain].
   2. Take the L2 domain data and list all Out Of Domain words with respect to the L2 Training Corpus [say this is OOD-L2 for the given domain].
   end
4. Now take monolingual data for L1 and L2.
5. for each domain do
   1. Get N sentences from the L1 monolingual data where OOD-L1 words are present [Mono-L1].
   2. Get N sentences from the L2 monolingual data where OOD-L2 words are present [Mono-L2].
   3. Run L2-L1 on Mono-L2 to get back-translated data for L1 -> L2 (BT[L1-L2]).
   4. Run L1-L2 on Mono-L1 to get back-translated data for L2 -> L1 (BT[L2-L1]).
   end
*. Steps to extract OOD words (mentioned in step 3), for all domains and all languages:
   for each word in the unique words of the domain data do
     if the word is not in the unique words of the general data, it is extracted as an OOD word with respect to that domain.
   end
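As an illustration of how step 3 and the final OOD-extraction steps of Algorithm 1 could be realized, here is a minimal sketch assuming whitespace-tokenized, one-sentence-per-line corpora; the file names and helper names are illustrative assumptions, not the authors' actual scripts.

```python
def unique_words(path):
    """Return the set of unique tokens in a whitespace-tokenized corpus file."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            words.update(line.split())
    return words

def extract_ood_words(domain_corpus, general_corpus):
    """OOD words: tokens seen in the domain corpus but never in the general corpus."""
    return unique_words(domain_corpus) - unique_words(general_corpus)

if __name__ == "__main__":
    # e.g. Chemistry OOD words on the Hindi side, relative to the general Hindi data
    chem_ood_hi = extract_ood_words("chemistry.hi", "general.hi")
    print(len(chem_ood_hi), "Chemistry OOD words (Hindi)")
```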
The generic algorithm for domain-specific back translation is described in Algorithm 1. The algorithm is very generic and can be applied to any language pair and any domain. In our experiments, we adopted two domains, namely Chemistry and Artificial Intelligence, and one language pair, Hindi and Telugu, in both directions.

Let us consider the mentioned languages in terms of Algorithm 1, with L1 as Hindi and L2 as Telugu, and with Chemistry and Artificial Intelligence as the domains. Each step of the algorithm can then be interpreted as follows.

Step 1. The training corpus is the general data mentioned in Table 1.
Step 2. We train two models using the training corpus from the above step, one from Hindi to Telugu and the other from Telugu to Hindi. These models can be treated as base models.
Step 3. This step finds the OOD words and can be done as follows (in Algorithm 1 this step is explained in detail at the end).
Step 3.1. Get the unique words of the general corpus, say Gen-Unique, for both languages.
Step 3.2. Get the unique words of the Chemistry corpus, Chem-Unique, for both languages.
Step 3.3. Get the unique words of the AI corpus, AI-Unique, for both languages.
Step 3.4. Now take each word from Chem-Unique and look it up in Gen-Unique; if it is not found, it is considered a Chemistry OOD word. We thus get OOD words for Hindi and for Telugu with respect to Chemistry.
Step 3.5. Take each word from AI-Unique and look it up in Gen-Unique; if it is not found, it is considered an AI OOD word. We thus get OOD words for Hindi and for Telugu with respect to AI.
Step 4. Take the monolingual data for both languages mentioned in Table 2.
Step 5. Extract sentences from the Hindi monolingual data where Hindi OOD words w.r.t. Chemistry are present [Chem-Mono-Hindi].
Step 5.1. Extract sentences from the Telugu monolingual data where Telugu OOD words w.r.t. Chemistry are present [Chem-Mono-Telugu].
Step 5.2. Extract sentences from the Hindi monolingual data where Hindi OOD words w.r.t. AI are present [AI-Mono-Hindi].
Step 5.3. Extract sentences from the Telugu monolingual data where Telugu OOD words w.r.t. AI are present [AI-Mono-Telugu].
Step 6. Run the Hindi -> Telugu model from step 2 on Chem-Mono-Hindi to get back-translated data [BT-Chem-Hindi-Telugu].
Step 6.1. Run the Telugu -> Hindi model from step 2 on Chem-Mono-Telugu to get back-translated data [BT-Chem-Telugu-Hindi].
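The sentence selection and back-translation steps above (steps 5 through 6.1) could be wired together roughly as sketched below. The translate argument is a hypothetical stand-in for decoding with one of the trained base models from step 2, and the file and variable names are ours, not the paper's.

```python
def select_sentences(mono_path, ood_words, limit=None):
    """Keep monolingual sentences containing at least one OOD word (step 5)."""
    selected = []
    with open(mono_path, encoding="utf-8") as f:
        for line in f:
            if ood_words.intersection(line.split()):
                selected.append(line.strip())
                if limit is not None and len(selected) >= limit:
                    break
    return selected

def back_translate(sentences, translate):
    """Pair each selected sentence with its machine translation (step 6)."""
    return [(sentence, translate(sentence)) for sentence in sentences]

# Example wiring for the Chemistry domain (hi_to_te_model is a placeholder
# for the Hindi -> Telugu base model from step 2):
# chem_mono_hi = select_sentences("mono.hi", chem_ood_hi)
# bt_chem_hi_te = back_translate(chem_mono_hi, translate=hi_to_te_model)
```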