Domain Adaptation for Hindi-Telugu Machine Translation using Domain Specific Back Translation

Hema Ala, Vandan Mujadia, Dipti Misra Sharma
LTRC, IIIT Hyderabad
hema.ala@research.iiit.ac.in, vandan.mu@research.iiit.ac.in, dipti@iiit.ac.in

Abstract

In this paper, we present a novel approach for domain adaptation in Neural Machine Translation which aims to improve the translation quality over a new domain. Adapting to new domains is a highly challenging task for Neural Machine Translation on limited data, and it becomes even more difficult for technical domains such as Chemistry and Artificial Intelligence due to their specific terminology. We propose a Domain Specific Back Translation method which uses available monolingual data and generates synthetic data in a different way. This approach uses Out Of Domain words. The approach is very generic and can be applied to any language pair and any domain. We conduct our experiments on the Chemistry and Artificial Intelligence domains for Hindi and Telugu in both directions. It has been observed that the usage of synthetic data created by the proposed algorithm improves the BLEU scores significantly.

1 Introduction

Neural Machine Translation (NMT) systems have recently achieved a breakthrough in translation quality by learning an end-to-end system (Bahdanau et al., 2014; Sutskever et al., 2014). These systems perform well on the general domain on which they are trained, but they fail to produce good translations for a new domain the model is unaware of.

Adapting to a new domain is a highly challenging task for NMT systems, and it becomes even more challenging for technical domains like Chemistry and Artificial Intelligence, as they contain many domain-specific words. In a typical domain adaptation scenario like ours, we have a tremendous amount of general data on which we train an NMT model, which we can take as a baseline model; given new domain data, the challenge is to improve the translation quality for that domain using the small amount of available parallel domain data. We adopted two technical domains, namely Chemistry and Artificial Intelligence, for our Hindi -> Telugu and Telugu -> Hindi experiments.

The parallel data for the mentioned technical domains is very limited, hence we used back translation to create synthetic data. Instead of using synthetic data directly, which may contain a lot of noise, we used domain monolingual data to create synthetic data in a different way (see section 3.4), such that the translation of domain terms and the context around them is accurate.

2 Background & Motivation

As noted by Chu and Wang (2018), there are two important distinctions to make among domain adaptation methods for Machine Translation (MT). The first is based on data requirements: supervised adaptation relies on in-domain parallel data, while unsupervised adaptation has no such requirement. There is also a difference between model-based and data-based methods. Model-based methods make explicit modifications to the model architecture, such as jointly learning domain discrimination and translation (Britz et al., 2017), interpolation of language modeling and translation (Gulcehre et al., 2015; Domhan and Hieber, 2017), and domain control by adding tags and word features (Kobus et al., 2016). Zeng et al. (2019) proposed an iterative dual domain adaptation framework for NMT, which continuously and fully exploits the mutual complementarity between in-domain and out-of-domain corpora for translation knowledge transfer.
Apart from this, Freitag and Al-Onaizan (2016) proposed two approaches: one is to continue the training of the baseline (general) model only on the in-domain data, and the other is to ensemble the continued model with the baseline model at decoding time. Coming to the data-based methods for domain adaptation, it can be done in two ways: combining in-domain and out-of-domain parallel corpora for supervised adaptation (Luong et al., 2015), or generating pseudo-parallel corpora from in-domain monolingual data for unsupervised adaptation (Sennrich et al., 2015a; Currey et al., 2017).

Our approach follows a combination of both supervised and unsupervised approaches, where we first combine domain data (Chemistry and Artificial Intelligence) with general data and train a domain adaptation model. Then, as an unsupervised approach, we use the available domain monolingual data for back translation and use the result to create a domain adaptation model. Burlot and Yvon (2019) explained how monolingual data can be used effectively in MT systems. Inspired by Burlot and Yvon (2019), instead of just adding the domain parallel data, which is very small in amount, to the general data, we used the available domain monolingual data to generate synthetic parallel data.

Burlot and Yvon (2019) analyzed various ways to integrate monolingual data into an NMT framework, focusing on their impact on quality and domain adaptation. A simple way to use monolingual data in MT is to turn it into synthetic parallel data and let the training procedure run as usual (Bojar and Tamchyna, 2011), but this kind of synthetic data may contain considerable noise, which degrades performance on the domain data. Therefore, we present an approach which generates synthetic data in such a way that it is more reliable and improves the translation.

In the context of phrase-based statistical machine translation, Daumé III and Jagarlamudi (2011) noted that unseen (OOV) words account for a large portion of translation errors when switching to new domains; this problem still exists in NMT as well. Considering this issue, and inspired by Huck et al. (2019), we propose a novel approach called domain specific back translation which uses Out Of Domain (OOD) words to create synthetic data from monolingual data; it is discussed in detail in section 3.4. Huck et al. (2019) also created synthetic data using OOV words, but in a different way, whereas we use OOD words to create synthetic data.

3 Methodology

As discussed in section 2, there are many approaches for domain adaptation, mainly divided into model-based and data-based methods. Our approach falls under the data-based methods; we discuss this in detail in section 3.3. Though many domain adaptation works exist in MT, to the best of our knowledge there is no such work for Indian languages, especially one that considers technical domains like Chemistry and Artificial Intelligence. Hence there is a strong need to work on Indian languages, most of which are morphologically rich, and on these technical domains, to improve the translation of domain-specific text that contains many domain terms.

We conducted all our experiments for Hindi and Telugu in both directions for Chemistry and Artificial Intelligence. The language pair (Hindi-Telugu) considered in our experiments is morphologically rich; therefore, there are many postpositions, inflections, etc. In order to handle all these morphological inflections we used Byte Pair Encoding (BPE); a detailed explanation of BPE is given in section 3.2.

3.1 Neural Machine Translation

An NMT system attempts to find the conditional probability of the target sentence given the source sentence.
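The paper describes this objective only in words; as a point of reference, the standard factorization and training objective such a system optimizes can be written as follows (notation ours, not taken from the paper), where x is the source sentence, y = (y_1, ..., y_T) the target sentence, and D the parallel training corpus:

\[
P(y \mid x; \theta) \;=\; \prod_{t=1}^{T} P\bigl(y_t \mid y_{<t},\, x;\ \theta\bigr),
\qquad
\hat{\theta} \;=\; \arg\max_{\theta} \sum_{(x, y) \in \mathcal{D}} \log P(y \mid x; \theta).
\]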
There exist several techniques to parameterize these conditional probabilities. Kalchbrenner and Blunsom (2013) used a combination of a convolutional neural network and a recurrent neural network, Sutskever et al. (2014) used a deep Long Short-Term Memory (LSTM) model, Cho et al. (2014) used an architecture similar to the LSTM, and Bahdanau et al. (2014) used a more elaborate neural network architecture with an attention mechanism over the input sequence. However, all these approaches are based on RNNs and LSTMs, and because of the characteristics of RNNs they cannot be trained in parallel over a sequence, so model training time is often long. Addressing this issue, Vaswani et al. (2017) proposed the Transformer framework based on a self-attention mechanism. Following Vaswani et al. (2017), we used the Transformer architecture in all our experiments.

3.2 Byte Pair Encoding

BPE (Gage, 1994) is a data compression technique that substitutes the most frequent pair of bytes in a sequence with a byte that does not occur within that data. Using this, we can acquire a vocabulary of the desired size and can handle rare and unknown words as well (Sennrich et al., 2015b). As Telugu and Hindi are morphologically rich languages, Telugu in particular being more agglutinative, there is a need to handle postpositions, compound words, etc. BPE helps here by separating suffixes, prefixes, and compound words. NMT with BPE has yielded significant improvements in translation quality for low-resource, morphologically rich languages (Pinnis et al., 2017). We adopted the same for our experiments and obtained the best results with a vocabulary size of 30000.
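To make the subword segmentation concrete, the following is a minimal, self-contained sketch of the core BPE merge loop in the spirit of Sennrich et al. (2015b). It is our own toy illustration, not the paper's setup: the toy vocabulary, the number of merges, and the helper names are ours, and in practice one would use an existing BPE toolkit with the 30000-symbol vocabulary mentioned above.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair into one new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: words split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10  # a real setup would use far more merges (e.g. a 30000-symbol vocabulary)
for _ in range(num_merges):
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```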
3.3 Domain Adaptation

Domain adaptation aims to improve the translation performance of a model (trained on general data) on a new domain by leveraging the available domain parallel data. As discussed in section 2, there are multiple approaches, broadly divided into model-based and data-based; our approach falls under the data-based methods, where one combines the small amount of available domain parallel data with the general data. In this paper we show how the usage of domain-specific synthetic data improves the translation performance significantly. The main goal of this method is to use domain-specific synthetic parallel data, generated with the approach described in section 3.4, along with the small amount of domain parallel data.

3.4 Domain Specific Back Translation

In our experiments we followed the data-based approach: we combined domain data with general data and trained a new model as a domain adaptation model. Because the domain data is very small, we can use the available monolingual data to generate synthetic parallel data. Leveraging monolingual data has brought significant improvements in NMT (Domhan and Hieber, 2017; Burlot and Yvon, 2019; Bojar and Tamchyna, 2011; Gulcehre et al., 2015). Using back translation we can generate synthetic parallel data, but it might be very noisy, which will decrease the domain-specific translation performance. Hence we need an approach which extracts only useful sentences and creates synthetic data. Our approach addresses this by creating domain-specific back-translated data using the algorithm given in Algorithm 1. Domain-specific back translation tries to improve overall translation quality, particularly the translation of domain terms and the domain-specific context around them, implicitly.

Domains     #Sentences   #Tkns(te)   #Unique Tkns(te)   #Tkns(hi)   #Unique Tkns(hi)
General     431975       5021240     443052             7995403     123716
AI          5272         57051       11900              89392       5479
Chemistry   3600         72166       10166              97243       6792

Table 1: Parallel Data for Hindi - Telugu

Langs    #Sent    #Tkns     UTkns
Hindi    16345    175931    17405
Telugu   39583    339612    86942

Table 2: Monolingual Data

Domain-Lang        #Sentences   #Tkns
AI-Hindi           14014        438848
AI-Telugu          22241        285234
Chemistry-Hindi    28672        982700
Chemistry-Telugu   34322        425515

Table 3: Selected monolingual data for domain specific back translation

Algorithm 1: Generic Algorithm for Domain Specific Back Translation

Let us say L1 and L2 are a language pair (translation can be done in both directions, L1 -> L2 and L2 -> L1).
1. Training Corpus: take all available L1 - L2 data (except the domain data).
2. Train two NMT models (1. L1 -> L2 [L1-L2], 2. L2 -> L1 [L2-L1]).
3. for each domain do
   1. Take the L1 domain data and list all Out Of Domain words with respect to the L1 Training Corpus [say this is OOD-L1 for the given domain].
   2. Take the L2 domain data and list all Out Of Domain words with respect to the L2 Training Corpus [say this is OOD-L2 for the given domain].
   end
4. Now take monolingual data for L1 and L2.
5. for each domain do
   1. Get N sentences from the L1 monolingual data where OOD-L1 words are present [Mono-L1].
   2. Get N sentences from the L2 monolingual data where OOD-L2 words are present [Mono-L2].
   3. Run L2-L1 on Mono-L2 to get back-translated data for L1 -> L2 (BT[L1-L2]).
   4. Run L1-L2 on Mono-L1 to get back-translated data for L2 -> L1 (BT[L2-L1]).
   end
*. Steps to extract OOD words (mentioned in step 3), for all domains and all languages:
   for each word in the unique words of the domain data do
     if the word is not in the unique words of the general data, it is extracted as an OOD word with respect to that domain.
   end
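As an illustration of how step 3 and the final OOD-extraction steps of Algorithm 1 could be realized, here is a minimal sketch assuming whitespace-tokenized, one-sentence-per-line corpora; the file names and helper names are illustrative assumptions, not the authors' actual scripts.

```python
def unique_words(path):
    """Return the set of unique tokens in a whitespace-tokenized corpus file."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            words.update(line.split())
    return words

def extract_ood_words(domain_corpus, general_corpus):
    """OOD words: tokens seen in the domain corpus but never in the general corpus."""
    return unique_words(domain_corpus) - unique_words(general_corpus)

if __name__ == "__main__":
    # e.g. Chemistry OOD words on the Hindi side, relative to the general Hindi data
    chem_ood_hi = extract_ood_words("chemistry.hi", "general.hi")
    print(len(chem_ood_hi), "Chemistry OOD words (Hindi)")
```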
The generic algorithm for domain-specific back translation is described in Algorithm 1. The algorithm is very generic and can be applied to any language pair and any domain. In our experiments, we adopted two domains, namely Chemistry and Artificial Intelligence, and one language pair, Hindi and Telugu, in both directions.

Let us consider the mentioned languages in terms of Algorithm 1, with L1 as Hindi and L2 as Telugu, and with Chemistry and Artificial Intelligence as the domains. Each step of the algorithm can then be interpreted as follows.

Step 1. The training corpus is the general data mentioned in Table 1.
Step 2. We train two models using the training corpus from the above step, one from Hindi to Telugu and the other from Telugu to Hindi. These models can be treated as base models.
Step 3. This step finds the OOD words and can be done as follows (in Algorithm 1 this step is explained in detail at the end).
Step 3.1. Get the unique words of the general corpus, say Gen-Unique, for both languages.
Step 3.2. Get the unique words of the Chemistry corpus, Chem-Unique, for both languages.
Step 3.3. Get the unique words of the AI corpus, AI-Unique, for both languages.
Step 3.4. Now take each word from Chem-Unique and look it up in Gen-Unique; if it is not found, it is considered a Chemistry OOD word. We thus get OOD words for Hindi and for Telugu with respect to Chemistry.
Step 3.5. Take each word from AI-Unique and look it up in Gen-Unique; if it is not found, it is considered an AI OOD word. We thus get OOD words for Hindi and for Telugu with respect to AI.
Step 4. Take the monolingual data for both languages mentioned in Table 2.
Step 5. Extract sentences from the Hindi monolingual data where Hindi OOD words w.r.t. Chemistry are present [Chem-Mono-Hindi].
Step 5.1. Extract sentences from the Telugu monolingual data where Telugu OOD words w.r.t. Chemistry are present [Chem-Mono-Telugu].
Step 5.2. Extract sentences from the Hindi monolingual data where Hindi OOD words w.r.t. AI are present [AI-Mono-Hindi].
Step 5.3. Extract sentences from the Telugu monolingual data where Telugu OOD words w.r.t. AI are present [AI-Mono-Telugu].
Step 6. Run the Hindi -> Telugu model from step 2 on Chem-Mono-Hindi to get back-translated data [BT-Chem-Hindi-Telugu].
Step 6.1. Run the Telugu -> Hindi model from step 2 on Chem-Mono-Telugu to get back-translated data [BT-Chem-Telugu-Hindi].
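The sentence selection and back-translation steps above (steps 5 through 6.1) could be wired together roughly as sketched below. The translate argument is a hypothetical stand-in for decoding with one of the trained base models from step 2, and the file and variable names are ours, not the paper's.

```python
def select_sentences(mono_path, ood_words, limit=None):
    """Keep monolingual sentences containing at least one OOD word (step 5)."""
    selected = []
    with open(mono_path, encoding="utf-8") as f:
        for line in f:
            if ood_words.intersection(line.split()):
                selected.append(line.strip())
                if limit is not None and len(selected) >= limit:
                    break
    return selected

def back_translate(sentences, translate):
    """Pair each selected sentence with its machine translation (step 6)."""
    return [(sentence, translate(sentence)) for sentence in sentences]

# Example wiring for the Chemistry domain (hi_to_te_model is a placeholder
# for the Hindi -> Telugu base model from step 2):
# chem_mono_hi = select_sentences("mono.hi", chem_ood_hi)
# bt_chem_hi_te = back_translate(chem_mono_hi, translate=hi_to_te_model)
```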