Proceedings of the Conference on Language & Technology 2009

English to Urdu Transliteration System

Abbas Raza Ali and Madiha Ijaz
Center for Research in Urdu Language Processing,
National University of Computer and Emerging Sciences, Lahore, Pakistan
{abbas.raza, madiha.ijaz}@nu.edu.pk

Abstract

Urdu language processing applications frequently encounter non-Urdu text, specifically English text. The accuracy of these systems, e.g. machine translation and text-to-speech, is highly undermined because they are unable to handle English text. One possibility is to add multilingual language processing capabilities to Urdu language processing applications so that they can handle English text along with Urdu, but this approach is quite taxing. Another approach is to transliterate English text into Urdu automatically and then pass it on to the Urdu language processing applications.

This paper describes an English to Urdu transliteration system. First the mapping rules used to generate Urdu text from English transcription are discussed, then the syllabification, manual transliteration and Urduization phases are described, and finally the issues related to Out-Of-Vocabulary (OOV) words are discussed.

1. Introduction

Transliteration is a method of transcribing words from one script into phonetically equivalent words in another script. Transliteration rules map the letters of the source script alphabet to the letters of the target script alphabet based on phonetic similarity. This process is very successful for transliterating names of people, places, companies, etc., because translation dictionaries can never be comprehensive and are ineffective for translating proper nouns [1].

Websites, user interfaces etc. contain a lot of English text along with Urdu text. The text is read by applications such as screen readers and web page readers and is passed to an Urdu language processing application, e.g. a machine translation system for translation or a text-to-speech system for speech generation. An Urdu TTS or machine translation system that is unable to handle English text discards it, and as a result the generated speech or translation lacks coherence. The English to Urdu transliteration system is being developed to eradicate this discrepancy, as shown in Table 1.

Table 1: Effect of transliteration on Urdu TTS system
[Rows: Urdu Text (a sentence containing the English words "Alex" and "Nokia"), With Transliteration, Without Transliteration. The Urdu-script cells are illegible in this copy.]

In order to develop the English to Urdu transliteration system, a rule-based approach that transliterates directly from English orthography to Urdu orthography was explored first. It was soon realized that this would not work well, because there is no one-to-one mapping between English orthography and its corresponding sounds. For example, the /ʃ/ sound is represented using six different letter combinations, as in motion, ocean, sure, she, admission and machine [2]. Pronunciation-based transliteration was therefore chosen, as it produced better results.

An Arpabet-based English pronunciation lexicon is used for acquiring the pronunciation of English words. English text is converted to Urdu using the English pronunciation and mapping rules. The English pronunciation lexicon is based on the American accent, hence the transliteration into Urdu also reflects the American accent. Frequently used English words are transliterated manually, and some rules are applied for Urduization of the transliterated text in order to make it appropriate and as close as possible to the local accent, i.e. the accent used in Pakistan when speaking English.

The Out-Of-Vocabulary problem is resolved using statistical techniques, by first aligning English orthography to pronunciation sequences. The optimal pronunciation of an unknown word is computed by picking the most probable pronunciation, which is then passed through the same transliteration process. The architecture of the English to Urdu transliteration system is shown in figure 1.

[Flowchart; recoverable box labels: English Text; OOV; Load transliteration and language model; Computing Optimal Pronunciation Sequence; applying Syllabification; applying Urduization; Converted Urdu Script.]
Figure 1: Architecture of English-to-Urdu transliteration

2. English to Urdu mapping

The CMU pronouncing dictionary (v 0.7a) is used to acquire the pronunciation of English words. The dictionary comprises 125,000 English words and their corresponding transcriptions in Arpabet. The pronunciation provided is based on the American accent [11].

The phonemic inventory of English comprises 24 consonants and 15 vowels. The phonemic inventory of Urdu comprises 37 consonants and 16 vowels (Appendix B). English consonants can be easily mapped to Urdu consonants, with a one-to-one correspondence between them. Some English sounds, e.g. the dental fricatives /θ/ and /ð/, do not exist in Urdu and are therefore mapped to their closest counterparts, i.e. the dental stops /t̪/ and /d̪/ respectively.

Some sounds have multiple realizations in Urdu orthography, e.g. /s/ can be realized as ث, س or ص. In such cases only the most commonly used letter is chosen, which is س in this case. The same applies to /z/, which can be realized as ذ, ز, ض or ظ, and to /t/, which can be realized as either ت or ط.

Vowels in Urdu are represented using the diacritics zair, zabar and pesh and the four letters alif, wao, choti yeh and bari yeh. Diacritics combined with consonants form short vowels, while diacritics combined with alif, wao, choti yeh and bari yeh form long vowels [3]. The same vowel is represented differently in orthography depending on whether it occurs word-initially, medially or finally.

Short vowels occurring word-initially use alif as a placeholder, as in the transliteration of "urban". When they occur word-medially they are represented only by the diacritics, as in the transliteration of "justly".

Short vowels occurring word-finally are transformed into their corresponding long vowels: zabar is converted to alif, as in the transliteration of "Andorra", and similarly zair is converted to choti yeh and pesh is converted to wao.

Hence in most cases there is no one-to-one correspondence between English and Urdu vowels, and an English vowel is transliterated using multiple Urdu characters depending on whether it occurs word-initially, medially or finally, as shown in table 2.

Table 2: English vowels mapping to Urdu orthography
[Columns: Arpabet, IPA, and the Urdu spelling in Initial, Middle and Final position. Rows: AA, AE, AY, AW, AO, OY, EH, ER, EY, IH, IY, OW, UH, AH. The IPA and Urdu-script cells are illegible in this copy.]

3. Syllabification

English-to-Urdu transliteration using the CMU pronunciation dictionary, which is based on the American accent, generates a lot of inconsistency. To improve the system's accuracy, Urdu syllabification is applied to the English transcription, as shown in table 3.

Consonants and vowels combine to make syllables, and breaking up a word into syllables is known as syllabification. The sonority sequencing principle is commonly used for syllabification in Urdu. It requires the onset to rise in sonority towards the nucleus and the coda to fall in sonority from the nucleus [9].

3.1. Algorithm

A template matching technique is used to syllabify the English transcription. In this technique syllabification is done by matching a template of the form [4]

$$C_{0,1}\,V\,C_{n}$$

Urdu allows only one consonant in the onset position, while multiple consonants can occur in the coda position of the syllable.

1. Convert the entire phonemic transcription of the word to consonant-vowel pairs
2. Start from the end of the word, traverse backwards to find the next vowel
3. repeat
4.     if there is a consonant preceding it
5.         mark a syllable boundary before the consonant
6.     else
7.         mark the syllable boundary before this vowel
8.     end if
9. until the entire string is consumed

Table 3: Transliteration after applying syllabification
[Columns: English, IPA, Urdu (Unsyllabified, Syllabified). Rows: Associate, Oblivious, Obedient. The IPA and Urdu-script cells are illegible in this copy.]

3.2. Special case

After syllabification the problem of local accent remains, because the transliteration is based on the American accent. To bring the transliteration closer to the Urdu accent, some rules are applied to the syllabified transliterated text.

3.2.1. Consonant Cluster. Urdu syllabification does not allow a consonant cluster in the onset of the syllable or in the word-medial position. In this case a short vowel is inserted between the two consonants and a syllable boundary is marked after it, as shown in table 4. One vowel is used if the second consonant is 'r' or 'l' and a different vowel otherwise (both vowel symbols are illegible in this copy).

Table 4: Examples of consonant cluster problem
[Columns: English, IPA, Urdu, IPA. Rows: Treehopper, Bless, Quickly. The IPA and Urdu-script cells are illegible in this copy.]

3.2.2. Urduization. If two consonants occur in the onset of the syllable in the word-initial position and the first consonant is 's', then a vowel (symbol illegible in this copy) is added before the 's', as shown in table 5.

Table 5: Examples of Urduization applied on transliterated Urdu text
[Columns: English, IPA, Urdu, IPA. Rows: School, Skill, Special. The IPA and Urdu-script cells are illegible in this copy.]

4. Out-Of-Vocabulary problem

Out-Of-Vocabulary is a very common problem in various systems such as text-to-speech, machine translation, cross language information retrieval (CLIR), etc. To resolve this problem, the alignment between English phonemes and orthography has to be found probabilistically to get a one-to-one mapping between them, as shown in table 6. The aligned sequences are then used for training, to get the most probable pronunciation for an unknown word.

Table 6: English orthography to pronunciation alignment
English          Percentages
Pronunciation    p . er . s . eh . n . t . ih . jh . ah . z
Alignment        p(p) . er(er) . c(s) . e(eh) . n(n) . t(t) . a(ih) . g(jh) . e(ah) . s(z)

The entire procedure consists of two steps:
· English orthography to pronunciation alignment.
· Computing the optimal pronunciation sequence.

After the pronunciation of the unknown text is obtained, it is passed through the same procedure, i.e. syllabification and then Urdu transliteration. The architecture of the OOV module is shown in figure 2.

[Flowchart; recoverable box labels: CMU pronunciation dictionary; English orthography to phoneme alignment; Pronunciation parameter estimation; Pronunciation parameter optimization; Bigram transliteration model and trigram language model probabilities.]
Figure 2: Architecture of English-to-Urdu transliteration

4.1. Orthography to pronunciation alignment

In this step all the valid combinations of English orthography $E_n$ with its pronunciation sequence $P_r$ are produced using conditional probability:

$$\hat{P}_r = \arg\max_{P_r} p(P_r \mid E_n) = \arg\max_{P_r} p(E_n \mid P_r)\, p(P_r) \tag{1}$$

The second equality follows from Bayes' rule, since the denominator $p(E_n)$ does not depend on $P_r$ and can be dropped from the maximization. The trigram language model $p(P_{r_{i-1}}\,.\,P_{r_i}\,.\,P_{r_{i+1}})$ and the bigram transliteration model $p(E_{n_i}\,.\,P_{r_i})$ are combined to maximize the probability of the pronunciation $P_r$.

4.2. Computing optimal pronunciation sequence

The expectation maximization algorithm is used to compute the optimal alignment sequence. The algorithm is given below.

Initialization
    For each English phoneme to orthography pair, assign equal weights to all possibilities generated from (1).
repeat
    Expectation-Step
        For each of the Arpabet phonemes, count up instances of its different mappings from the observations on all combinations produced in (1). Normalize the scores so that the mapping probabilities sum to 1.
    Maximization-Step
        Recalculate the combination scores. Each combination is scored with the product of the scores of the symbol mappings it contains. Normalize the scores so that the mapping probabilities sum to 1.
until convergence

5. Results

The transliteration process becomes more accurate after applying syllabification to the pronunciation and finding probabilistic sequences for the Out-Of-Vocabulary word problem, as shown in figure 3.

[Flowchart; recoverable labels, in order: English-to-Urdu Mappings; "To achieve more accuracy"; Syllabification; "To achieve more accuracy"; Urduization Rules; "To enhance overall capability"; Out-Of-Vocabulary.]
Figure 3: Modules of the system that lead it towards maturity
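As a rough illustration of the backward traversal described in Section 3.1, the following Python sketch syllabifies a list of Arpabet phonemes so that each syllable has at most one onset consonant. The names `ARPABET_VOWELS`, `is_vowel` and `syllabify` are hypothetical, and the handling of leftover word-initial consonants (they are attached to the first syllable and left for the consonant cluster rule) is an assumption, not something the paper specifies.

```python
# Arpabet vowel symbols; stress digits (0, 1, 2) are stripped before lookup.
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def is_vowel(phoneme):
    """Return True if the Arpabet symbol is a vowel, ignoring stress digits."""
    return phoneme.rstrip("012") in ARPABET_VOWELS


def syllabify(phonemes):
    """Split Arpabet phonemes into syllables, scanning from right to left."""
    starts = []
    i = len(phonemes) - 1
    while i >= 0:
        if is_vowel(phonemes[i]):
            # Take exactly one preceding consonant as the onset, if present.
            if i > 0 and not is_vowel(phonemes[i - 1]):
                start = i - 1
            else:
                start = i
            starts.append(start)
            i = start - 1
        else:
            # Consonants skipped here end up in the coda of the syllable
            # to their left when the slices are cut below.
            i -= 1
    if not starts:
        return [list(phonemes)] if phonemes else []
    starts.reverse()
    # Assumption: extra word-initial consonants stay with the first syllable.
    starts[0] = 0
    ends = starts[1:] + [len(phonemes)]
    return [list(phonemes[s:e]) for s, e in zip(starts, ends)]


if __name__ == "__main__":
    print(syllabify("OW0 B IY1 D IY0 AH0 N T".split()))
```

For the Arpabet sequence `OW0 B IY1 D IY0 AH0 N T` ("obedient") the sketch returns `[['OW0'], ['B', 'IY1'], ['D', 'IY0'], ['AH0', 'N', 'T']]`, which matches the o.bi.di.ənt pattern that Table 3 appears to show.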
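The two accent rules of Section 3.2 can be sketched in the same style. This is only an illustration under stated assumptions: the function `urduize` and its parameters are hypothetical, the three default vowels are placeholders rather than the paper's values (the vowel symbols are not legible in this copy), and applying the 's' rule before the cluster rule is a guess about precedence. Onsets longer than two consonants are split repeatedly, which goes beyond what the paper describes.

```python
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def is_vowel(phoneme):
    """Return True if the Arpabet symbol is a vowel, ignoring stress digits."""
    return phoneme.rstrip("012") in ARPABET_VOWELS


def onset_length(syllable):
    """Number of consonants before the first vowel (0 if there is no vowel)."""
    for n, phoneme in enumerate(syllable):
        if is_vowel(phoneme):
            return n
    return 0


def urduize(syllables, liquid_vowel="AH0", other_vowel="IH0",
            prothetic_vowel="IH0"):
    """Break up onset clusters in a syllabified Arpabet word.

    The three vowel arguments are placeholder assumptions, not values
    taken from the paper.
    """
    result = []
    for index, syllable in enumerate(syllables):
        rest = list(syllable)
        # Section 3.2.2: word-initial cluster starting with 's' gets a
        # vowel in front of the 's', which then closes a new syllable.
        if index == 0 and onset_length(rest) >= 2 and rest[0] == "S":
            result.append([prothetic_vowel, "S"])
            rest = rest[1:]
        # Section 3.2.1: insert a vowel between two onset consonants and
        # place a syllable boundary after the inserted vowel.
        while onset_length(rest) >= 2:
            vowel = liquid_vowel if rest[1] in ("R", "L") else other_vowel
            result.append([rest[0], vowel])
            rest = rest[1:]
        result.append(rest)
    return result


if __name__ == "__main__":
    print(urduize([["S", "K", "UW1", "L"]]))
    print(urduize([["B", "L", "EH1", "S"]]))
```

With the placeholder vowels, "school" becomes two syllables (vowel + S, then K UW1 L) and "bless" becomes B + vowel followed by L EH1 S, which follows the boundary pattern visible in Tables 4 and 5.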
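The expectation maximization procedure of Section 4.2 can be illustrated on a toy lexicon. The sketch below is not the authors' implementation: `segmentations` and `em_align` are hypothetical names, each phoneme is assumed to align with one or two adjacent letters (as in the `er(er)` entry of Table 6), and a fixed number of iterations stands in for a convergence test. The step labels in the comments follow the paper's wording.

```python
from collections import defaultdict
from math import prod


def segmentations(letters, k, max_chunk=2):
    """All ways to cut `letters` into k contiguous, non-empty chunks."""
    if k == 0:
        return [[]] if not letters else []
    result = []
    for n in range(1, min(max_chunk, len(letters)) + 1):
        for rest in segmentations(letters[n:], k - 1, max_chunk):
            result.append([letters[:n]] + rest)
    return result


def em_align(pairs, iterations=20, max_chunk=2):
    """Estimate phoneme-to-letter-chunk probabilities from (word, phonemes)."""
    data = []
    for word, phones in pairs:
        candidates = segmentations(word, len(phones), max_chunk)
        if candidates:  # words with no valid alignment are skipped
            data.append((phones, candidates))
    # Initialization: equal weight for every candidate alignment of a word.
    weights = [[1.0 / len(c)] * len(c) for _, c in data]
    probs = {}
    for _ in range(iterations):
        # "Expectation-Step": weighted counts of each phoneme's mappings,
        # normalized so the mappings of one phoneme sum to 1.
        counts = defaultdict(lambda: defaultdict(float))
        for (phones, candidates), ws in zip(data, weights):
            for candidate, w in zip(candidates, ws):
                for phone, chunk in zip(phones, candidate):
                    counts[phone][chunk] += w
        probs = {
            phone: {chunk: c / sum(d.values()) for chunk, c in d.items()}
            for phone, d in counts.items()
        }
        # "Maximization-Step": rescore each alignment as the product of
        # its mapping probabilities, normalized per word.
        new_weights = []
        for phones, candidates in data:
            scores = [
                prod(probs[p][ch] for p, ch in zip(phones, cand))
                for cand in candidates
            ]
            total = sum(scores)
            new_weights.append([s / total for s in scores])
        weights = new_weights
    best = [
        candidates[max(range(len(candidates)), key=ws.__getitem__)]
        for (_, candidates), ws in zip(data, weights)
    ]
    return probs, best


if __name__ == "__main__":
    toy = [("cat", ["K", "AE", "T"]), ("at", ["AE", "T"]),
           ("chat", ["CH", "AE", "T"])]
    mapping_probs, alignments = em_align(toy)
    print(alignments[2])
```

On this toy lexicon the unambiguous words "cat" and "at" pull the probability mass towards AE→a and T→t, so the preferred alignment for "chat" becomes ch|a|t. A real system would train on the full pronunciation dictionary and stop on a convergence criterion.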