Proceedings of the Conference on Language & Technology 2009

English to Urdu Transliteration System

Abbas Raza Ali and Madiha Ijaz
Center for Research in Urdu Language Processing,
National University of Computer and Emerging Sciences, Lahore, Pakistan
{abbas.raza, madiha.ijaz}@nu.edu.pk
                                                                    
Abstract

Urdu language processing applications frequently encounter non-Urdu text, specifically English text. The accuracy of these systems, e.g., machine translation and text-to-speech, is severely undermined because they are unable to handle English text. One possibility is to add multilingual language processing capabilities to Urdu language processing applications so that they can handle English text along with Urdu, but this approach is quite taxing. Another approach is to transliterate the English text into Urdu automatically and then pass it on to the Urdu language processing applications.

This paper describes an English to Urdu transliteration system. First, the mapping rules used to generate Urdu text from English transcription are discussed; then the syllabification, manual transliteration and Urduization phases are described; and finally the issues related to Out-Of-Vocabulary (OOV) words are discussed.

1. Introduction

Transliteration is a method of transcribing words from one script into phonetically equivalent words in another script. Transliteration rules map the letters of the source script alphabet to the letters of the target script alphabet based on phonetic similarity. This process is very successful for transliterating names of people, places, companies, etc., because translation dictionaries can never be comprehensive and are ineffective for translating proper nouns [1].

Websites, user interfaces, etc. contain a lot of English text along with Urdu text. The text is read by applications such as screen readers and web page readers and is passed to an Urdu language processing application, e.g., machine translation for translation or a text-to-speech (TTS) system for speech generation. An Urdu TTS or machine translation system that is unable to handle English text discards it, and as a result the generated speech or translation lacks coherence.

An English to Urdu transliteration system is being developed to eradicate this discrepancy, as shown in Table 1.

Table 1: Effect of transliteration on Urdu TTS system

                             Urdu Text
  With transliteration       [Urdu text lost in extraction]
  Without transliteration    [Urdu text lost in extraction]

  [Of the example text, only the English words "Alex" and "Nokia" survive.]

To develop the English to Urdu transliteration system, a rule-based approach employing transliteration from English orthography to Urdu orthography was explored first, but it was soon realized that it would not work well because there is no one-to-one mapping between English orthography and its corresponding sound; e.g., the /ʃ/ sound is represented using six different letter combinations, i.e., motion /mo.ʃən/, ocean /o.ʃən/, sure /ʃʊ.r/, she /ʃi/, admission /æd.mɪ.ʃən/, machine /mə.ʃin/, etc. [2]. So pronunciation-based transliteration was chosen, as it produced better results.

An Arpabet-based English pronunciation lexicon is used to acquire the pronunciation of English words. English text is converted to Urdu using the English pronunciation and mapping rules. The English pronunciation lexicon is based on the American accent; hence the transliteration into Urdu also reflects the American accent. Frequently used English words are transliterated manually, and some rules are applied for Urduization of the transliterated text in order to make it appropriate and as close as possible to the local
accent, i.e., the accent used in Pakistan when speaking English.

The Out-Of-Vocabulary problem is resolved using statistical techniques by first aligning English orthography to pronunciation sequences. The optimal pronunciation of an unknown word is computed by picking the most probable pronunciation, which is then passed through the same transliteration process.

The architecture of the English to Urdu transliteration system is shown in Figure 1.

[Figure 1: flowchart. Recoverable box labels: English Text; Converted to English; OOV; Load transliteration and language model; Computing Optimal Pronunciation Sequence; Syllabification; Urduization; Urdu Script.]

Figure 1: Architecture of English-to-Urdu transliteration

Some sounds have multiple realizations in Urdu orthography, e.g., /s/ can be realized as ث, س or ص; in such cases only the most commonly used letter is chosen, which is س in this case. Similarly, /z/ can be realized as ذ, ز, ض or ظ, and /t/ can be realized as either ت or ط.

Vowels in Urdu are represented using the diacritics zair, zabar and pesh and the four letters alif, wao, choti yeh and bari yeh. Diacritics combined with consonants form short vowels, while diacritics combined with alif, wao, choti yeh and bari yeh form long vowels [3]. The same vowel is represented differently in orthography depending on whether it occurs word-initially, medially or finally.

Short vowels occurring word-initially use alif as a placeholder, e.g., the initial vowel of "urban" is written as alif carrying the diacritic [Urdu form and phonetic transcription lost in extraction]. When short vowels occur word-medially they are represented by the diacritic alone, e.g., in "justly" [Urdu form and phonetic transcription lost in extraction].

Short vowels occurring word-finally are transformed into their corresponding long vowel, i.e., zabar is converted to alif, e.g., the final vowel of "Andorra" is written with alif [Urdu form and phonetic transcription lost in extraction]; similarly, zair is converted to choti yeh and pesh is converted to wao.
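The positional treatment of short vowels and the single-letter choice for /s/ described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and constant names are hypothetical, and the Unicode code points used for the diacritics and letters are an assumption about how the rules map onto standard Arabic-script characters.

```python
# Hypothetical sketch of the positional short-vowel rules; all names are
# illustrative and the code points are assumed, not taken from the paper.

ZABAR = "\u064E"      # ARABIC FATHA
ZAIR = "\u0650"       # ARABIC KASRA
PESH = "\u064F"       # ARABIC DAMMA
ALIF = "\u0627"       # ARABIC LETTER ALEF
WAO = "\u0648"        # ARABIC LETTER WAW
CHOTI_YEH = "\u06CC"  # ARABIC LETTER FARSI YEH

# Word-final short vowels are replaced by the corresponding long-vowel letter.
FINAL_LONG_FORM = {ZABAR: ALIF, ZAIR: CHOTI_YEH, PESH: WAO}

# Candidate letters for a sound with several realizations, plus the single
# letter the text says is chosen. Only the choice for /s/ is stated there.
S_CANDIDATES = ["\u062B", "\u0633", "\u0635"]  # se, seen, suad
PREFERRED_LETTER = {"s": "\u0633"}


def render_short_vowel(diacritic: str, position: str) -> str:
    """Return the characters that write a short vowel at a word position.

    position is one of "initial", "medial" or "final".
    """
    if diacritic not in FINAL_LONG_FORM:
        raise ValueError("unknown diacritic")
    if position == "initial":
        # Alif acts as a placeholder that carries the diacritic.
        return ALIF + diacritic
    if position == "medial":
        # The diacritic alone, attached to the preceding consonant.
        return diacritic
    if position == "final":
        return FINAL_LONG_FORM[diacritic]
    raise ValueError("position must be 'initial', 'medial' or 'final'")


def preferred_letter(sound: str) -> str:
    """Return the single letter used for a sound with several realizations."""
    return PREFERRED_LETTER[sound]


if __name__ == "__main__":
    for pos in ("initial", "medial", "final"):
        chars = render_short_vowel(ZABAR, pos)
        print(pos, [f"U+{ord(c):04X}" for c in chars])
```

Keeping the final-position mapping in a dictionary means each of the three rules (zabar to alif, zair to choti yeh, pesh to wao) is a single entry, so a different convention can be swapped in without touching the function.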
2. English to Urdu mapping

The CMU Pronouncing Dictionary (v 0.7a) is used to acquire the pronunciation of English words. The dictionary comprises 125,000 English words and their corresponding transcriptions in Arpabet. The pronunciations provided are based on the American accent [11].

Hence, in most cases there is no one-to-one correspondence between English and Urdu vowels, and an English vowel is transliterated using multiple Urdu characters depending on whether it occurs word-initially, medially or finally, as shown in Table 2.

Table 2: English vowels mapping to Urdu orthography

Urdu
                                                                                                                                                                                                                                                                                                                 Arpabet                                            IPA                               Initial                             Middle                                   Final
                                                                                The phonemic inventory of English comprises of                                                                                                                                                                                                                                                                                                                                            َ                                    َ
                                                                   24 consonants and 15 vowels. The phonemic inventory                                                                                                                                                                                          AA                                                                                                            6                                   7                                  7
                                                                   of  Urdu  comprises  of  37  consonants  and  16  vowels                                                                                                                                                                                                                                                                                                       َ                                       َ
                                                                   (Appendix  B).  English  consonants  can  be  easily                                                                                                                                                                                         AE                                      Æ                                                               8                                   87                                     ے
                                                                   mapped to Urdu consonants and there is one-to-one                                                                                                                                                                                                                                                                                                          :                                           َ                             :      َ
                                                                                                                                                                                                                                                                                                                AY                                      A
                                                                   6                          7;7                                       7
                                                                   correspondence between them in all cases. There are                                                                                                                                                                                                                                                                                                    .                                        ِ                                .
                                                                   some sounds in English e.g. dental fricatives, /Θ/ and                                                                                                                                                                                                                                                                                                      ِ                                          َ                                    َ
                                                                                                                                                                                                                                                                                                                AW                                      A                                                                ؤ6                                  ؤ7                                 ؤ7
                                                                   /Ð/ which are non-existent in Urdu and hence they are                                                                                                                                                                                                                                                                                                          َ                                       َ                                    َ
                                                                   mapped to their closest counterpart i.e. dental stops /t /                                                                                                                                                                                AO                                                                                                        3                                   37                                   37
                                                                   and /d/  respectively.                                                                                                                                                                                                                      OY                                      
                                                           7;6                                  7;3                                  =
                                                                                                                                                                                                                                                                                                                                                                                                                              ِ   َ                                  ِ    َ
                                                                                                                                                                                                                                                                                                                EH                                      "                                                               8                                   87                                     ے
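The dictionary lookup and the dental-fricative fallback described in Section 2 can be illustrated with a minimal Python sketch. It assumes the plain-text layout of the CMU dictionary (a word followed by space-separated Arpabet phonemes, with stress digits attached to vowels); the function names and the sample line are hypothetical, and only the two fricatives named in the text are remapped, since the full consonant table (Appendix B) is not reproduced here.

```python
import re

# Closest counterparts for the two English sounds that Urdu lacks (Section 2).
# Every other phoneme is passed through unchanged in this sketch.
FRICATIVE_FALLBACK = {"TH": "t̪", "DH": "d̪"}


def parse_cmudict_line(line):
    """Split one dictionary line into (word, phonemes) and drop stress digits."""
    word, *phonemes = line.split()
    return word, [re.sub(r"\d", "", p) for p in phonemes]


def replace_dental_fricatives(phonemes):
    """Map TH and DH to dental stops; leave all other phonemes as they are."""
    return [FRICATIVE_FALLBACK.get(p, p) for p in phonemes]


# Illustrative entry written in the dictionary's format (assumed sample line).
word, phonemes = parse_cmudict_line("THINK  TH IH1 NG K")
print(word, replace_dental_fricatives(phonemes))
```

Keeping the fallback in a small lookup table makes it easy to extend if further English sounds without an Urdu counterpart are identified.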
Table 2 (continued): English vowels mapping to Urdu orthography

                          Urdu
Arpabet   IPA    Initial   Middle   Final
                 َ   َ   َ
ER        #      -         -7       -7
EY        Eɪ     8         8        ے
IH        ɪ                7        7
                 ِ   ِ   ِ
IY        I      8         87       87
                 ِ   ِ   ِ
OW        O      3         3        3
                 ُ   ُ   ُ
UH                         7        7
                 َ   َ   َ
AH                         7        7

3.  Syllabification

English-to-Urdu transliteration using the CMU Pronouncing Dictionary, which is based on the American accent, generates many inconsistencies. To improve the system's accuracy, Urdu syllabification is applied to the English transcription, as shown in Table 3.

Consonants and vowels combine to form syllables, and breaking a word up into syllables is known as syllabification. The sonority sequencing principle is commonly used for syllabification in Urdu. It requires the onset to rise in sonority towards the nucleus and the coda to fall in sonority from the nucleus [9].

3.1 Algorithm

A template matching technique is used to syllabify the English transcription. In this technique, syllabification is done by matching a template of the form $C_{0,1}VC_{n}$ [4]. Urdu allows only one consonant in the onset position, while multiple consonants can occur in the coda position of the syllable.

1. Convert the entire phonemic transcription of the word to consonant-vowel pairs
2. Start from the end of the word, traverse backwards to find the next vowel
3. repeat
4.     if there is a consonant preceding it
5.         mark a syllable boundary before the consonant
6.     else
7.         mark the syllable boundary before this vowel
8.     end if
9. until the entire string is consumed

Table 3 entries:

Associate    .so.i. t       َ َ َ ?@A ?BCA َ
Oblivious    b.lɪ.vi. s     â َ َ 83.G DE ِ َ ِ , $
Obedient     o.bi.di. nt    8ڈ I3 َ ِ . 3 ,ِ ?54 َ , ?J

3.2.  Special case

After syllabification, a problem of local accent remains, because the transliteration is based on the American accent. To bring the transliteration closer to the Urdu accent, some rules are applied to the syllabified transliterated text.

3.2.1. Consonant Cluster. Urdu syllabification does not allow a consonant cluster in the onset of a syllable or in the word-medial position. In this case, add /ɪ/ if the second consonant is ‘r’ or ‘l’, otherwise //, between the two consonants and mark a syllable boundary after it, as shown in Table 4.

Table 4:  Examples of consonant cluster problem

English       IPA        Urdu              IPA
Treehopper    tIh p#     َ KL ِ َِ           tɪ.I.h .p#
Bless         bl"s       َ M               bɪ.l"s
Quickly       kwɪkli     َ,ِ " . N 0 ِ ِ      k.wɪk.li

3.2.2.  Urduization.  If two consonants occur in the onset of the syllable in the word-initial position and the first consonant is ‘s’, then add // before ‘s’, as shown in Table 5.

Table 5:  Examples of Urduization applied on transliterated Urdu text

English       IPA        Urdu              IPA
School        skul       O ُP              s.kul
                                                                                                                                                                                                                                                                                                                                                                                                                                            ِ
                                                                                      Table 3:  Transliteration after applying                                                                                                                                                                                        Skill                                  sk
l                                                       QR
                        
s.k
l
                                                                                                                                           syllabification                                                                                                                                                                                                                                                                    َ      َِ      ِ
                                                                                                                                                                                                                                                                                                                      Special                                sp"l                                           QS
                             
s.p".l
                                                                            English                                               IPA                                                                           Urdu                                                                                                                                                                                                                M       ِ
                                                                                                                                                                          Unsyllabified                                             Syllabified                                                                                                                                                                   
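
The backward-scanning syllabification steps and the word-initial 's' rule can be sketched together as follows. This is a minimal illustration, not the authors' implementation: it assumes one character per phoneme, uses a small hypothetical vowel inventory (`VOWELS`), and takes the epenthetic vowel as a parameter that defaults to /ɪ/ as an assumption. The function names are illustrative.

```python
# Assumed, illustrative vowel inventory; the paper does not list one here.
VOWELS = set("aeiouɪʊəɛɔæɑʌ")


def add_initial_vowel(phonemes: str, vowel: str = "ɪ") -> str:
    """Prefix a vowel when the word starts with 's' followed by a consonant."""
    if len(phonemes) > 1 and phonemes[0] == "s" and phonemes[1] not in VOWELS:
        return vowel + phonemes
    return phonemes


def syllabify(phonemes: str) -> str:
    """Insert '.' syllable boundaries by scanning from the end of the word."""
    cuts = []
    i = len(phonemes) - 1
    while i >= 0:
        if phonemes[i] in VOWELS:
            if i > 0 and phonemes[i - 1] not in VOWELS:
                cut = i - 1  # boundary before the preceding consonant
            else:
                cut = i      # boundary before the vowel itself
            if cut > 0:      # no boundary marker at the very start of the word
                cuts.append(cut)
            i = cut - 1      # continue scanning to the left of the boundary
        else:
            i -= 1
    # cuts were collected right to left, so inserting in this order
    # leaves the remaining (smaller) indices valid.
    for cut in cuts:
        phonemes = phonemes[:cut] + "." + phonemes[cut:]
    return phonemes


print(syllabify("skul"))
print(syllabify(add_initial_vowel("skul")))
```

Applying the 's' rule before or after syllabification gives the same boundary placement for these inputs; the sketch applies it first so that the added vowel joins the first syllable.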
                                                                                                                                                                                                                                                                                                                                                                                                                  
4.  Out-Of-Vocabulary problem

Out-of-vocabulary (OOV) words are a very common problem in systems such as text-to-speech, machine translation and cross-language information retrieval (CLIR). To resolve this problem, the alignment between English phonemes and orthography has to be found probabilistically to obtain a one-to-one mapping between them, as shown in table 6. The aligned sequences are then used for training to obtain the most probable pronunciation of an unknown word.

Table 6: English orthography to pronunciation alignment

English          Percentages
Pronunciation    p . er . s . eh . n . t . ih . jh . ah . z
Alignment        p(p) . er(er) . c(s) . e(eh) . n(n) . t(t) . a(ih) . g(jh) . e(ah) . s(z)

The entire procedure consists of two steps:
·  English orthography to pronunciation alignment.
·  Computing optimal pronunciation sequence.

After the pronunciation of the unknown text has been obtained, it is passed through the same procedure as other words, i.e. syllabification followed by Urdu transliteration. The architecture of the OOV module is shown in figure 2.

[Figure 2 boxes, in order: CMU pronunciation dictionary; English orthography to phoneme alignment; Pronunciation parameter estimation; Pronunciation parameter optimization; Bigram transliteration model and trigram language model probabilities]

Figure 2: Architecture of English-to-Urdu transliteration

4.1.  Orthography to pronunciation alignment

In this step all the valid combinations of English orthography $E_n$ to its pronunciation sequence $P_r$ are produced using conditional probability:

$$\hat{P}_r = \arg\max_{P_r} p(P_r \mid E_n) = \arg\max_{P_r} p(E_n \mid P_r)\, p(P_r) \qquad (1)$$

The trigram language model $p(P_{r_{i-1}} . P_{r_i} . P_{r_{i+1}})$ and the bigram transliteration model $p(E_{n_i} . P_{r_i})$ are combined to find the most probable pronunciation $\hat{P}_r$.

4.2.  Computing optimal pronunciation sequence

The expectation maximization algorithm is used to compute the optimal alignment sequence. The algorithm is given below:

Initialization
For each English phoneme to orthography pair, assign equal weights to all possibilities generated from (1).
repeat
    Expectation-Step
    For each of the Arpabet phonemes, count up instances of its different mappings from the observations on all combinations produced in (1). Normalize the scores so that the mapping probabilities sum to 1.
    Maximization-Step
    Recalculate the combination scores. Each combination is scored with the product of the scores of the symbol mappings it contains. Normalize the scores so that the combination scores for each word sum to 1.
until convergence

5.  Results

The transliteration process becomes more accurate after applying syllabification to the pronunciation and finding probabilistic pronunciation sequences for out-of-vocabulary words, as shown in figure 3.

[Figure 3 boxes, in order: English-to-Urdu Mappings; (to achieve more accuracy) Syllabification; (to achieve more accuracy) Urduization Rules; (to enhance overall capability) Out-Of-Vocabulary]

Figure 3: Modules of the system that lead it towards maturity
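
The expectation maximization procedure of section 4.2 can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: each phoneme is assumed to align with one or two consecutive letters (`max_len`), the three training pairs are hypothetical toy data standing in for the CMU pronunciation dictionary, and the names `segmentations` and `em_align` are invented for the example.

```python
from collections import defaultdict
from itertools import combinations
from math import prod


def segmentations(letters, n_parts, max_len=2):
    """Yield every split of `letters` into n_parts non-empty chunks."""
    for cuts in combinations(range(1, len(letters)), n_parts - 1):
        bounds = (0, *cuts, len(letters))
        chunks = [letters[a:b] for a, b in zip(bounds, bounds[1:])]
        if all(len(c) <= max_len for c in chunks):
            yield chunks


def em_align(pairs, iterations=10, max_len=2):
    """Estimate phoneme-to-letter mapping probabilities and best alignments."""
    # All candidate combinations per word; words with none are skipped.
    candidates = [
        [list(zip(phones, chunks))
         for chunks in segmentations(word, len(phones), max_len)]
        for word, phones in pairs
    ]
    candidates = [c for c in candidates if c]
    # Initialization: equal weight for every combination of a word.
    weights = [[1.0 / len(c)] * len(c) for c in candidates]
    probs = {}
    for _ in range(iterations):
        # Count each phoneme's mappings, weighted by combination weight.
        counts = defaultdict(lambda: defaultdict(float))
        for cands, ws in zip(candidates, weights):
            for alignment, w in zip(cands, ws):
                for phone, chunk in alignment:
                    counts[phone][chunk] += w
        # Normalize so that each phoneme's mapping probabilities sum to 1.
        probs = {
            phone: {ch: c / sum(m.values()) for ch, c in m.items()}
            for phone, m in counts.items()
        }
        # Rescore every combination as the product of its mapping scores,
        # then normalize the scores within each word.
        weights = []
        for cands in candidates:
            scores = [prod(probs[p][ch] for p, ch in al) for al in cands]
            total = sum(scores)
            weights.append([s / total for s in scores])
    best = [max(zip(ws, cands), key=lambda t: t[0])[1]
            for cands, ws in zip(candidates, weights)]
    return probs, best


# Hypothetical toy data: (spelling, Arpabet-style phonemes).
pairs = [("cat", ["k", "ae", "t"]),
         ("chat", ["ch", "ae", "t"]),
         ("at", ["ae", "t"])]
probs, best = em_align(pairs)
print(best[1])
```

In this toy run the unambiguous words "cat" and "at" support the mappings ae→a and t→t, which pulls the ambiguous word "chat" towards the split ch | a | t.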
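
The combination of the two models in equation (1) can be illustrated by scoring candidate pronunciations in log space and keeping the best one. The sketch below is an assumption-laden example: the probability tables are hypothetical numbers, the trigram table is keyed by three consecutive phonemes with boundary padding (the paper does not spell out the exact conditioning), unseen events get a small floor probability, and all function names are illustrative.

```python
import math


def channel_logprob(alignment, channel, floor=1e-6):
    """log p(E | P): sum over aligned (letter chunk, phoneme) pairs."""
    return sum(math.log(channel.get((chunk, phone), floor))
               for chunk, phone in alignment)


def trigram_logprob(phones, lm, floor=1e-6):
    """log p(P): sum over windows of three consecutive phonemes."""
    padded = ["<s>", *phones, "</s>"]
    return sum(math.log(lm.get(tuple(padded[i - 1:i + 2]), floor))
               for i in range(1, len(padded) - 1))


def best_pronunciation(candidates, channel, lm):
    """Return the candidate alignment maximizing p(E | P) * p(P)."""
    def score(alignment):
        phones = [phone for _, phone in alignment]
        return channel_logprob(alignment, channel) + trigram_logprob(phones, lm)
    return max(candidates, key=score)


# Hypothetical tables for the spelling "cat".
channel = {("c", "k"): 0.7, ("c", "s"): 0.3, ("a", "ae"): 0.9, ("t", "t"): 0.95}
lm = {("<s>", "k", "ae"): 0.2, ("<s>", "s", "ae"): 0.05,
      ("k", "ae", "t"): 0.3, ("s", "ae", "t"): 0.3, ("ae", "t", "</s>"): 0.4}
candidates = [
    [("c", "k"), ("a", "ae"), ("t", "t")],
    [("c", "s"), ("a", "ae"), ("t", "t")],
]
print(best_pronunciation(candidates, channel, lm))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied; the ranking of candidates is unchanged.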