Arabic Alphabet Pdf 98234

Partial capture of text on file.
                                           Finite State Automata and Arabic Writing 
                                                                        Michel Fanton 
                                                                       CERTAL-INALCO 1 
                                                                           73 rue Broca 
                                                                     F75013 Paris France 
                                                                email :  certal2@ext.jussieu.fr 
                                             Abstract                                          isolated     final   medial  initial 
                      Arabic writing has specific features, which im- 
                      ply computational overload for any arabicized                   or present only small variations :  for example 
                      software. Finite state automata are well known                  letter ~r* (s) 
                      to give efficient solutions for translation  prob-                       isolated     final   medial  initial 
                      lems  which  can  be  formalized  as  regular  lan- 
                      guages. These automata are as more easily built 
                      that their alphabet have been reduced through a                    Letters which cannot be  bound to the next 
                      careful linguistic analysis.  This reduction makes              one have only two shapes, for example letters 
                      it possible to write directly an automaton with- 
                      out going through the intermediate stage of con-                (d) and .~ (w and fi) 
                      textual rules, which have to be translated into                       isolated    final           isolated     final 
                      an  automaton for the sake of efficiency.  This 
                      paper  presents two Moore automata, the first 
                      one, taken as an example, gives a solution to the                  During the seventies and the beginning of the 
                      choice of right shape for a  letter to be printed               eighties,  hard  controversies took  place  within 
                      or displayed (usually known as contextual anal-                 the Arabs concerned with these questions, lin- 
                      ysis), the second one studies the more complex                  guists and computer scientists.  Finally in  1983 
                      problem of determining the right carrying letter                the  ASMO  (Arab  Society  for  Normalization 
                      for hamza. Every arabicized software has to face                which unfortunately does not exist any more), 
                      these questions and  finite state  automata are                 influenced  by  Pr.  Lakhdar-Ghazal  from IERA 
                      certainly a good answer to them.                                (Rabat Morocco) chose to give a unique code to 
                      INTRODUCTION                                                    all  shapes of one particular letter.  This is cer- 
                                                                                      tainly a  good choice from a  linguistic point of 
                      Arabic writing has specific features, which im-                 view, but even so, compromises had to be made 
                      ply computational overload for any arabicized                   to  take  into  account  writing  habits  that  con- 
                      software.  The  first  one,  well  known  now for               flicted with it.  Letter hamza is the most notice- 
                      many years, is the fact that Arabic printing tries              able example of such a compromise for reasons 
                      to imitate handwriting. Because of this, conso-                 we shall explain later. 
                      nants and long vowels can have four or only two                 1    CONTEXTUAL  ANALYSIS 
                      shapes depending of their ability to be bound to 
                      the following letter and of where they appear in                Whatever be the choice made for coding, from 
                      the word.                                                       a  typesetting or a computational point of view, 
                       These shapes can be very different : for example               there must be different codes for the different 
                      letter  o 2 (h)                                                 shapes of a letter.  So every arabicized software 
                                                                                      has to use two systems for coding : the reduced 
                          ICERTAL : Centre d'l~tudes  et  de  Recherche  en           code we have just introduced and the extended 
                      Traitement  Automatique des Langues,  INALCO : In-              code in which the different shapes have different 
                      stitut  National  des Langues et Civilisations  Orientales 
                          ~the  Arabic parts  of this  paper  have  been  typeset     using Klaus Lagally's ArabTEX 
                                                                                26 
                    codes.  Up to UNICODE, no normalization  exists             Mealy automata  are  certainly  a  better  choice 
                    for the second one.  So every arabicized software           when bidirectional  applications  are considered. 
                    has  to solve the  problem of choosing the  right           As the question is to identify succession of sym- 
                    shape of every printed or displayed letter.                 bols of a certain  type we found it clearer to use 
                    1.1    Rules for letter shape                               a  Moore automaton. 
                           determination                                        1.3     A  Moore  automaton  for contextual 
                    This  determination,  frequently  known  as  con-                  analysis 
                    textual analysis can be summarized into the fol-            1.3.1     Source language of the automaton 
                    lowing set of unformal rules:                               It follows from the determination  rules that we 
                      1.  At the beginning of a word:                           only need to know what particular letter we are 
                                                                                dealing  with only at  the output stage.  All  we 
                            •  If the letter is a  binding  letter it takes     have to know is wether it is a  binding or a  non 
                               the INITIAL shape.                               binding letter 4.  The alphabet of the automaton 
                            •  If it  is  a  non  binding  one it  takes the    should  be A  =  (#} [J L  where L  is the set of 
                               ISOLATED shape.                                  arabic letters  present  in  the  reduced  code and 
                                                                                #  the word boundaries.  The set of letters will 
                      2.  In  the  middle  of a  word  (there  is  at  least    then be partitioned  into three sets • 
                          one letter following the current one): 
                           (a)  If the letter is a  binding  letter then                       A--+ A'-  {{#},N,B} 
                                 •  If it follows a binding letter it takes      N  being the set of non binding letters and B  the 
                                   the MEDIAL shape.                            set of binding  letters.  If we denote respectively 
                                 •  If it follows a  non  binding  letter it     n  and  b  an  arbitrary  element  of each  of these 
                                   takes the INITIAL shape.                     sets, the source language of the automaton can 
                          (b)  If the letter is a  non  binding  letter          be reduced to: 
                                 •  If it follows a binding letter it takes                         A1 =  {#, n, b} 
                                   the FINAL shape. 
                                 •  If it  follows a  non  binding  letter it                    L1 =  {#(n Vb)'#} 
                                   takes the ISOLATED shape. 
                      3.  At  the  end  of a  word  (for  both  types  of        where V denotes disjunction and • is the Kleene 
                          letters)                                              star 
                             •  If it follows a binding letter it takes the      1.3.2    Grammar  and automaton  for L1 
                               FINAL shape.                                      Language  L1  can  be generated  by  the  simple 
                                                                                 grammar : 
                            •  If  it  follows  a  non  binding  letter  it 
                               takes the ISOLATED shape.                                           m  -+  #A# 
                    1.2    Moore  and  Mealy  automata                                              A        A( lb) 
                    Moore automata are state assigned output ma-                 or the as simple automaton  : 
                    chines  :  the  output  function  assigns  output            initial  states =  {1} 
                    symbols to each state.  They differ from Mealy               final states:  -- {5} 
                    automata,  transition  assigned finite  state  ma-           transitions 
                    chines,  where  output  symbols  are  associated                4As this question has only been taken as an example, 
                    with  transitions  between  states.         Mealy  au-       the  alphabet  has been  oversimplified.  A  full  working 
                    tomata are sometimes called finite transducers.              automaton should  cope,  as far as arabic is concerned, 
                    The two machine types have been demonstrated                 with  two additional  problems  :  hamza  on the line to 
                    to  produce  the  same input-output  mappings 3.             which no preceding letter can be bound to and l~m alif 
                                                                                 ligature.  It should also give a  proper treatment of non 
                       3see (Aho and Unman,  1972) and (Hopcroft and Ull-        arabic letters and symbols. But this would not affect the 
                    man, 1979) for a full account of these matters               here described method. 
                                                                            27 
                                                                         This  automaton  is  clearly  nondeterministic. 
                                    b  n  #  output                      This  is  due  to  the  fact  that  a  letter from  B 
                                 100        2     0                      can appear in final or isolated shape when sit- 
                                 2340  #                                 uated at the end of a word, in initial or medial 
                                 334        5      b                     shape when another letter follows  it.  Because 
                                 434        5     n                      of this  nondeterministic feature, every transi- 
                                 5000             #                      tion should appear as a set.  When this set is a 
                                                                         singleton, the "only" state has been put without 
                   1.3.3   Target language of the automaton              braces for an easier reading. 
                   The alphabet for the target language L2, given          It can be easily augmented to take account of 
                   what has been said before and using the same          occasional short vowels or shadda 5 (') that could 
                   method of partioning and then reducing the al-        occur :  the transitions to add would force the 
                   phabet could be at first sight:                       automaton to loop onto the same state, what- 
                                                                         ever be it since vowels or shadda can only ap- 
                                  A2 =  {#,I,i,m,f}                      pear after a consonant and do not influence its 
                                                                         shape. 
                   where I  denotes a letter in isolated shape, i, m     1.3.5   PROLOG test program 
                   and f  stand for initial,  medial and final shape.    This program is a straightforward translation of 
                   But letters from N  have only two shapes final        the above described grammar and automaton. 
                   and isolated.  Moreover isolated and final shapes     The predicate test  allows to limit  the genera- 
                   of letters from B can only appear at the end of a     tion of inputs to a given length.  In the results 
                   word, which is not the case for the correspond-       we chose to limit  the length of the input to 6 
                   ing shapes of letters from N.  So, the following      included word boundaries. 
                   modified version of A2 will be prefered : 
                                                                         X 
                              A2 = {#,In, Ib, i,m,f~,fb}                 7, generation of elements of LI 
                   where In  stands for  isolated  shape  of a  letter   X 
                   from N  and so on.  With these symbols the tar-       m-->  [#],a,[#]. 
                   get language L2 can be described by the regular       a --> ([n];l'b]). 
                   expression :                                          a -->  a,([n];[b]). 
                                                                         X 
                         L2 = {#(I~im'fnI~)'(Ib  V fb V E)#}             7, translation automaton 
                                                                         Y. 
                   where E denotes as usual the empty string.            init ial_stat e ( 1). 
                   1.3.4    Translation automaton                        final_state (8). 
                   The translation process of a sequence of LI into      tr(1,#,l)       tr(4,b,5) 
                   a legal sequence of L2 can be operated through        tr(1 ,n,2)      tr(4,b,4) 
                   the following automaton :                             tr(1 ,b,3)      tr(4,n,6) 
                   initial states =  {1}                                 tr(1,b,7)       tr(5,#,8) 
                   final states =  {8}                                   tr(2,#,8)       tr(6,#,8) 
                   transitions :                                         tr(2,n,2)       tr(6,n,2) 
                                   n     b    #  output                  tr(2,b,3)       tr(6,b,3) 
                               1  2  {3,7}           #                   tr(2,b,7)       tr(6,b,7) 
                               2  2  {a,7}  8        I.                  tr(3,n,6)       tr(7,#,8) 
                               3  6  {4,5}  @         i                  tr(3,b,5) 
                               4   6  {4,5}    @     m                   tr(3,b,4) 
                               5  0      0     8     A                   output (I, #).        output (5,fb). 
                               6  2  {3,7}  S        f.                  output(2, 'In').  output(6,fn). 
                               7  0      0     S     Ib 
                               8  o      o     o     #                      5sign denoting a double letter 
                                                                     28 
                                                                                                  input                output 
                      output(3,i).               output(7,'Ib').                             [#,b,b,n,n,#] 
                      output(4,m).               output(8,#).                                                     [#,i,m,fn,In,#] 
                                                                                             [#,b,b,n,b,#]        [#,i,m,fn,Ib,#] 
                      forme(Input,Output):-                                                  [#,b,b,b,n,#]        [#,i,m,m,fn,#] 
                       initial_state(Is),                                                    [#,b,b,b,b,#]        [#,i,m,m,fb,#] 
                      path(Is,Fs,lnput,Output), 
                      final_state(Fs).                                                   2    WRITING  OF LETTER  HAMZA 
                                                                                         The hamza can be written in five different man- 
                      path(S,S,[],[]).                                                          $ 
                      path(SI,S2,[XIXs],[YIYs]):-                                        ners (I, !, 3, ~, ') depending mainly upon: 
                      tr(SI,X,S),                                                           •  its position within the word 
                      output(S,Y), 
                      path(S,S2,Xs,Ys).                                                     •  the preceding and the following vowel 
                      test(L):-                                                          As the choice made for coding, was to adhere 
                      m(M,[]),                                                           to a  linguistic point of view, there should have 
                      length(M,L1),                                                      been only one code for all these shapes and car- 
                       ((LI > L,!,nl,fail);true),                                        rying consonants. But, as it has just been said, 
                      printing_form(M,F),                                                to determine the correct writing of hamza, one 
                      nl,write(M),tab(1),write(F),fail.                                  has to know the surrounding vowels, and it is of 
                      test(_).                                                           common knowledge that the Arabs do not usu- 
                                                                                         ally write short vowels. These essential data be- 
                       1.3.6     Program results                                         ing missing, no algorithm can take place to ful- 
                                                                                         fil this task for a common usage such as display 
                             input                 output                                a  text on  a  screen.  Thus,  the ASMO  decided 
                            [#,n,#]               [#,In,#]                               to have distinct codes for the different carriers 
                            [#,b,#]               [#,Ib,#]                               of hamza, but  not of course for their different 
                           [#,n,n,#]            [#,In,In,#]                              shapes which can be determined as seen before. 
                           [#,n,b,#]            [#,In,Ib,#]                                 So why is this question of any interest ?  If we 
                           [#,b,n,#]             [#,i,fn,#]                              consider NLP applications for Arabic, it could 
                           [#,b,b,#]             [#,i,fb,#]                              worth  considering  this  problem  at  generation 
                         [#,n,n,n,#]          [#,In,In,In,#]                            stage.  For  instance  many  vowel  alternations 
                         [#,n,n,b,#]          [#,In,In,Ib,#]                            occur in  the conjugation of verbs, and  when a 
                         [#,n,b,n,#]           [#,In,i,fn,#]                             hamza is  present  in  the  verb  root,  the  hamza 
                         [#,n,b,b,#]           [#,In,i,fb,#]                             writing will vary accordingly. 
                         [#,b,n,n,#]           [#,i,fn,In,#]                               For example  the  verb  I~  qara'a  -  he  (has) 
                         [#,b,n,b,#]           [#,i,fn,Ib,#] 
                         [#,b,b,n,#]           [#,i,m,fn,#]                              read-changes to 5.~."  yaqra'fna-they                  read 
                         [#,b,b,b,#]           [#,i,m,fb,#]                                                        o 
                        [#,n,n,n,n,#]        [#,In,In,In,In,#]                           (present)  -  and  to  ~.z~  quri'a  -  it  (has)  been 
                        [#,n,n,n,b,#]        [#,In,In,In,Ib,#]                           read.  And  at  the generation stage vowels are 
                        [#,n,n,b,n,#]         [#,In,In,i,fn,#]                           known even if we decided  not to write them. 
                        [#,n,n,b,b,#]         [#,In,In,i,fb,#]                           The only  alternative  would  be  to  put  all  the 
                        [#,n,b,n,n,#]         [#,In,i,fn,In,#]                          forms in a dictionary. At CERTAL, our philos- 
                        [#,n,b,n,b,#]         [#,In,i,fn,Ib,#]                          ophy is to use all the possible means to reduce 
                        [#,n,b,b,n,#]         [#,In,i,m,fn,#]                           the size of dictionaries.  Hence this question ap- 
                        [#,n,b,b,b,#l         [#,In,i,m,fb,#]                            peared to us worth studying. 
                        [#,b,n,n,n,#l         [#,i,fn ,In ,In,#]                        2.1     Rules of hamza writing 
                        [#,b,n,n,b,#]         [#,i,fn,In,Ib,#] 
                        [#,b,n,b,n,#]          [#,i,fn,i,fn,#]                             1.  When a hamza is at the beginning of a word 
                        [#,b,n,b,b,#]          [#,i,fn,i,fb,#]                                it is written 
                                                                                   29
The words contained in this file might help you see if this file matches what you are looking for:

...Finite state automata and arabic writing michel fanton certal inalco rue broca f paris france email ext jussieu fr abstract isolated final medial initial has specific features which im ply computational overload for any arabicized or present only small variations example software are well known letter r s to give efficient solutions translation prob lems can be formalized as regular lan guages these more easily built that their alphabet have been reduced through a letters cannot bound the next careful linguistic analysis this reduction makes one two shapes it possible write directly an automaton with out going intermediate stage of con d w fi textual rules translated into sake efficiency paper presents moore first taken gives solution during seventies beginning choice right shape printed eighties hard controversies took place within displayed usually contextual anal arabs concerned questions lin ysis second studies complex guists computer scientists finally in problem determining carry...
Related files

Share

Help

Related files

Share

Share to social media

Help

Login Area