138x Filetype PDF File size 0.57 MB Source: aclanthology.org
Finite State Automata and Arabic Writing Michel Fanton CERTAL-INALCO 1 73 rue Broca F75013 Paris France email : certal2@ext.jussieu.fr Abstract isolated final medial initial Arabic writing has specific features, which im- ply computational overload for any arabicized or present only small variations : for example software. Finite state automata are well known letter ~r* (s) to give efficient solutions for translation prob- isolated final medial initial lems which can be formalized as regular lan- guages. These automata are as more easily built that their alphabet have been reduced through a Letters which cannot be bound to the next careful linguistic analysis. This reduction makes one have only two shapes, for example letters it possible to write directly an automaton with- out going through the intermediate stage of con- (d) and .~ (w and fi) textual rules, which have to be translated into isolated final isolated final an automaton for the sake of efficiency. This paper presents two Moore automata, the first one, taken as an example, gives a solution to the During the seventies and the beginning of the choice of right shape for a letter to be printed eighties, hard controversies took place within or displayed (usually known as contextual anal- the Arabs concerned with these questions, lin- ysis), the second one studies the more complex guists and computer scientists. Finally in 1983 problem of determining the right carrying letter the ASMO (Arab Society for Normalization for hamza. Every arabicized software has to face which unfortunately does not exist any more), these questions and finite state automata are influenced by Pr. Lakhdar-Ghazal from IERA certainly a good answer to them. (Rabat Morocco) chose to give a unique code to INTRODUCTION all shapes of one particular letter. This is cer- tainly a good choice from a linguistic point of Arabic writing has specific features, which im- view, but even so, compromises had to be made ply computational overload for any arabicized to take into account writing habits that con- software. The first one, well known now for flicted with it. Letter hamza is the most notice- many years, is the fact that Arabic printing tries able example of such a compromise for reasons to imitate handwriting. Because of this, conso- we shall explain later. nants and long vowels can have four or only two 1 CONTEXTUAL ANALYSIS shapes depending of their ability to be bound to the following letter and of where they appear in Whatever be the choice made for coding, from the word. a typesetting or a computational point of view, These shapes can be very different : for example there must be different codes for the different letter o 2 (h) shapes of a letter. So every arabicized software has to use two systems for coding : the reduced ICERTAL : Centre d'l~tudes et de Recherche en code we have just introduced and the extended Traitement Automatique des Langues, INALCO : In- code in which the different shapes have different stitut National des Langues et Civilisations Orientales ~the Arabic parts of this paper have been typeset using Klaus Lagally's ArabTEX 26 codes. Up to UNICODE, no normalization exists Mealy automata are certainly a better choice for the second one. So every arabicized software when bidirectional applications are considered. has to solve the problem of choosing the right As the question is to identify succession of sym- shape of every printed or displayed letter. bols of a certain type we found it clearer to use 1.1 Rules for letter shape a Moore automaton. determination 1.3 A Moore automaton for contextual This determination, frequently known as con- analysis textual analysis can be summarized into the fol- 1.3.1 Source language of the automaton lowing set of unformal rules: It follows from the determination rules that we 1. At the beginning of a word: only need to know what particular letter we are dealing with only at the output stage. All we • If the letter is a binding letter it takes have to know is wether it is a binding or a non the INITIAL shape. binding letter 4. The alphabet of the automaton • If it is a non binding one it takes the should be A = (#} [J L where L is the set of ISOLATED shape. arabic letters present in the reduced code and # the word boundaries. The set of letters will 2. In the middle of a word (there is at least then be partitioned into three sets • one letter following the current one): (a) If the letter is a binding letter then A--+ A'- {{#},N,B} • If it follows a binding letter it takes N being the set of non binding letters and B the the MEDIAL shape. set of binding letters. If we denote respectively • If it follows a non binding letter it n and b an arbitrary element of each of these takes the INITIAL shape. sets, the source language of the automaton can (b) If the letter is a non binding letter be reduced to: • If it follows a binding letter it takes A1 = {#, n, b} the FINAL shape. • If it follows a non binding letter it L1 = {#(n Vb)'#} takes the ISOLATED shape. 3. At the end of a word (for both types of where V denotes disjunction and • is the Kleene letters) star • If it follows a binding letter it takes the 1.3.2 Grammar and automaton for L1 FINAL shape. Language L1 can be generated by the simple grammar : • If it follows a non binding letter it takes the ISOLATED shape. m -+ #A# 1.2 Moore and Mealy automata A A( lb) Moore automata are state assigned output ma- or the as simple automaton : chines : the output function assigns output initial states = {1} symbols to each state. They differ from Mealy final states: -- {5} automata, transition assigned finite state ma- transitions chines, where output symbols are associated 4As this question has only been taken as an example, with transitions between states. Mealy au- the alphabet has been oversimplified. A full working tomata are sometimes called finite transducers. automaton should cope, as far as arabic is concerned, The two machine types have been demonstrated with two additional problems : hamza on the line to to produce the same input-output mappings 3. which no preceding letter can be bound to and l~m alif ligature. It should also give a proper treatment of non 3see (Aho and Unman, 1972) and (Hopcroft and Ull- arabic letters and symbols. But this would not affect the man, 1979) for a full account of these matters here described method. 27 This automaton is clearly nondeterministic. b n # output This is due to the fact that a letter from B 100 2 0 can appear in final or isolated shape when sit- 2340 # uated at the end of a word, in initial or medial 334 5 b shape when another letter follows it. Because 434 5 n of this nondeterministic feature, every transi- 5000 # tion should appear as a set. When this set is a singleton, the "only" state has been put without 1.3.3 Target language of the automaton braces for an easier reading. The alphabet for the target language L2, given It can be easily augmented to take account of what has been said before and using the same occasional short vowels or shadda 5 (') that could method of partioning and then reducing the al- occur : the transitions to add would force the phabet could be at first sight: automaton to loop onto the same state, what- ever be it since vowels or shadda can only ap- A2 = {#,I,i,m,f} pear after a consonant and do not influence its shape. where I denotes a letter in isolated shape, i, m 1.3.5 PROLOG test program and f stand for initial, medial and final shape. This program is a straightforward translation of But letters from N have only two shapes final the above described grammar and automaton. and isolated. Moreover isolated and final shapes The predicate test allows to limit the genera- of letters from B can only appear at the end of a tion of inputs to a given length. In the results word, which is not the case for the correspond- we chose to limit the length of the input to 6 ing shapes of letters from N. So, the following included word boundaries. modified version of A2 will be prefered : X A2 = {#,In, Ib, i,m,f~,fb} 7, generation of elements of LI where In stands for isolated shape of a letter X from N and so on. With these symbols the tar- m--> [#],a,[#]. get language L2 can be described by the regular a --> ([n];l'b]). expression : a --> a,([n];[b]). X L2 = {#(I~im'fnI~)'(Ib V fb V E)#} 7, translation automaton Y. where E denotes as usual the empty string. init ial_stat e ( 1). 1.3.4 Translation automaton final_state (8). The translation process of a sequence of LI into tr(1,#,l) tr(4,b,5) a legal sequence of L2 can be operated through tr(1 ,n,2) tr(4,b,4) the following automaton : tr(1 ,b,3) tr(4,n,6) initial states = {1} tr(1,b,7) tr(5,#,8) final states = {8} tr(2,#,8) tr(6,#,8) transitions : tr(2,n,2) tr(6,n,2) n b # output tr(2,b,3) tr(6,b,3) 1 2 {3,7} # tr(2,b,7) tr(6,b,7) 2 2 {a,7} 8 I. tr(3,n,6) tr(7,#,8) 3 6 {4,5} @ i tr(3,b,5) 4 6 {4,5} @ m tr(3,b,4) 5 0 0 8 A output (I, #). output (5,fb). 6 2 {3,7} S f. output(2, 'In'). output(6,fn). 7 0 0 S Ib 8 o o o # 5sign denoting a double letter 28 input output output(3,i). output(7,'Ib'). [#,b,b,n,n,#] output(4,m). output(8,#). [#,i,m,fn,In,#] [#,b,b,n,b,#] [#,i,m,fn,Ib,#] forme(Input,Output):- [#,b,b,b,n,#] [#,i,m,m,fn,#] initial_state(Is), [#,b,b,b,b,#] [#,i,m,m,fb,#] path(Is,Fs,lnput,Output), final_state(Fs). 2 WRITING OF LETTER HAMZA The hamza can be written in five different man- path(S,S,[],[]). $ path(SI,S2,[XIXs],[YIYs]):- ners (I, !, 3, ~, ') depending mainly upon: tr(SI,X,S), • its position within the word output(S,Y), path(S,S2,Xs,Ys). • the preceding and the following vowel test(L):- As the choice made for coding, was to adhere m(M,[]), to a linguistic point of view, there should have length(M,L1), been only one code for all these shapes and car- ((LI > L,!,nl,fail);true), rying consonants. But, as it has just been said, printing_form(M,F), to determine the correct writing of hamza, one nl,write(M),tab(1),write(F),fail. has to know the surrounding vowels, and it is of test(_). common knowledge that the Arabs do not usu- ally write short vowels. These essential data be- 1.3.6 Program results ing missing, no algorithm can take place to ful- fil this task for a common usage such as display input output a text on a screen. Thus, the ASMO decided [#,n,#] [#,In,#] to have distinct codes for the different carriers [#,b,#] [#,Ib,#] of hamza, but not of course for their different [#,n,n,#] [#,In,In,#] shapes which can be determined as seen before. [#,n,b,#] [#,In,Ib,#] So why is this question of any interest ? If we [#,b,n,#] [#,i,fn,#] consider NLP applications for Arabic, it could [#,b,b,#] [#,i,fb,#] worth considering this problem at generation [#,n,n,n,#] [#,In,In,In,#] stage. For instance many vowel alternations [#,n,n,b,#] [#,In,In,Ib,#] occur in the conjugation of verbs, and when a [#,n,b,n,#] [#,In,i,fn,#] hamza is present in the verb root, the hamza [#,n,b,b,#] [#,In,i,fb,#] writing will vary accordingly. [#,b,n,n,#] [#,i,fn,In,#] For example the verb I~ qara'a - he (has) [#,b,n,b,#] [#,i,fn,Ib,#] [#,b,b,n,#] [#,i,m,fn,#] read-changes to 5.~." yaqra'fna-they read [#,b,b,b,#] [#,i,m,fb,#] o [#,n,n,n,n,#] [#,In,In,In,In,#] (present) - and to ~.z~ quri'a - it (has) been [#,n,n,n,b,#] [#,In,In,In,Ib,#] read. And at the generation stage vowels are [#,n,n,b,n,#] [#,In,In,i,fn,#] known even if we decided not to write them. [#,n,n,b,b,#] [#,In,In,i,fb,#] The only alternative would be to put all the [#,n,b,n,n,#] [#,In,i,fn,In,#] forms in a dictionary. At CERTAL, our philos- [#,n,b,n,b,#] [#,In,i,fn,Ib,#] ophy is to use all the possible means to reduce [#,n,b,b,n,#] [#,In,i,m,fn,#] the size of dictionaries. Hence this question ap- [#,n,b,b,b,#l [#,In,i,m,fb,#] peared to us worth studying. [#,b,n,n,n,#l [#,i,fn ,In ,In,#] 2.1 Rules of hamza writing [#,b,n,n,b,#] [#,i,fn,In,Ib,#] [#,b,n,b,n,#] [#,i,fn,i,fn,#] 1. When a hamza is at the beginning of a word [#,b,n,b,b,#] [#,i,fn,i,fb,#] it is written 29
no reviews yet
Please Login to review.