ElixirFM – Implementation of Functional Arabic Morphology

Otakar Smrž
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University in Prague
otakar.smrz@mff.cuni.cz

Abstract

Functional Arabic Morphology is a formulation of the Arabic inflectional system seeking the working interface between morphology and syntax. ElixirFM is its high-level implementation that reuses and extends the Functional Morphology library for Haskell. Inflection and derivation are modeled in terms of paradigms, grammatical categories, lexemes and word classes. The computation of analysis or generation is conceptually distinguished from the general-purpose linguistic model. The lexicon of ElixirFM is designed with respect to abstraction, yet is no more complicated than printed dictionaries. It is derived from the open-source Buckwalter lexicon and is enhanced with information sourcing from the syntactic annotations of the Prague Arabic Dependency Treebank.

1 Overview

One can observe several different streams both in the computational and the purely linguistic modeling of morphology. Some are motivated by the need to analyze word forms as to their compositional structure, others consider word inflection as being driven by the underlying system of the language and the formal requirements of its grammar.

In Section 2, before we focus on the principles of ElixirFM, we briefly follow the characterization of morphological theories presented by Stump (2001) and extend the classification to the most prominent computational models of Arabic morphology (Beesley, 2001; Buckwalter, 2002; Habash et al., 2005; El Dada and Ranta, 2006).

In Section 3, we survey some of the categories of the syntax–morphology interface in Modern Written Arabic, as described by the Functional Arabic Morphology. In passing, we will introduce the basic concepts of programming in Haskell, a modern purely functional language that is an excellent choice for declarative generative modeling of morphologies, as Forsberg and Ranta (2004) have shown.

Section 4 will be devoted to describing the lexicon of ElixirFM. We will develop a so-called domain-specific language embedded in Haskell with which we will achieve lexical definitions that are simultaneously a source code that can be checked for consistency, a data structure ready for rather independent processing, and still an easy-to-read-and-edit document resembling the printed dictionaries.

In Section 5, we will illustrate how rules of inflection and derivation interact with the parameters of the grammar and the lexical information. We will demonstrate, also with reference to the Functional Morphology library (Forsberg and Ranta, 2004), the reusability of the system in many applications, including computational analysis and generation in various modes, exploring and exporting of the lexicon, printing of the inflectional paradigms, etc.

2 Morphological Models

According to Stump (2001), morphological theories can be classified along two scales. The first one deals with the core or the process of inflection:

    lexical        theories associate word's morphosyntactic properties with affixes

    inferential    theories consider inflection as a result of operations on lexemes; morphosyntactic properties are expressed by the rules that relate the form in a given paradigm to the lexeme

The second opposition concerns the question of inferability of meaning, and theories divide into:

    incremental    words acquire morphosyntactic properties only in connection with acquiring the inflectional exponents of those properties

    realizational  association of a set of properties with a word licenses the introduction of the exponents into the word's morphology

Evidence favoring inferential–realizational theories over the other three approaches is presented by Stump (2001) as well as Baerman et al. (2006) or Spencer (2004). In trying to classify the implementations of Arabic morphological models, let us reconsider this cross-linguistic observation:

    The morphosyntactic properties associated with an inflected word's individual inflectional markings may underdetermine the properties associated with the word as a whole. (Stump, 2001, p. 7)

How do the current morphological analyzers interpret, for instance, the number and gender of the Arabic broken masculine plurals ǧudud جدد new ones or quḍāh قضاة judges, or the case of mustawan مستوى a level? Do they identify the values of these features that the syntax actually operates with, or is the resolution hindered by some too generic assumptions about the relation between meaning and form?

Many of the computational models of Arabic morphology, including in particular (Beesley, 2001), (Ramsay and Mansur, 2001) or (Buckwalter, 2002), are lexical in nature. As they are not designed in connection with any syntax–morphology interface, their interpretations are destined to be incremental.

Some signs of a lexical–realizational system can be found in (Habash, 2004). The author mentions and fixes the problem of underdetermination of inherent number with broken plurals, when developing a generative counterpart to (Buckwalter, 2002).

The computational models in (Soudi et al., 2001) and (Habash et al., 2005) attempt the inferential–realizational direction. Unfortunately, they implement only sections of the Arabic morphological system. The Arabic resource grammar in the Grammatical Framework (El Dada and Ranta, 2006) is perhaps the most complete inferential–realizational implementation to date. Its style is compatible with the linguistic description in e.g. (Fischer, 2001) or (Badawi et al., 2004), but the lexicon is now very limited and some other extensions for data-oriented computational applications are still needed.

ElixirFM is inspired by the methodology in (Forsberg and Ranta, 2004) and by functional programming, just like the Arabic GF is (El Dada and Ranta, 2006). Nonetheless, ElixirFM reuses the Buckwalter lexicon (2002) and the annotations in the Prague Arabic Dependency Treebank (Hajič et al., 2004), and implements a yet more refined linguistic model.

3 Morphosyntactic Categories

Functional Arabic Morphology and ElixirFM re-establish the system of inflectional and inherent morphosyntactic properties (alternatively named grammatical categories or features) and distinguish precisely the senses of their use in the grammar.

In Haskell, all these categories can be represented as distinct data types that consist of uniquely identified values. We can for instance declare that the category of case in Arabic discerns three values, that we also distinguish three values for number or person, or two values of the given names for verbal voice:

    data Case   = Nominative | Genitive | Accusative
    data Number = Singular | Dual | Plural
    data Person = First | Second | Third
    data Voice  = Active | Passive

All these declarations introduce new enumerated types, and we can use some easily-defined methods of Haskell to work with them. If we load this (slightly extended) program into the interpreter,¹ we can e.g. ask what category the value Genitive belongs to (seen as the :: type signature), or have it evaluate the list of the values that Person allows:

    ? :type Genitive     →  Genitive :: Case
    ? enum :: [Person]   →  [First,Second,Third]

¹ http://www.haskell.org/

Lists in Haskell are data types that can be parametrized by the type that they contain. So, the value [Active, Active, Passive] is a list of three elements of type Voice, and we can write this if necessary as the signature :: [Voice]. Lists can also be empty or have just one single element. We denote lists containing some type a as being of type [a].

Haskell provides a number of useful types already, such as the enumerated boolean type or the parametric type for working with optional values:

    data Bool    = True | False
    data Maybe a = Just a | Nothing

Similarly, we can define a type that couples other values together. In the general form, we can write

    data Couple a b = a :-: b

which introduces the value :-: as a container for some value of type a and another of type b.²

² Infix operators can also be written as prefix functions if enclosed in (). Functions can be written as operators if enclosed in ``. We will exploit this when defining the lexicon's notation.

Let us return to the grammatical categories. Inflection of nominals is subject to several formal requirements, which different morphological models decompose differently into features and values that are not always complete with respect to the inflectional system, nor mutually orthogonal. We will explain what we mean by revisiting the notions of state and definiteness in contemporary written Arabic.

To minimize the confusion of terms, we will depart from the formulation presented in (El Dada and Ranta, 2006). In there, there is only one relevant category, which we can reimplement as State':

    data State' = Def | Indef | Const

Variation of the values of State' would enable generating the forms al-kitābu الكتاب def., kitābun كتاب indef., and kitābu كتاب const. for the nominative singular of book. This seems fine until we explore more inflectional classes. The very variation for the nominative plural masculine of the adjective high gets ar-rafīʿūna الرفيعون def., rafīʿūna رفيعون indef., and rafīʿū رفيعو const. But what value does the form ar-rafīʿū الرفيعو, found in improper annexations such as in al-masʾūlūna ʾr-rafīʿū ʾl-mustawā المسؤولون الرفيعو المستوى the-officials the-highs-of the-level, receive?

It is interesting to consult for instance (Fischer, 2001), where state has exactly the values of State', but where the definite state Def covers even forms without the prefixed al- ال article, since also some separate words like lā لا no or yā يا oh can have the effects on inflection that the definite article has. To distinguish all the forms, we might think of keeping state in the sense of Fischer, and adding a boolean feature for the presence of the definite article... However, we would get one unacceptable combination of the values claiming the presence of the definite article and yet the indefinite state, i.e. possibly the indefinite article or the diptotic declension.

Functional Arabic Morphology refactors the six different kinds of forms (if we consider all inflectional situations) depending on two parameters. The first controls prefixation of the (virtual) definite article, the other reduces some suffixes if the word is a head of an annexation. In ElixirFM, we define these parameters as type synonyms to what we recall:

    type Definite = Maybe Bool
    type Annexing = Bool

The Definite values include Just True for forms with the definite article, Just False for forms in some compounds or after lā لا or yā يا (absolute negatives or vocatives), and Nothing for forms that reject the definite article for other reasons.

Functional Arabic Morphology considers state as a result of coupling the two independent parameters:

    type State = Couple Definite Annexing

Thus, the indefinite state Indef describes a word void of the definite article(s) and not heading an annexation, i.e. Nothing :-: False. Conversely, ar-rafīʿū الرفيعو is in the state Just True :-: True. The classical construct state is Nothing :-: True. The definite state is Just _ :-: False, where _ is True for El Dada and Ranta and False for Fischer. We may discover that now all the values of State are meaningful.³

³ With Just False :-: True, we can annotate e.g. the 'incorrectly' underdetermined rafīʿū رفيعو in hum-u ʾl-masʾūlūna rafīʿū ʾl-mustawā هم المسؤولون رفيعو المستوى they-are the-officials highs-of the-level, i.e. they are the high-level officials.

Type declarations are also useful for defining in what categories a given part of speech inflects. For verbs, this is a bit more involved, and we leave it for Figure 2. For nouns, we set this algebraic data type:

    data ParaNoun = NounS Number Case State

In the interpreter, we can now generate all 54 combinations of inflectional parameters for nouns:

    ? [ NounS n c s | n <- enum, c <- enum,
                      s <- values ]

The function values is analogous to enum, and both need to know their type before they can evaluate. The 'magic' is that the bound variables n, c, and s have their type determined by the NounS constructor, so we need not type anything explicitly. We used the list comprehension syntax to cycle over the lists that enum and values produce, cf. (Hudak, 2000).

4 ElixirFM Lexicon

Unstructured text is just a list of characters, or string:

    type String = [Char]

Yet words do have structure, particularly in Arabic. We will work with strings as the superficial word forms, but the internal representations will be more abstract (and computationally more efficient, too).

The definition of lexemes can include the derivational root and pattern information if appropriate, cf. (Habash et al., 2005), and our model will encourage this. The surface word kitāb كتاب book can decompose to the triconsonantal root k t b كتب and the morphophonemic pattern FiCAL of type PatternT:

    data PatternT = FaCaL | FAL | FaCY |
                    FiCAL | FuCCAL | {- ... -}
                    MustaFCaL | MustaFaCL
        deriving (Eq, Enum, Show)

The deriving clause associates PatternT with methods for testing equality, enumerating all the values, and turning the names of the values into strings:

    ? show FiCAL  →  "FiCAL"

We choose to build on morphophonemic patterns rather than CV patterns and vocalisms. Words like istaǧāb استجاب to respond and istaǧwab استجوب to interrogate have the same underlying VstVCCVC pattern, so information on CV patterns alone would not be enough to reconstruct the surface forms. Morphophonemic patterns, in this case IstaFAL and IstaFCaL, can easily be mapped to the hypothetical […] compact way. Of course, ElixirFM provides functions for properly interlocking the patterns with the roots:

    ? merge "k t b" FiCAL      →  "kitAb"
    ? merge "^g w b" IstaFAL   →  "ista^gAb"
    ? merge "^g w b" IstaFCaL  →  "ista^gwab"
    ? merge "s ' l" MaFCUL     →  "mas'Ul"
    ? merge "z h r" IFtaCaL    →  "izdahar"

The izdahar ازدهر to flourish case exemplifies that exceptionless assimilations need not be encoded in the patterns, but can instead be hidden in rules.

The whole generative model adopts the multi-purpose notation of ArabTeX (Lagally, 2004) as a meta-encoding of both the orthography and phonology. Therefore, instantiation of the "'" hamza carriers or other merely orthographic conventions do not obscure the morphological model. With Encode Arabic⁴ interpreting the notation, ElixirFM can at the surface level process the original Arabic script (non-)vocalized to any degree or work with some kind of transliteration or even transcription thereof.

⁴ http://sf.net/projects/encode-arabic/

Morphophonemic patterns represent the stems of words. The various kinds of abstract prefixes and suffixes can be expressed either as atomic values, or as literal strings wrapped into extra constructors:

    data Prefix = Al | LA | Prefix String
    data Suffix = Iy | AT | At | An | Ayn |
                  Un | In | Suffix String

    al = Al; lA = LA    -- function synonyms
    aT = AT; ayn = Ayn; aN = Suffix "aN"

Affixes and patterns are arranged together via the Morphs a data type, where a is a triliteral pattern PatternT or a quadriliteral PatternQ or a non-templatic word stem Identity of type PatternL:

    data PatternL = Identity
    data PatternQ = KaRDaS | KaRADiS {- ... -}
    data Morphs a = Morphs a [Prefix] [Suffix]

The word lā-silkīy لاسلكي wireless can thus be decomposed as the root s l k سلك and the value Morphs FiCL [LA] [Iy]. Shunning such concrete representations, we define new operators >| and |< that denote prefixes, resp. suffixes, inside Morphs a:

    ? lA >| FiCL |< Iy  →  Morphs FiCL [LA] [Iy]

Implementing >| and |< […]

    class Morphing a b | a -> b where
        morph :: a -> Morphs b

    instance Morphing (Morphs a) a where
        morph = id

    instance Morphing PatternT PatternT where
        morph x = Morphs x [] []

The instance declarations ensure how the morph method would turn values of type a into Morphs b.
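The enumeration of noun parameters from Section 3 can be tried out in a stand-alone module. The following is a minimal sketch, not the ElixirFM source: the paper takes enum and values from its (slightly extended) program, so the definitions of enum and values below are hypothetical stand-ins built on the standard Enum and Bounded classes, and the name allNounParams is introduced only for illustration.

```haskell
-- Categories as enumerated types, as declared in Section 3.
data Case   = Nominative | Genitive | Accusative
    deriving (Eq, Show, Enum, Bounded)

data Number = Singular | Dual | Plural
    deriving (Eq, Show, Enum, Bounded)

-- A container coupling two values, written with the infix constructor :-:.
data Couple a b = a :-: b
    deriving (Eq, Show)

type Definite = Maybe Bool
type Annexing = Bool
type State    = Couple Definite Annexing

data ParaNoun = NounS Number Case State
    deriving (Eq, Show)

-- ASSUMPTION: stand-in for the enum method used in the paper;
-- it lists every value of a bounded enumerated type.
enum :: (Enum a, Bounded a) => [a]
enum = [minBound .. maxBound]

-- ASSUMPTION: stand-in for the values method, specialised here to State.
-- Three Definite values times two Annexing values give six states.
values :: [State]
values = [ d :-: a | d <- [Nothing, Just False, Just True], a <- enum ]

-- The list comprehension from the text; the types of n, c and s
-- are fixed by the NounS constructor.
allNounParams :: [ParaNoun]
allNounParams = [ NounS n c s | n <- enum, c <- enum, s <- values ]
```

Loaded into the interpreter, length allNounParams evaluates to 54, i.e. three numbers times three cases times six states.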
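The interlocking of roots and patterns shown in Section 4 can be approximated in a few lines. This is a deliberately naive sketch under stated assumptions, not the merge function of ElixirFM: it treats the constructor name of a pattern as a template, substitutes the radicals for the letters F, C and L, and lowercases a leading non-radical letter. The abridged PatternT and the helper names are hypothetical.

```haskell
import Data.Char (toLower)

-- Abridged pattern inventory; only the patterns used in the examples.
data PatternT = FiCAL | IstaFAL | IstaFCaL | MaFCUL | IFtaCaL
    deriving (Eq, Enum, Show)

-- ASSUMPTION: a naive stand-in for merge. The root is a string of
-- space-separated radicals in ArabTeX notation, e.g. "k t b".
merge :: String -> PatternT -> String
merge root pat = concatMap slot (lowerHead (show pat))
  where
    radicals = words root

    -- Look up the i-th radical; "?" marks a radical the root lacks.
    radical :: Int -> String
    radical i = case drop i radicals of
        (r : _) -> r
        []      -> "?"

    -- F, C and L are the slots for the first, second and third radical.
    slot 'F' = radical 0
    slot 'C' = radical 1
    slot 'L' = radical 2
    slot c   = [c]

    -- A leading capital that is not a radical slot is only an artefact
    -- of Haskell's constructor naming, so it is lowercased.
    lowerHead (c : cs) | c `notElem` "FCL" = toLower c : cs
    lowerHead cs = cs
```

With these definitions, merge "k t b" FiCAL gives "kitAb" and merge "^g w b" IstaFCaL gives "ista^gwab". The sketch applies no phonological rules, so merge "z h r" IFtaCaL yields "iztahar" rather than the assimilated "izdahar" that the text attributes to rules outside the patterns.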
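One way the operators >| and |< might be built on top of the Morphing class is sketched below. The class and its two instances follow the text; the operator definitions, their fixities, and the order in which affixes are accumulated are assumptions made for illustration and may differ from ElixirFM. The affix and pattern types are abridged.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies, FlexibleInstances #-}

data Prefix = Al | LA | Prefix String
    deriving (Eq, Show)

data Suffix = Iy | AT | At | An | Ayn | Un | In | Suffix String
    deriving (Eq, Show)

-- Abridged pattern inventory.
data PatternT = FaCaL | FiCL | FiCAL
    deriving (Eq, Enum, Show)

data Morphs a = Morphs a [Prefix] [Suffix]
    deriving (Eq, Show)

-- Function synonyms from the text.
al, lA :: Prefix
al = Al
lA = LA

aT :: Suffix
aT = AT

-- The class and instances as given in the text: the argument type a
-- determines the pattern type b carried by the resulting Morphs b.
class Morphing a b | a -> b where
    morph :: a -> Morphs b

instance Morphing (Morphs a) a where
    morph = id

instance Morphing PatternT PatternT where
    morph x = Morphs x [] []

-- ASSUMPTION: fixities chosen so that suffixation binds tighter than
-- prefixation and both affix lists end up in surface order.
infixr 7 >|
infixl 8 |<

-- ASSUMPTION: a prefix is put in front of the prefixes collected so far.
(>|) :: Morphing a b => Prefix -> a -> Morphs b
p >| x = case morph x of
    Morphs t ps ss -> Morphs t (p : ps) ss

-- ASSUMPTION: a suffix is appended after the suffixes collected so far.
(|<) :: Morphing a b => a -> Suffix -> Morphs b
x |< s = case morph x of
    Morphs t ps ss -> Morphs t ps (ss ++ [s])
```

Under these assumptions, lA >| FiCL |< Iy evaluates to Morphs FiCL [LA] [Iy], matching the interpreter example in the text. The functional dependency is what lets a bare pattern such as FiCL be used where a Morphs value is expected without any type annotation.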