An Attempt at Multilingual POS Tagging for Tamil
Madhu Ramanathan, Vijay Chidambaram, Ashish Patro
Department of Computer Sciences
University of Wisconsin Madison
Abstract

Part of speech (POS) tagging is the process of assigning a syntactic category to every word in a corpus. In this project we apply supervised and unsupervised methods of POS tagging to Tamil, an agglutinative language of ancient Dravidian origin, using a multilingual parallel corpus. The parallel corpus covers three other languages, namely Hindi, English and French. We experimented on monolingual, bilingual and multilingual corpora using various models and techniques, such as the HMM, SVM and CRF models and the projection and probability re-estimation technique (Yarowsky, 2001), and carried out a detailed performance comparison in an attempt to capture the properties of the language that aid POS tagging accuracy. Supervised CRF modeling with a variety of features on a monolingual Tamil corpus revealed that word-specific features such as prefixes and suffixes produce an increase of 10% in accuracy, the highest among all combinations of features. Bilingual and multilingual learning shows that the addition of other languages generally decreases accuracy, mainly because of the one-to-many association among the words and, in addition, because of the accuracy lost at every stage of the pre-processing steps needed to obtain the word-level pairing. The results of our experiments clearly reflect the relatively free word order and agglutinative nature of the Tamil language and motivate the need for a morpheme-based POS tagger to attain greater accuracy.

1 Introduction

Part of speech (POS) tagging is the process of assigning a part of speech or other lexical class marker to each word in a sentence. POS tagging is an essential part of many applications, such as speech recognition, natural language parsing, information retrieval and machine translation.

Our aim is to perform POS tagging for Tamil, a Dravidian language spoken in the southern part of India that has existed for over two thousand years. Tamil and Sanskrit are considered the two longest surviving classical languages in India, from which the other Dravidian and Indo-Aryan languages have been derived. Tamil also has a rich set of literary works, such as the Thirukkural, which have been manually translated into a number of languages. Our aim is to use such parallel corpora to build a method that improves the accuracy of existing taggers, which can in turn be used for other applications such as automatic machine translation, speech recognition and parsing.

Tamil has a relatively free word order and an agglutinative grammar, in which suffixes are used to mark noun class, number and case, verb tense and other grammatical categories. Tamil words consist of a lexical root to which one or more affixes are attached. Most Tamil affixes are suffixes. Tamil suffixes are of two types: derivational suffixes, which change either the part of speech of the word or its meaning, and inflectional suffixes, which mark categories such as person, number, mood, tense, etc.
There is no absolute limit on the length and extent of agglutination, which can lead to long words with a large number of suffixes (Tamil, Wikipedia). Much of Tamil grammar is extensively described in the oldest known grammar book for Tamil, the Tolkāppiyam.

The agglutinative nature of Tamil makes tagging a complex process. Various methodologies, both statistical and rule based, have been developed and widely used for POS tagging in different languages. Because Tamil is a free-form language with a large variety of morphological combinations, inflections and exceptions, developing a rule based method for it would require a lot of effort as well as extensive knowledge of its complex grammatical structures, which makes such a method almost impractical. Supervised statistical methods require a large, reliable annotated corpus for training. At the same time, a considerably large amount of sentence-aligned parallel data (UDHR corpora, Bible corpora, Thirukkural corpora, TV news, newspaper articles, etc.) is available in a number of languages and can be put to use for this purpose. Many of those languages, such as the European languages, have pre-trained POS taggers that can be used to label the text in those languages. Considering these factors, we tried to address three main questions:

  • When trained on a monolingual corpus, what properties/features of the language contribute to increasing the POS tagging accuracy?
  • Does the addition of one or more languages from a parallel corpus help in increasing the POS tagging accuracy?
  • If the addition of languages does improve the tagging accuracy, are there any specific properties of the language being paired that lead to the increase in accuracy?

To answer these questions we experimented with monolingual, bilingual and multilingual corpora using various methods: the SVM model, the HMM model, the CRF model, and the bilingual projection and probability re-estimation method (Yarowsky, 2001). The languages that we chose are Hindi, English and French. Tamil follows an SOV word order, and we chose Hindi because it is a well-studied Indian language with the same word order. We also chose two languages with SVO word order, namely English and French, to see how much the word order property influences the accuracy of the results.

The remainder of the paper is organized as follows. Section 2 deals with the related work, Section 3 describes the method, Section 4 covers the experiments and analysis, and Section 5 gives the concluding remarks.

2 Related Work

Tamil is one of the classical Indian languages and has a very strong linguistic base with a well-defined set of morpho-syntactic rules. However, parsing, development of parsing models, chunking, generation of treebanks, POS tagging, morphological analysis, and the development of semi-automated and automated tools for these processes in Tamil are at a nascent stage. The existing work on POS tagging is based on morphological analyzers built by Vasu Renganathan (Renganathan, 2001), by Ganesan, and by RCILTS-T. Due to their constraints (limited coverage of morpho-syntactic and semantic rules, non-availability of methodologies for large scale development of parsing models, non-availability of standards, non-applicability of statistical methods, and resource deficiency), the reported tools cannot be used directly for all types of NLP applications. These existing tools have been developed using rule based approaches. However, rule based techniques cannot fully address all inflectional and derivational word forms, nor peculiar characteristics such as relatively free word order, syntax with semantics, and long distance relationships. Rule based techniques can therefore achieve only moderate accuracy. This motivates the need for a statistical approach to POS tagging in Tamil.

Various methods for bilingual POS tagging, such as projection and induction, have been used to train highly accurate part-of-speech taggers (Yarowsky, 2001) for languages such as Vietnamese (Dieng, 2003). As one of our methods we use Yarowsky's robust projection and probability re-estimation technique to learn the POS tags for Tamil in a semi-supervised manner. There has been some recent work on bilingual (Snyder, 2008) and multilingual learning (Snyder, 2009), where the results show that adding languages generally increases the accuracy of unsupervised learning. There has been one attempt at a bilingual rule based POS tagger for Tamil using projection and induction techniques, which reports an increase in performance (Selvam, 2009). In contrast, we aim for a purely statistical approach to POS tagging that does not require any prior knowledge of the grammar rules.

3 Methodology

We used the Universal Declaration of Human Rights (UDHR) corpus, which has been translated into over 300 languages, for our experimentation (UDHR, UDHRcorpus). The UDHR corpus consists of 75 lines of short text translated into all 300 languages, of which we chose the text for our set of languages: Tamil, Hindi, English and French. The following sections describe in detail the preprocessing step and the monolingual, bilingual and multilingual learning approaches that we experimented with.

3.1 Preprocessing

Before working on this data, we applied a preprocessing step to make it usable for our experiments. We arranged the text by pairing the Tamil text with each of the other 3 languages, which gave a total of 3 language pairs. Sentence alignment was done using Microsoft Research's Bilingual Sentence Aligner tool (Microsoft, 2003). The sentence-aligned files were given to the GIZA++ word aligner, and the union method was used to obtain the word alignments (Giza, 1999). The union method was chosen over the intersection method, which would give a 1-1 pairing: because Tamil is agglutinative, pairing it with languages that do not possess that property would yield very low recall under the intersection method of word alignment. The UDHR corpus is plain text without any POS tags for the words. For well-studied languages like Hindi, English and French we used existing pre-trained taggers. For Hindi we used the tagger developed by the Society for Natural Language Technology Research, and for English and French we used the TreeTagger tool (TreeTagger, 1994). For Tamil, as no such pre-trained tagger was available in a usable form, we had to hand-tag the corpus. Table 1 shows the set of 12 tags used for tagging the Tamil corpus. These tags were chosen because they occur frequently and also appear in the other languages. We performed this tagging to the best of our ability, though some errors may have been introduced in this step. These tags were used as the gold standard for all our experiments.

    Tag     Description
    NN      Noun
    CNN     Compound noun
    PRN     Pronoun
    CPRN    Compound pronoun
    VRB     Verb
    ADJ     Adjective
    ADV     Adverb
    CONJ    Conjunction
    PP      Preposition
    NUM     Number
    X       Others
    P       Punctuation marks

    Table 1: Tagset used for Tamil corpus

3.2 Monolingual Supervised Learning

In this method we use the monolingual Tamil corpus alone for supervised learning with various models, both to estimate the maximum accuracy that can be obtained using a single language and to find out which features of the language aid in increasing the tagging accuracy. For this purpose we split the dataset into training and test sets. The training set comprised 80% of the lines and the test set the remaining 20%. Since the corpus was small, we used 10-fold cross validation to estimate the accuracies. We trained three well known models, namely the Hidden Markov Model (HMM), Support Vector Machines (SVM) and Conditional Random Fields (CRF).
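As an illustration of the cross-validation protocol in Section 3.2, the following is a minimal sketch of how a small corpus of lines can be divided into k folds. The function name `k_fold_indices`, the fixed random seed and the choice to split by line index are assumptions made for this example; the text does not specify how the folds were drawn or how the 80/20 split relates to the 10 folds.

```python
import random


def k_fold_indices(n_lines, k=10, seed=0):
    """Yield (train, test) lists of line indices for k-fold cross-validation.

    Hypothetical helper for illustration only, not the tooling used in the
    experiments.
    """
    indices = list(range(n_lines))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i,
    # so fold sizes differ by at most one line.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    # 75 is the number of lines reported for the UDHR corpus.
    for fold_no, (train, test) in enumerate(k_fold_indices(75, k=10)):
        print(fold_no, len(train), len(test))
```

Each line appears in exactly one test fold, so the accuracy averaged over the folds uses every hand-tagged line once for evaluation, which matters for a corpus this small.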
3.2.1 Hidden Markov Model (HMM)

We used a bigram HMM together with the Viterbi algorithm, which finds the most likely tag sequence for a sentence. A maximum likelihood estimator was used to determine the emission and transition parameters. The transition and emission parameters were calculated as follows:

\[
P(t \mid t') = \frac{\mathrm{count}(t', t)}{\mathrm{count}(t')}, \qquad
P(w \mid t) = \frac{\mathrm{count}(t, w) + \delta}{\mathrm{count}(t) + |V| \cdot \delta} \tag{1}
\]

where t' denotes the preceding tag, δ is a smoothing constant added to each emission count, and |V| is the size of the vocabulary.

After determining the emission and transition probabilities, the probability of a tag sequence s = t_1 ... t_n for a word sequence w = w_1 ... w_n was determined using the following formula:

\[
P(s, w) = \prod_{i} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)
\]

3.2.2 Support Vector Machines

We used the SVMTool, a general POS tagger based on Support Vector Machines, to train and test on our corpus. The tool offers several tagging modes, each adding a little more complexity to the tagging. We tried a set of six strategies to determine the one that gives the maximum accuracy. The six strategies are listed in Table 2.

    Strategy       Description
    0: one-pass    default strategy
    1: two-pass    revisiting results and relabeling
    2: one-pass    robust against unknown words
    4: one-pass    very robust against unknown words
    5: one-pass    sentence-level likelihood
    6: one-pass    robust sentence-level likelihood

    Table 2: Strategies used in the SVM Model

3.2.3 Conditional Random Fields

For the conditional random fields we used the CRF++ tool, a simple, customizable, open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. CRF++ allows us to define our own set of features. It requires the training and testing files to be in a specific format, and it requires a template file specifying the unigram and bigram features. For every unigram and bigram feature specified in the template file, the tool generates a set of binary feature functions associating the specified feature with the output category. Using this tool we built our training and testing files in the required formats, then trained and tested models on a variety of feature combinations. The combinations of features are listed in Table 3.

    Feature    Description
    1          Actual word
    2          1 Previous word + Actual word
    3          2 Previous words + Actual word
    4          2 Previous words + Actual word
    5          4 Previous words + Actual word
    6          1 Next word + Actual word
    7          2 Next words + Actual word
    8          3 Next words + Actual word
    9          1 Previous word + 1 Next word + Actual word
    10         1 Prefix + Actual word
    11         2 Prefixes + Actual word
    12         Prefixes + 2 Suffixes + Actual word
    13         Prefixes + 4 Suffixes + Actual word
    14         Prefixes + 5 Suffixes + Actual word

    Table 3: Feature sets used in monolingual learning

From the results obtained, we try to determine the features that give the maximum increase in accuracy for POS tagging.

3.3 Bilingual Learning

3.3.1 Supervised

For the supervised method of bilingual learning we used the same CRF++ tool described above. Tamil was paired with each of the other three languages separately, and the tags from the foreign language were projected onto the Tamil words using the word alignments. The training and testing files for the CRF++ tool were then prepared, and the template files were created considering the various combinations of possible features that could affect the accuracy of tagging. The feature sets that we tested on are given in Table 4.

3.3.2 Semi-Supervised

For this we used the projection and aggressive tag probability re-estimation technique (Yarowsky, 2001). We used POS tag projection from an input language (e.g. English) to Tamil using the word alignments computed during the pre-processing
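The tag projection used in Sections 3.3.1 and 3.3.2 can be illustrated with a minimal sketch. It assumes that word alignments are available as (source index, Tamil index) pairs and that one-to-many links, which the union method of alignment produces, are resolved by a simple majority vote. The function name `project_tags`, the data layout, the voting rule and the fallback to the "X" (Others) tag are all assumptions made for this example, and the probability re-estimation step of Yarowsky (2001) is not shown.

```python
from collections import Counter


def project_tags(source_tags, alignments, n_target, default="X"):
    """Project POS tags from source-language words onto target words.

    source_tags: list of tags, one per source word.
    alignments:  iterable of (source_index, target_index) pairs.
    n_target:    number of words in the target (Tamil) sentence.
    default:     tag given to target words with no aligned source word.
    """
    votes = [Counter() for _ in range(n_target)]
    for src, tgt in alignments:
        votes[tgt][source_tags[src]] += 1

    projected = []
    for counter in votes:
        if counter:
            # Keep the tag proposed by the largest number of aligned words.
            projected.append(counter.most_common(1)[0][0])
        else:
            projected.append(default)
    return projected


if __name__ == "__main__":
    # Toy sentence: three source words, two target words.
    tags = ["ADJ", "NN", "VRB"]
    links = [(0, 0), (1, 0), (1, 0), (2, 1)]
    print(project_tags(tags, links, n_target=2))
```

A vote of this kind makes the one-to-many problem visible: when a single agglutinated Tamil word is linked to several source words with different tags, only one tag survives, which is one plausible source of the projection noise discussed in the abstract.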