An Attempt at Multilingual POS Tagging for Tamil

Madhu Ramanathan, Vijay Chidambaram, Ashish Patro
Department of Computer Sciences
University of Wisconsin Madison

Abstract

Part of Speech (POS) tagging is the process of assigning a syntactic category to every word in a corpus. In our project we apply supervised and unsupervised methods of POS tagging to a multilingual parallel corpus for Tamil, an agglutinative language of ancient Dravidian origin. The multilingual parallel corpus includes three other languages, namely Hindi, English and French. We experimented on monolingual, bilingual and multilingual corpora using various models and techniques such as the HMM model, SVM model, CRF model and the Projection and Probability Re-estimation technique (Yarowsky, 2001), and did a detailed performance comparison in an attempt to capture the properties of the language that aid in increasing accuracy for POS tagging. Supervised CRF modeling using a variety of features on a monolingual Tamil corpus revealed that word-specific features such as prefixes and suffixes produce an increase of 10%, the highest among all combinations of features. Bilingual and multilingual learning shows that the addition of other languages generally produces a decrease in accuracy, mainly because of the one-to-many association among the words and also because of the drop in accuracy at every stage of the pre-processing steps involved in accomplishing the word-level pairing. The results of our experiments clearly reflect the relatively free word order and agglutinative nature of the Tamil language and motivate the need for a morpheme-based POS tagger to attain greater accuracy.

1 Introduction

Part of speech (POS) tagging is the process of assigning a part of speech or other lexical class marker to each and every word in a sentence. POS tagging is an essential part of many applications like speech recognition, natural language parsing, information retrieval and machine translation.

Our aim is to perform POS tagging for Tamil, a Dravidian language spoken in the southern part of India that has existed for over two thousand years. Tamil and Sanskrit are considered the two longest surviving classical languages in India, from which the other Dravidian and Indo-Aryan languages have been derived. Tamil also has a rich set of literary works like the Thirukkural which have been manually translated into a number of languages. Our aim is to use such parallel corpora to build a method that improves the accuracy of existing taggers, which can then be used for other applications like automatic machine translation, speech recognition and parsing.

Tamil uses an agglutinative grammar with relatively free word order, where suffixes are used to mark noun class, number, and case, verb tense and other grammatical categories. Tamil words consist of a lexical root to which one or more affixes are attached. Most Tamil affixes are suffixes. Tamil suffixes are of two types: derivational suffixes, which change either the part of speech of the word or its meaning, and inflectional suffixes, which mark categories such as person, number, mood, tense, etc. There is no absolute limit on the length and extent of agglutination, which can lead to long words with a large number of suffixes (Tamil, Wikipedia). Much of Tamil grammar is extensively described in the oldest known grammar book for Tamil, the Tolkāppiyam.

The agglutinative nature of Tamil makes tagging a complex process. Various methodologies, both statistical and rule based, have been developed and widely used for POS tagging in different languages. Because Tamil is a free form language with a large variety of morphological combinations, inflections and exceptions, developing a rule based method for it would require a lot of effort and extensive knowledge of its complex grammatical structures, which makes it almost impractical. Supervised statistical methods require a large, reliably annotated corpus for training. At the same time, a considerably large amount of sentence-aligned parallel data (UDHR corpora, Bible corpora, Thirukkural corpora, TV news, newspaper articles, etc.) is available in a number of languages that we can put to use for this purpose. A large number of those languages, such as the European languages, have pre-trained POS taggers that can be used to label the text in those languages. Considering these factors, we tried to address three main questions:

• When trained on a monolingual corpus, what properties/features of the language contribute to increasing the POS tagging accuracy?
• Does the addition of one or more languages from a parallel corpus help in increasing the POS tagging accuracy? If the addition of languages does improve the tagging accuracy, then are there any specific properties of the language being paired that lead to an increase in accuracy?

As a means to find the answers to these questions we experimented with monolingual, bilingual and multilingual corpora using various methods such as the SVM model, HMM model, CRF model and the bilingual projection and probability re-estimation method (Yarowsky, 2001). The languages that we chose are Hindi, English and French. Tamil follows an SOV word order and we chose Hindi as it is a well studied Indian language with the same word order. We also chose two other languages that have the SVO word order, namely English and French, to see how much the word order property influences the accuracy of the results.

The remainder of the paper is organized into 5 sections. Section 2 deals with the related work, section 3 talks about the method, section 4 about the experiments and analysis and section 5 gives the concluding remarks.

2 Related Work

Tamil is one of the classical Indian languages and has a very strong linguistic base with a well defined set of morpho-syntactic rules. However, parsing, development of parsing models, chunking, generation of treebanks, POS tagging, morphological analysis, and development of semi-automated and automated tools for these processes in Tamil are at a nascent stage. Existing work on POS tagging is based on morphological analyzers built by Vasu Renganathan (Renganathan, 2001), Ganesan, and RCILTS-T. Due to their constraints, limited coverage of morpho-syntactic and semantic rules, non-availability of methodologies for large scale development of parsing models, non-availability of standards, non-applicability of statistical methods and resource deficiency, the reported tools cannot be used directly for all types of NLP applications. These existing tools have been developed using rule based approaches. However, rule based techniques cannot fully address all inflectional and derivational word forms and peculiar characteristics like relatively free word order, syntax with semantics, and long distance relationships. Only moderate accuracy can be achieved with rule based techniques. This motivates the need for a statistical approach to POS tagging in Tamil.

Various methods for bilingual POS tagging such as projection and induction have been used to train highly accurate part-of-speech taggers (Yarowsky, 2001) for languages such as Vietnamese (Dieng, 2003). As one of our methods we use Yarowsky's robust projection and probability re-estimation technique to learn the POS tags for Tamil in a semi-supervised manner. There has been some recent work on bilingual (Snyder, 2008) and multilingual learning (Snyder, 2009) where the results show that adding languages generally increases the accuracy when unsupervised learning is done. There has been one attempt at a bilingual rule based POS tagger for Tamil using projection and induction techniques that reports an increase in performance (Selvam, 2009). However, we aim for a purely statistical approach to POS tagging that does not require any prior knowledge of the grammar rules.

3 Methodology

We used the Universal Declaration of Human Rights (UDHR) corpus, which has been translated into over 300 languages, for our experimentation (UDHR, UDHRcorpus). The UDHR corpus consists of 75 lines of short text translated in all the 300 languages, of which we chose the text for our set of languages: Tamil, Hindi, English and French. The following sections describe in detail the preprocessing step and the monolingual, bilingual and multilingual learning approaches that we experimented with.

3.1 Preprocessing

Before working on this data, we applied a preprocessing step to make it usable for our experiments. We arranged the text by pairing the Tamil text with each of the other 3 languages, giving a total of 3 language pairs. Sentence alignment was done using Microsoft Research's Bilingual Sentence Aligner tool (Microsoft, 2003). The sentence-aligned files were given to the GIZA++ word aligner and the union method was used to obtain the word alignments (Giza, 1999). The union method was chosen over the intersection method, which would give a 1-1 pairing, because Tamil is an agglutinative language and, when paired with languages that do not possess that property, the intersection method of word alignment would yield very low recall.

The UDHR corpus is plain text without any POS tagging done for the words. For well studied languages like Hindi, English and French we used existing pre-trained taggers. For Hindi we used the tagger developed by the Society for Natural Language Technology Research and for English and French we used the TreeTagger tool (TreeTagger, 1994). For Tamil, as no such pre-trained tagger was in a usable form, we had to hand tag the corpus. Table 1 shows the set of 12 tags used for tagging the Tamil corpus. These tags were chosen as they were the frequently occurring tags that also appear in other languages. We tried to perform this tagging to the best of our ability, though some errors may have been introduced in this step. These tags were used as the gold standard for all our experiments.

Tag     Description
NN      Noun
CNN     Compound Noun
PRN     Pronoun
CPRN    Compound Pronoun
VRB     Verb
ADJ     Adjective
ADV     Adverb
CONJ    Conjunction
PP      Preposition
NUM     Number
X       Others
P       Punctuation marks

Table 1: Tagset used for Tamil corpus

3.2 Monolingual Supervised learning

In this method we use the monolingual Tamil corpus alone with various supervised learning methods, to estimate the maximum accuracy that can be obtained using a single language and also to find out which features of the language aid in increasing the tagging accuracy. For this purpose we split the dataset into training and test sets. The training set comprised 80% of the lines while the testing set comprised 20% of the lines. Since the corpus was small we used 10-fold cross validation to estimate the accuracies. We trained on it using three well known models, namely the Hidden Markov Model (HMM), Support Vector Machines (SVM) and Conditional Random Fields (CRF).
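The cross-validation protocol described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `k_fold_splits`, the shuffling seed and the interleaved fold assignment are assumptions made for the example; only the corpus size (75 lines) and the number of folds (10) come from the text.

```python
import random


def k_fold_splits(n_lines, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Hypothetical helper: the paper does not say how folds were assigned.
    """
    indices = list(range(n_lines))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i,
    # so fold sizes differ by at most one line.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)


# The UDHR corpus has 75 lines; each line is held out exactly once.
splits = list(k_fold_splits(75, k=10))
```

With k = 10 each held-out fold contains roughly 10% of the lines, whereas k = 5 would reproduce the 80/20 proportion mentioned above. How the two were combined in the experiments is not stated, so the fold count is left as a parameter.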
3.2.1 Hidden Markov Model (HMM)

We used a bigram HMM together with the Viterbi algorithm to tag the corpus. A maximum likelihood estimator was used to determine the emission and transition parameters. The transition and emission parameters were calculated as follows:

$$P(t \mid t') = \frac{\mathrm{count}(t', t)}{\mathrm{count}(t')}, \qquad P(w \mid t) = \frac{\mathrm{count}(t, w) + \delta}{\mathrm{count}(t) + |V|\,\delta} \tag{1}$$

where $\delta$ is an additive smoothing constant and $|V|$ is the vocabulary size. After determining the emission and transition probabilities, the probability of a given tag sequence $s$ for a given word sequence $w$ was determined using the following formula:

$$P(s, w) = \prod_i P(t_i \mid t_{i-1})\, P(w_i \mid t_i)$$

3.2.2 Support Vector Machines

We used SVMTool, a general POS tagger based on Support Vector Machines, to train and test on our corpus. The tool offers several tagging modes, each bringing a little more complexity into the tagging. We used a set of six strategies to determine the one that gives the maximum accuracy. The six strategies are listed in Table 2.

Strategy   Description
0          one-pass default strategy
1          two-pass revisiting results and relabeling
2          one-pass robust against unknown words
4          one-pass very robust against unknown words
5          one-pass sentence-level likelihood
6          one-pass robust sentence-level likelihood

Table 2: Strategies used in the SVM Model

3.2.3 Conditional Random Fields

For the conditional random fields we used the CRF++ tool, which is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. The CRF++ tool allows us to define our own set of features. It requires the training and testing files to be in a specific format. It also requires us to define a template file specifying the unigram and bigram features. For every unigram and bigram feature specified in the template file, the tool converts it into a set of binary feature functions associating the specified feature with the output category. Using this tool we built our training and testing files in the required formats, and modelled and tested on a variety of combinations of features. The combinations of features are listed in Table 3.

Feature   Description
1         Actual word
2         1 Previous word + Actual word
3         2 Previous words + Actual word
4         2 Previous words + Actual word
5         4 Previous words + Actual word
6         1 Next word + Actual word
7         2 Next words + Actual word
8         3 Next words + Actual word
9         1 Previous word + 1 Next word + Actual word
10        1 Prefix + Actual word
11        2 Prefixes + Actual word
12        Prefixes + 2 Suffixes + Actual word
13        Prefixes + 4 Suffixes + Actual word
14        Prefixes + 5 Suffixes + Actual word

Table 3: Feature sets used in monolingual learning

From the results obtained, we try to determine the features that give the maximum increase in accuracy for POS tagging.

3.3 Bilingual Learning

3.3.1 Supervised

For the supervised method of bilingual learning we used the same CRF++ tool described above. Tamil was paired with each of the other three languages separately and the tags from the foreign language were projected onto the Tamil words using the word alignments. Then the training and testing files for the CRF++ tool were prepared and the template files were created considering the various combinations of possible features that could affect the accuracy of tagging. The feature sets that we tested on are given in Table 4.

3.3.2 Semi-Supervised

For this we used the projection and aggressive tag probability re-estimation technique (Yarowsky, 2001). We used POS tag projection from an input language (e.g. English) to Tamil using the word alignments computed during the pre-processing