High Order Conditional Random Field Based Part of Speech Tagger and Ambiguity Resolver for Malayalam, a Highly Agglutinative Language

Bindu. M.S
Sumam Mary Idicula


Part-of-speech (POS) tagging, also called grammatical tagging, assigns a lexical class marker to every word in a document. It is an essential preprocessing step in many NLP systems, and tagged corpora play a significant role in machine translation, information retrieval, and data mining. POS tagging in Malayalam is difficult because the language is agglutinative and 80-85% of the words in Malayalam text documents are compound words. Decomposing these words into their constituents is necessary for finalizing their POS tags. Sometimes a single word admits more than one morphological analysis, and hence more than one POS tag. Correctly resolving this kind of ambiguity for each occurrence of the word is crucial in many NLP applications. Currently available tag sets for other languages give importance only to the morphological and syntactic properties of the language, whereas the tag set designed by us considers the semantic features of the language. To test the system, documents were selected from well-known Malayalam newspapers and magazines. Up to 2352 sentences were tested, including simple, complex, and compound sentences. A word-level tagging accuracy of 95% and a sentence-level accuracy of 91% were obtained.



Keywords: POS Tag set, finite state transducer, compound word splitter, Extended CRF, Malayalam compound word

