Part-of-Speech Tagging with Limited Training Corpora
Robert Staubs
2004-2005

Abstract

The aim of this project is to create and analyze various methods of part-of-speech (POS) tagging. The corpora used are of extremely limited size, which offers less occasion to rely entirely upon tagging patterns gleaned from predigested data. Methods used to analyze the data and resolve tagging ambiguities include Hidden Markov Models and Bayesian Networks. Results are analyzed by comparing the system-tagged corpus with a professionally tagged one.

Background

Many different methods of POS tagging have been advanced in the past, but none gives hope of "perfect" tagging at the current stage. Most methods in current use achieve accuracy of over 90% on ambiguous words, and often well above that. POS taggers cannot at the current time mimic human methods for distinguishing part of speech in language use. Work to get taggers to approach the problem from all the expected human methods (semantic prediction, syntactic prediction, lexical frequency, and syntactic category frequency being the most prominent) has not yet reached full fruition.

Procedure

Training data consist of the tags represented in the corpus, the words represented in the corpus, the transitions represented in the corpus, and the frequency of each. Words and tags are read in from the corpus and stored alphabetically or in parallel in a series of arrays and matrices. These data form the basis for the statistical information that taggers extract when deciding on a unit's tag.

Expected/Actual Results

Expected results are accuracy of over 90%. The effects of varying the methods are largely unknown. Actual results are as yet unknown.

Result Chart

No results yet, therefore no result chart.

There is extra space here for further expansion of these categories and others, as well as diagrams that may be added later. Id est, "this space intentionally left blank."
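
The training data described under Procedure (tag, word, and transition frequencies) could be collected roughly as follows. This is only a minimal sketch, not the project's actual implementation: it uses Python dictionaries rather than the arrays and matrices mentioned above, and the function names, the "<s>" sentence-start marker, and the tiny example corpus are all hypothetical. It also includes a lexical-frequency baseline that picks the tag most often seen with a word; it does not implement the Hidden Markov Model or Bayesian Network methods themselves.

```python
from collections import Counter

START = "<s>"  # hypothetical sentence-start marker, not from the original poster


def collect_counts(tagged_sentences):
    """Count tags, (word, tag) pairs, and tag-to-tag transitions."""
    tag_counts = Counter()
    word_tag_counts = Counter()
    transition_counts = Counter()
    for sentence in tagged_sentences:
        previous_tag = START
        for word, tag in sentence:
            tag_counts[tag] += 1
            word_tag_counts[(word.lower(), tag)] += 1
            transition_counts[(previous_tag, tag)] += 1
            previous_tag = tag
    return tag_counts, word_tag_counts, transition_counts


def most_frequent_tag(word, tag_counts, word_tag_counts):
    """Lexical-frequency baseline: the tag seen most often with this word."""
    candidates = {
        tag: count
        for (seen_word, tag), count in word_tag_counts.items()
        if seen_word == word.lower()
    }
    if not candidates:
        # Unseen word: fall back to the most common tag in the corpus.
        return tag_counts.most_common(1)[0][0]
    return max(candidates, key=candidates.get)


# Hypothetical four-sentence corpus, for illustration only.
corpus = [
    [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("The", "DET"), ("bark", "NOUN"), ("peels", "VERB")],
    [("Dogs", "NOUN"), ("bark", "VERB")],
    [("The", "DET"), ("bark", "NOUN")],
]

tag_counts, word_tag_counts, transition_counts = collect_counts(corpus)
print(most_frequent_tag("bark", tag_counts, word_tag_counts))
```

With a corpus this small, many words and transitions are never observed, which is the limited-data difficulty the project sets out to study. A tagger built on these counts would need some policy for unseen events; the fallback to the most common tag used here is only one simple option.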