Part-of-Speech Tagging with Limited Training Corpora
Robert Staubs
2004-2005
Abstract
Background
Procedure
Expected/Actual Results
Result Chart
The aim of this project is to create and analyze various methods of part-of-speech
tagging. The corpora used are of extremely limited size, offering less occasion to rely
entirely upon tagging patterns gleaned from predigested data. Methods used to analyze the data and resolve tagging ambiguities include Hidden Markov Models and Bayesian
Networks. Results are analyzed by comparing the system-tagged corpus with a professionally tagged one.
Many different methods of POS tagging have been advanced in the past, but none yet gives hope of "perfect" tagging. Accuracy of over 90% on ambiguous words is typical for methods in current use, and many well exceed that figure. POS taggers cannot at present mimic human methods for distinguishing part of speech in language use.
Work to get taggers to approach the problem from all the expected human methods
(semantic prediction, syntactic prediction, lexical frequency, and syntactic category
frequency being the most prominent) has not yet reached full fruition.
No results yet, therefore no result chart.
Training data consist of the tags
represented in the corpus, the words represented in the corpus, the transitions represented in the
corpus, and the frequency of each. Words and tags are read in from the corpus and stored
alphabetically or in parallel in a series of arrays and matrices. These data form the basis for the statistical information extracted by taggers for making decisions on a unit's tag.
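As an illustration of the training data described here, the following is a minimal sketch of how tag, word, and transition frequencies might be gathered from a tagged corpus. It is not the project's actual code: the function name `collect_counts`, the `"<s>"` sentence-start marker, the toy corpus, and its tag names are all assumptions made for the example, and dictionaries stand in for the arrays and matrices mentioned above.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for the start of a sentence


def collect_counts(tagged_sentences):
    """Count tags, word/tag pairs, and tag-to-tag transitions.

    tagged_sentences is a list of sentences, each a list of (word, tag) pairs.
    """
    tag_counts = Counter()                    # frequency of each tag
    word_tag_counts = defaultdict(Counter)    # word -> tag -> frequency
    transition_counts = defaultdict(Counter)  # previous tag -> tag -> frequency
    for sentence in tagged_sentences:
        previous = START
        for word, tag in sentence:
            tag_counts[tag] += 1
            word_tag_counts[word.lower()][tag] += 1
            transition_counts[previous][tag] += 1
            previous = tag
    return tag_counts, word_tag_counts, transition_counts


# Toy corpus invented for this example; "runs" is ambiguous between two tags.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("runs", "NOUN"), ("end", "VERB")],
]
tag_counts, word_tag_counts, transition_counts = collect_counts(corpus)
```

Counts of this kind are the raw material from which a tagger can estimate how likely a word is to carry a given tag and how likely one tag is to follow another.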
The expected result is accuracy of over 90%. The effects of the different methods are
largely unknown.
Actual results are as yet unknown.
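The comparison between the system-tagged corpus and a professionally tagged one could be computed along the following lines. This is a sketch under stated assumptions, not the project's evaluation code: the function name `tagging_accuracy`, the optional `ambiguous` mask used to restrict scoring to ambiguous words, and the example tags are all hypothetical.

```python
def tagging_accuracy(system_tags, gold_tags, ambiguous=None):
    """Return the fraction of tokens whose system tag matches the gold tag.

    If ambiguous is given, it is a list of booleans the same length as the
    tag lists, and only tokens marked True are scored.
    """
    if len(system_tags) != len(gold_tags):
        raise ValueError("tag sequences must have the same length")
    pairs = list(zip(system_tags, gold_tags))
    if ambiguous is not None:
        if len(ambiguous) != len(pairs):
            raise ValueError("ambiguity mask must match the tag sequences")
        pairs = [pair for pair, keep in zip(pairs, ambiguous) if keep]
    if not pairs:
        return 0.0  # nothing to score
    correct = sum(1 for system, gold in pairs if system == gold)
    return correct / len(pairs)


# Invented example: four tokens, the last two of which are ambiguous words.
system = ["DET", "NOUN", "VERB", "NOUN"]
gold = ["DET", "NOUN", "VERB", "VERB"]
overall = tagging_accuracy(system, gold)
on_ambiguous = tagging_accuracy(system, gold, [False, False, True, True])
```

Scoring only the ambiguous tokens gives a stricter figure than overall accuracy, since unambiguous words are tagged correctly by lookup alone.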
There is extra space here for further expansion of these categories and
others as well as diagrams that may be added later.
Id est, “this space intentionally left blank.”