Computer Systems Project Proposal
Karen Shiells

I. Title:
Creation of a Russian-English Translation Program

II. Problem Statement:
The goal of the project is to create a machine translation program that can identify structural attributes of unknown words in the source text.

III. Purpose:
The primary goal of the project is to perform passable machine translation. The translation algorithm will be broken into a lexical scanner to interpret the source text, a syntactic transfer algorithm for the actual translation, and a text constructor for formation of the output. The main distinction between this project and commonly available translation algorithms is the intended audience. While machine translation available on the internet is generally designed for individual users who have no knowledge of the source language and low standards for the translation, this project will produce a program intended for users who are familiar with both languages and can provide useful input during translation in order to avoid major errors in the output. To this end, the translator will be able to identify unknown words and their structural properties and present this information to the user, who can verify it and supply a literal translation. Since the program will ask for aid from the user in interpreting the source language, the finished product will only be useful to those who, at a minimum, know the source language and, ideally, are familiar with the target language as well. Professional translators, whether working independently or with companies, would benefit from a program that checks its interpretations rather than providing a quick but low-quality translation. Language students, too, would be able to use this resource while still thinking about the texts themselves, rather than plugging text into an automatic translator.

IV. Scope of Study:
The program, since it will request user input, will not necessarily need to identify antecedents, grammar structures, and other information as consistently or correctly as programs that attempt to translate independently. Also, since the project seeks to identify attributes of unknown words, time spent on the translation algorithm itself should be minimal until all parts of the project have at least been attempted. Many teams of programmers with far more human and capital resources have spent much longer than the time allotted for this project attempting to produce effective machine translation. This project does not attempt to surpass them, only to experiment with translation algorithms and add part-of-speech recognition to a more primitive translator. Additionally, the range of grammatical structures, dictionary entries, and particularly idiomatic expressions understood and used by the program will be limited. If time remains at the end, a few more may be added for purposes of experimentation. For the most part, however, the program will parse and produce only simple sentences, representing the most literal translations from the source language to the target.

V. Background:
Almost all of the research performed in either machine translation or the broader field of computational linguistics could be incorporated into a project similar to this one. Much research has been done on part-of-speech analysis, and far more has been conducted on various methods of machine translation. This project will limit itself to translation by syntactic transfer and word identification using the syntactic structures stored in its dictionary.
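The Purpose above breaks the translator into a lexical scanner, a syntactic transfer step, and a text constructor. A minimal Java sketch of that decomposition is given below; all class, interface, and method names are hypothetical illustrations rather than the project's actual code.

    import java.util.List;

    class Token {                        // one source word plus its dictionary information
        String surfaceForm;              // the word as it appears in the Russian text
        String partOfSpeech;             // e.g. "noun" or "verb"; null if the word is unknown
    }

    class DependencyNode {               // a node in the intermediary dependency tree
        Token word;
        List<DependencyNode> dependents;
    }

    interface LexicalScanner {
        // Tokenize the source text and attach dictionary information to each token.
        List<Token> scan(String sourceText);
    }

    interface SyntacticTransfer {
        // Build a dependency tree from the tokens and restructure it to remove
        // grammatical peculiarities of the source language.
        DependencyNode transfer(List<Token> tokens);
    }

    interface TextConstructor {
        // Generate English text from the restructured tree.
        String generate(DependencyNode root);
    }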
The earliest machine translation algorithms relied on a direct translation approach. They were, essentially, specialized systems, translating directly from one language to another based on a literal dictionary and common structural changes between the two languages, such as shifting from subject-verb-object to subject-object-verb sentence order. Because such algorithms do not even parse the original sentence, the output is frequently nonsensical, generally difficult for even humans to parse, and unable to accommodate complex structures. Though direct translation sometimes produces acceptable results for similar languages, such as French and English, it becomes increasingly incomprehensible as it is expanded to additional languages.

From the age of international languages comes the set of interlingua translators. The main advantage of interlingua approaches is the increased flexibility provided by the intermediate representation. Interlingua systems choose a representation, often a constructed language or a grammatical diagram, to use as an intermediary. They first translate the source into the intermediary, using algorithms specific to the source language, and then translate from the intermediary to the target, again using a specialized algorithm. This approach, however, still relies on specific programs for each language, and the extra language introduces an additional set of translation errors. The wider the range of languages, the more general the intermediary needs to be. As multilingual systems grow, accurate translation becomes harder to accomplish because forms common to all supported languages are more difficult to find.

The syntactic transfer method is a more specific counterpart of interlingua. Syntactic transfer first uses a language-specific parser to generate a structural tree from the source text. The structural tree is then rearranged to remove any grammatical peculiarities of the original language. Using the resulting tree, a second program generates a text in the target language. Like interlingua, syntactic transfer provides an intermediary representation of the text, though it is specific to the particular language pair. Though syntactic transfer can produce the best-quality translations, it is relatively new and not as practical for programs with many language pairs. Syntactic transfer has not been the subject of as much experimentation as the other methods, partially because of legacy code, but also because it is less practical for the more profitable multilingual translators.

Because only one language pair will be translated, this project will use a primarily transfer-based approach. Whereas other research projects in the Computer Systems laboratory have been largely based on direct translation, this project will place its emphasis on parsing the input rather than on immediately expanding the dictionary. A syntactic transfer approach can, by generalization, be expanded into an interlingua approach, and this project will attempt to remain flexible enough that it may later be adapted to use interlingua. As an aid for parsing, and for identification of unknown words, the program will use a valency dictionary, that is, a dictionary that provides the number and types of complements that each verb takes.
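A minimal sketch, assuming a simple Java representation of one valency dictionary entry, is given below. The field names and the sample entry for the verb "давать" ("to give") are illustrative assumptions, not the project's actual dictionary format.

    import java.util.List;

    class ValencyEntry {
        String lemma;                 // dictionary form of the Russian verb
        String literalTranslation;    // English gloss
        List<String> complements;     // the complements the verb takes

        ValencyEntry(String lemma, String literalTranslation, List<String> complements) {
            this.lemma = lemma;
            this.literalTranslation = literalTranslation;
            this.complements = complements;
        }
    }

    class ValencyEntryExample {
        public static void main(String[] args) {
            // "давать" (to give) takes a nominative subject, an accusative direct
            // object, and a dative indirect object.
            ValencyEntry give = new ValencyEntry(
                    "давать", "to give",
                    List.of("nominative subject", "accusative object", "dative recipient"));
            System.out.println(give.lemma + " -> " + give.literalTranslation + " " + give.complements);
        }
    }

Storing the required complements with each verb is what lets the parser check whether the words surrounding an unknown word fill an expected slot, which is how the structural attributes of unknown words could be predicted.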
To aid valency analysis, the parser will use a dependency grammar, which produces a tree in which nouns are dependent on verbs, adjectives are dependent on nouns, and so on, rather than a phrase-structure representation, which separates words into noun phrases, verb phrases, and the like.

VI. Procedure and Methodology:
The first program necessary for the project to proceed will be the dictionary, hopefully including a simple Java program to help review and generate entries. Since the parser relies so heavily on the dictionary and the details of its entries, the dictionary will need to be fully programmed before work on the translator itself can proceed. After the dictionary, the lexical analysis algorithm can be written, probably still in Java. The syntactic analyzer will be written in either Python or Java, and will generate and store the dependency tree that will form the intermediary representation. Finally, the target language generator, again in Java, will create the English text from the tree. After these are all operating on a basic level, the preliminary version of the unknown word analyzer can be added to the syntactic analysis stage (a minimal sketch of the intended user interaction appears at the end of this proposal).

In the second stage of development, attributes, including a separate program to manage attribute relations, can be added. Initially, part of speech and possibly gender will be the only attributes included. Additional attributes, such as verb of motion, animacy, and other specifiers, can make translation more accurate and more likely to select the correct interpretation of each sentence. Both the lexical and syntactic analysis stages, as well as word identification, would need to be updated to include attributes. If the project reaches this point, the dictionary and parsing algorithms can be improved independently for as long as time allows.

VII. Expected Results:
The final result of the project will be a working, if limited, Russian-English translator. Because it uses dependency grammar rather than phrase-structure grammar, and predicts structures for unknown words rather than omitting them, it represents an unusual approach to translation. Its success or failure should demonstrate the usefulness of the algorithms I have chosen. In addition to its research value, an interactive and basically functional Russian-English translator would be a useful resource for translators, particularly with a dictionary that is easily manipulated by the user. While the main focus of the project is on the translator, the dictionary itself, if it includes a useful GUI, could be a valuable resource for Russian students. Hopefully, the interactions between the lexical analysis algorithm and the dictionary will remain efficient enough that the dictionary can be expanded without massive memory expenditures. Since the dictionary will include part of speech, valency structures, and hopefully attributes of the words, as well as literal translations, it should provide even more information than a human translator would need.
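As an illustration of the unknown-word interaction described in the Purpose and Procedure above, the following minimal Java sketch shows how the analyzer might present a guessed part of speech to the user and record a literal translation. All names, and the console-based interaction itself, are hypothetical assumptions; the actual analyzer will base its guesses on the valency structures stored in the dictionary.

    import java.util.Scanner;

    class UnknownWordPrompt {

        // Ask the user to verify a guessed part of speech and supply a translation.
        static String[] resolve(String surfaceForm, String guessedPartOfSpeech) {
            Scanner in = new Scanner(System.in);
            System.out.printf("Unknown word '%s' appears to be a %s.%n", surfaceForm, guessedPartOfSpeech);
            System.out.print("Correct part of speech (press Enter to accept the guess): ");
            String partOfSpeech = in.nextLine().trim();
            if (partOfSpeech.isEmpty()) {
                partOfSpeech = guessedPartOfSpeech;
            }
            System.out.print("Literal English translation: ");
            String translation = in.nextLine().trim();
            return new String[] { partOfSpeech, translation };
        }

        public static void main(String[] args) {
            // Example: the word "собака" (dog) is not in the dictionary and is guessed to be a noun.
            String[] confirmed = resolve("собака", "noun");
            System.out.println("Recorded: " + confirmed[0] + ", \"" + confirmed[1] + "\"");
        }
    }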