Computer Systems Project Proposal
Karen Shiells

I. Title:
Creation of a Russian-English Translation Program

II. Problem Statement:
The goal of the project is to create a machine translation program that can identify structural attributes of unknown words in the source text.

III. Purpose:
The primary goal of the project is to perform passable machine translation. The translation algorithm will be broken into a lexical scanner to interpret the source text, a syntactic transfer algorithm for the actual translation, and a text constructor for formation of the output. The main distinction between this project and commonly available translation algorithms is the intended audience. While machine translation available on the internet is generally designed for individual users who have no knowledge of the source language and low standards for the translation, this project will produce a program intended for users who are familiar with both languages and can provide useful input during translation in order to avoid major errors in the output. To this end, the translator will be able to identify unknown words and their structural properties and present this information to the user, who can verify it and supply a literal translation. Since the program will ask for aid from the user in interpreting the source language, the finished product will only be useful to those who, at a minimum, know the source language and, ideally, are familiar with the target language as well. Professional translators, whether working independently or with companies, would benefit from a program that checks its interpretations rather than providing a quick but low-quality translation. Language students, too, would be able to use this resource while still thinking about the texts themselves, rather than plugging text into an automatic translator.

IV. Scope of Study:
The program, since it will request user input, will not necessarily need to identify antecedents, grammar structures, and other information as consistently or correctly as programs that attempt to translate independently. Also, since the project seeks to identify attributes of unknown words, time spent on the translation algorithm itself should be minimal until all parts of the project have at least been attempted. Many teams of programmers with far more human and capital resources have spent much longer than the time allotted for this project attempting to produce effective machine translation. This project does not attempt to surpass them, only to experiment with translation algorithms and add part-of-speech recognition to a more primitive translator. Additionally, the range of grammatical structures, dictionary entries, and particularly idiomatic expressions understood and used by the program will be limited. If time remains at the end, a few more may be added for purposes of experimentation. For the most part, however, the program will parse and produce only simple sentences, representing the most literal translations from the source language to the target.

V. Background:
Almost all of the research performed in either machine translation or the broader field of computational linguistics could be incorporated into a project similar to this one. Much research has been done on part-of-speech analysis, and far more has been conducted on various methods of machine translation. This project will limit itself to translation by syntactic transfer and word identification using the syntactic structures stored in its dictionary.
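The Purpose above breaks the translator into a lexical scanner, a syntactic transfer step, and a text constructor. A minimal Java sketch of that decomposition is given below; all class, interface, and method names are hypothetical illustrations rather than the project's actual code.

    import java.util.List;

    class Token {                        // one source word plus its dictionary information
        String surfaceForm;              // the word as it appears in the Russian text
        String partOfSpeech;             // e.g. "noun" or "verb"; null if the word is unknown
    }

    class DependencyNode {               // a node in the intermediary dependency tree
        Token word;
        List<DependencyNode> dependents;
    }

    interface LexicalScanner {
        // Tokenize the source text and attach dictionary information to each token.
        List<Token> scan(String sourceText);
    }

    interface SyntacticTransfer {
        // Build a dependency tree from the tokens and restructure it to remove
        // grammatical peculiarities of the source language.
        DependencyNode transfer(List<Token> tokens);
    }

    interface TextConstructor {
        // Generate English text from the restructured tree.
        String generate(DependencyNode root);
    }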
The earliest machine translation algorithms relied on a direct translation approach. They were, essentially, specialized systems, translating directly from one language to another based on a literal dictionary and common structural changes between the two languages, such as shifting from subject-verb-object to subject-object-verb sentence order. Because such algorithms do not even parse the original sentence, the output is frequently nonsensical, generally difficult for even humans to parse, and unable to accommodate complex structures. Though direct translation sometimes produces acceptable results for similar languages, such as French and English, it becomes increasingly incomprehensible as it is expanded to additional languages.

From the age of international languages comes the set of interlingua translators. The main advantage of interlingua approaches is the increased flexibility provided by the intermediate representation. Interlingua systems choose a representation, often a constructed language or a grammatical diagram, to use as an intermediary. They first translate the source into the intermediary, using algorithms specific to the source language, and then translate from the intermediary to the target, again using a specialized algorithm. This approach, however, still relies on specific programs for each language, and the extra language introduces an additional set of translation errors. The wider the range of languages, the more general the intermediary needs to be. As multilingual systems grow, accurate translation becomes harder to accomplish because forms common to all supported languages are more difficult to find.

The syntactic transfer method is a more specific counterpart of interlingua. Syntactic transfer first uses a language-specific parser to generate a structural tree from the source text. The structural tree is then rearranged to remove any grammatical peculiarities of the original language. Using the resulting tree, a second program generates a text in the target language. Like interlingua, syntactic transfer provides an intermediary representation of the text, though it is specific to the particular language pair. Though syntactic transfer can produce the best-quality translations, it is relatively new and not as practical for programs with many language pairs. Syntactic transfer has not been the subject of as much experimentation as the other methods, partially because of legacy code, but also because it is less practical for the more profitable multilingual translators.

Because only one language pair will be translated, this project will use a primarily transfer-based approach. Whereas other research projects in the Computer Systems laboratory have been largely based on direct translation, this project will place its emphasis on parsing the input rather than on immediately expanding the dictionary. A syntactic transfer approach can, by generalization, be expanded into an interlingua approach, and this project will attempt to remain flexible enough that it may later be adapted to use interlingua. As an aid for parsing, and for identification of unknown words, the program will use a valency dictionary, that is, a dictionary that provides the number and types of complements that each verb takes.
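A minimal sketch, assuming a simple Java representation of one valency dictionary entry, is given below. The field names and the sample entry for the verb "давать" ("to give") are illustrative assumptions, not the project's actual dictionary format.

    import java.util.List;

    class ValencyEntry {
        String lemma;                 // dictionary form of the Russian verb
        String literalTranslation;    // English gloss
        List<String> complements;     // the complements the verb takes

        ValencyEntry(String lemma, String literalTranslation, List<String> complements) {
            this.lemma = lemma;
            this.literalTranslation = literalTranslation;
            this.complements = complements;
        }
    }

    class ValencyEntryExample {
        public static void main(String[] args) {
            // "давать" (to give) takes a nominative subject, an accusative direct
            // object, and a dative indirect object.
            ValencyEntry give = new ValencyEntry(
                    "давать", "to give",
                    List.of("nominative subject", "accusative object", "dative recipient"));
            System.out.println(give.lemma + " -> " + give.literalTranslation + " " + give.complements);
        }
    }

Storing the required complements with each verb is what lets the parser check whether the words surrounding an unknown word fill an expected slot, which is how the structural attributes of unknown words could be predicted.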
To aid valency analysis, the parser will use a dependency grammar, which produces a tree in which nouns are dependent on verbs, adjectives are dependent on nouns, and so on, rather than a phrase-structure representation, which separates words into noun phrases, verb phrases, and the like.

VI. Procedure and Methodology:
The first program necessary for the project to proceed will be the dictionary, hopefully including a simple Java program to help review and generate entries. Since the parser relies so heavily on the dictionary and the details of its entries, the dictionary will need to be fully programmed before work on the translator itself can proceed. After the dictionary, the lexical analysis algorithm can be written, probably still in Java. The syntactic analyzer will be written in either Python or Java, and will generate and store the dependency tree that will form the intermediary representation. Finally, the target language generator, again in Java, will create the English text from the tree. After these are all operating on a basic level, the preliminary version of the unknown word analyzer can be added to the syntactic analysis stage (a minimal sketch of the intended user interaction appears at the end of this proposal).

In the second stage of development, attributes, including a separate program to manage attribute relations, can be added. Initially, part of speech and possibly gender will be the only attributes included. Additional attributes, such as verb of motion, animacy, and other specifiers, can make translation more accurate and more likely to select the correct interpretation of each sentence. Both the lexical and syntactic analysis stages, as well as word identification, would need to be updated to include attributes. If the project reaches this point, the dictionary and parsing algorithms can be improved independently for as long as time allows.

VII. Expected Results:
The final result of the project will be a working, if limited, Russian-English translator. Because it uses dependency grammar rather than phrase-structure grammar, and predicts structures for unknown words rather than omitting them, it represents an unusual approach to translation. Its success or failure should demonstrate the usefulness of the algorithms I have chosen. In addition to its research value, an interactive and basically functional Russian-English translator would be a useful resource for translators, particularly with a dictionary that is easily manipulated by the user. While the main focus of the project is on the translator, the dictionary itself, if it includes a useful GUI, could be a valuable resource for Russian students. Hopefully, the interactions between the lexical analysis algorithm and the dictionary will remain efficient enough that the dictionary can be expanded without massive memory expenditures. Since the dictionary will include part of speech, valency structures, and hopefully attributes of the words, as well as literal translations, it should provide even more information than a human translator would need.
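As an illustration of the unknown-word interaction described in the Purpose and Procedure above, the following minimal Java sketch shows how the analyzer might present a guessed part of speech to the user and record a literal translation. All names, and the console-based interaction itself, are hypothetical assumptions; the actual analyzer will base its guesses on the valency structures stored in the dictionary.

    import java.util.Scanner;

    class UnknownWordPrompt {

        // Ask the user to verify a guessed part of speech and supply a translation.
        static String[] resolve(String surfaceForm, String guessedPartOfSpeech) {
            Scanner in = new Scanner(System.in);
            System.out.printf("Unknown word '%s' appears to be a %s.%n", surfaceForm, guessedPartOfSpeech);
            System.out.print("Correct part of speech (press Enter to accept the guess): ");
            String partOfSpeech = in.nextLine().trim();
            if (partOfSpeech.isEmpty()) {
                partOfSpeech = guessedPartOfSpeech;
            }
            System.out.print("Literal English translation: ");
            String translation = in.nextLine().trim();
            return new String[] { partOfSpeech, translation };
        }

        public static void main(String[] args) {
            // Example: the word "собака" (dog) is not in the dictionary and is guessed to be a noun.
            String[] confirmed = resolve("собака", "noun");
            System.out.println("Recorded: " + confirmed[0] + ", \"" + confirmed[1] + "\"");
        }
    }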