Wednesday, March 25, 2009

Automatic Translation

All language translators out there currently on the web suck.

One idea for a very effective translator would be to feed a self-teaching AI program tons and tons of documents that already have existing translations, and have it automatically generate rules for proper translation. This would *automatically* accommodate for correct grammar, loose grammar, idioms, jargon, etc. This would require *a lot* of computing power, but it only has to be done *once*. Training documents can be found in existing corpora, or translated by hand specifically for the project. Two possible ways to generate rules would be: genetic algorithms, or some sort of exhaustion of many possible rule formulations (this could be bootstrapped with various types of data, for example, a word-sense->part-of-speech key, and a word-sense->popularity-of-use key.). Incidentally a week after I had this idea I heard a couple of people were working on just such a project, but I've yet to see the fruits of their results anywhere..

Rather than determining rules for translating to and from each possible combination of two languages, it's probably best to come up with *one* language that all languages can be translated to/from with no loss. Just making a collation of linguistic categories for words and clauses in each known language, and using these in an X-bar kinda structure, should be enough. Then any given language would be translated to this intermediate language, and then from that to the target language.

This greatly reduces the costs of furnishing texts for the learning algorithm and of the running it. This intermediate language's lexicon should be the superset of all the senses of all the words of the input languages, but with words identical in grammatical function and alike in meaning grouped into synsets, where each word in a synset is linked to each other word in the synset with a particular weight, which is their level of similarity (this may have to be done by hand). A word in a source text would, via a table, point to a word in some synset (if the word has any synonyms), and then the closest word to that (weight-wise), or the word itself, that some word in the target language points to, would be used.

A problem arises when a language possessing a certain tense/aspect/modality is translated to a language not possessing that. Possible solutions are: compromise and translate to a similar tense/aspect/modality that gets the point across, or totally rearrange the semantics of the sentence in the resultant text. This should not be too difficult given that the algorithm fully groks the grammatical structure of the sentence in the first place. Similarly some words won't exist in all languages. They can be handled by: using a similar-enough word that won't cause too much misunderstanding, or substituting the given word with a phrase (or, in some languages, possibly using agglutination).

Obviously I'm not implying that the semantic rearranging or phrase substitution would be wholly "figured out" by the translator; it would rely on a pre-programmed (or self-learned, via particular patterns found in the training texts) ruleset for such rearrangements.

"Similar-enough" words could be implemented using a weight mechanism just like the one used within a synset, but applying cross-synset/non-synonym. (In fact, we might as well just do away with a categorical consideration of synonym sets altogether.. unless lack of a bona fide synonym is used as a cue to look for a phrasal substitute?) Just enough vague linkages have to be drawn to accomodate all combinations of source/target languages. In fact for the sake of laziness, perhaps unlimited-length chains of weight-linkages could be used, when necessary. I suppose this requires a function for generating an overall priority value based on X number of Y-valued weights. For example, would we favor an a--(1)-->b--(2)-->c link, or an a--(3)-->d link? (1 means highest priority, because there is no bottom limit.) In this case, it would do to specify weights in arbitrary decimals rather than simply orders of priority.

We could effectively have myriad already-made translation texts available for training in this one-language approach, by creating a pretty darn good English<->the-one-language translator, and then using texts translated to and from English (it probably being the most widely translated language), with the English part being converted to/from the one language for the purposes of training. It remains in question how much trouble we should go through, if any, to make the program aware of whether a given training pair was actively translated from X to Y or from Y to X. This goes for the non-one-language approach also.

Machine learning may not be necessary: humans could perhaps construct perfectly comprehensive BNF notations (including idioms) and use a GLR parser, but I don't know how well this will work for (not-so-atypical) taking of grammatical liberties. If this approach is taken, the machine should obviously be programmed to understand affixes so that base words and inflections can be deduced for inflected words that aren't specifically in any dictionary. Another possible adaptation could be Damerau–Levenshtein distance or similar, to account for typos, misspellings, spelling variants, and OCR miscalculations. Also, a list of commonly misused words might also be helpful, though maybe not.

One trick to this translating could be to resolve ambiguous meanings, or connotations, of words in a sentence based on surrounding sentences. Meaning that if the word is used in such a way in a surrounding sentence that it definitely, or probably, means this or that, then we can induce that it probably means this in the given sentence, too. It could even be determined (by the given sentence or by a surrounding sentence) based on some pattern recognitions afforded by the training process. (These may even include subtle and holistic inferences.)

Meaning and grammar resolution can go both ways: grammar can help determine the sense of a word, and a known word sense could help determine the grammar of a sentence.

Connotation inferences (whether being done as-such, or effectively for consideration purposes but not internally tokenized on that level, per se) can even help determine the most germane translation synonym.

We *may* want to even layer our conferencing of meaning-resolution amongst sentences according to paragraph, chapter, document, author/source, and/or genre, but that's probably overkill, beyond just having a sentence-level tier and a document-level tier. Actually genre and source seem to be good too, since they're categories where you'd find words used in particular ways. Oh, I guess a sub-sentence-level tier could be relevant too (because the word could be used twice in the same sentence), but this layer would be treated a little differently of course, since self-contained syntax trees (mostly) start and end at the sentence level.

People can arbitrarily create new words on-the-fly in an agglutinating language. This would be hard for a translator to automatically substitute with defining phrases..but it would be easy to simply use a form of pseudo-agglutination in the given target language; for example, if poltergeist weren't already a well-known and established word, it would be translated into English as "rumble-ghost" or "rumble-spirit." Perhaps a little awkward, but I think it's pretty effective for getting a point across.


Heather said...

I like your idea of making a better language translator. I think that using a system where you input lots of text for the translator to make rules from is interesting. I'd like to see how it works exactly. I'm not sure about your "one" language idea, though. Wouldn't you still be dumbing down the original text quite a bit? There are already so many factors that go into making an accurate translation, that it takes quite a bit of effort to decide how best to translate something. If it had to go through that process twice, wouldn't a lot of the more subtle meanings get lost? I can't help but think that something would get "lost in translation."

inhahe said...

hmm, you might be right about the one language. The thing is, I was thinking that if the language supports all the words, and all the logical types of phrases and words, then nothing could get lost because everything could be expressed. but that's not necessarily true when it comes to subtle meanings that can only be encoded in phraseology/combinations of words. using direct language-to-language translation, some of these subtleties might get through, but using an intermediate language, that *is* one extra step for lossiness.

the advantage of using this intermediate system is great, though. if translations have to be done for the project, this exponentially reduces the number of translations needed. it also exponentially reduces the (relatively astronomical) amount of CPU time required for training. or given a certain amount of training documents, it can actually *increase* the accuracy if they're all to/from the intermediate language, because then it doesn't divide all its pattern learning amongst a dozen totally independent rulesets (one or two for each combination of two languages). so that would totally make up for the extra lossiness. obviously a plethroa of translated texts to/from this one language doesn't exist, but we could use translations to/from English (probably the most widely translated language), and develop an exceptionally good english<->intermediate-language translator.

and usnig an intermediate language might make the difference between making this task a reality or not--and if it's a reality, it might still be better than all existing translation services.

Anonymous said...

Hi friend How are You ? I like your post and i want to stumble it for my friend but i cant see your social bookmark widget in this blog. Please help me admin Thank You Steven