Q&A – All the Word

Question: Why isn't All the Word developing a statistical system like Google Translate and all the other MT projects these days?

These days almost all of the machine translation (MT) projects are using statistical techniques. Systems of that type work reasonably well for pairs of languages that have large parallel corpora for training purposes. But we at ATW have developed a linguistically based rule system for four reasons:

1) Rule based systems produce higher quality translations: Commercial MT systems typically claim that their software increases the productivity of mother-tongue translators by 60 to 75%. For example, the KantanMT site (https://kantanmt.com/customersuccess.php) claims that their software boosts the productivity of their clients by 60 to 70%. However, when experienced mother-tongue translators use the texts produced by our software, their productivity increases by anywhere from 410% to 670%, and that's in tests with small books like Ruth, Esther, and Daniel. If we were to do a test with a large book like Genesis, that figure would be even higher.

2) Rule based systems don't require training corpora of any kind: Statistical systems always need training corpora of some type, and the larger the corpora, the better the results. As KantanMT and all the other companies with statistical systems tell their customers, "The more you train, the more you gain." But the languages that still need the Bible generally have very little literature of any kind. They certainly don't have large parallel corpora for training purposes. Often these languages have no literature at all, and many of them don't even have alphabets. When a missionary translator starts learning a language and begins analyzing the grammar, one of his/her first tasks is to develop an orthography (alphabet, tones, accents, spelling conventions, punctuation, etc.) for the language, and then teach the people how to read and write their own language. Statistical systems can't work for languages of that type, but a rule based system works perfectly fine in that situation.

3) Rule based systems are easily modified to accommodate related languages: We've found that modifying an existing grammar to accommodate a related language is a very quick and easy process. For example, modifying the Tagalog lexicon and grammar to accommodate Ayta Mag-Indi took approximately one sixth of the time that was required to build the initial Tagalog lexicon and grammar. During that modification process, we didn't need to modify any rules in the transfer grammar; we only modified rules in Tagalog's synthesizing grammar. So the rules in a rule based system can easily be modified for related languages. But the statistics produced by a statistical system can't be modified for a related language; each new language requires its own training corpora.

4) Rule based systems give linguists a very good starting point for a publishable grammar: SIL is the world's leader in documenting minority languages. They are respected worldwide because they have produced so many good grammars. Some governments that are hostile to Bible translation are willing to let SIL members in because those governments want to know about the languages spoken by their people. A rule based system produces a set of grammar rules that very accurately describes a language. A grammar developed in ATW's software can be exported, and the resulting document is a very good starting point for a publishable grammar. But statistical systems produce statistics, which don't help document languages.

For these reasons, we at All the Word are developing a linguistically based rule system.

Question: What does your system use for source texts? Are you using the Greek and Hebrew texts?

We started this project using the Friberg Annotated Greek New Testament and the Louw-Nida Greek Lexicon. Later we included an annotated Hebrew Old Testament and the Brown-Driver-Briggs Hebrew Lexicon. But we encountered numerous problems with both texts and lexicons. Hebrew and Greek are natural languages, and every natural language is impoverished in certain areas. For example, many languages have multiple degrees of past and future tense, but neither Hebrew nor Greek has those. Many languages have "we-inclusive" and "we-exclusive," but neither Hebrew nor Greek has those. Many languages morphologically distinguish singular, dual, trial, and plural, but neither Hebrew nor Greek has those. Many languages indicate varying degrees of proximity (this here, that over there, that way over there, etc.), but neither Hebrew nor Greek has those. So we were constantly enriching the annotations in both the Hebrew and Greek texts.

Additionally, both languages have lexicalized very complex concepts for which many languages don't have lexical equivalents. And the Greek text includes extremely long sentences which span multiple verses, and those must be shortened for people who are newly literate. We also wanted to include other texts such as Bible stories, stories that teach people how to prevent the spread of various diseases like AIDS and Avian Influenza, devotionals, commentaries, etc., but those texts were in English. Because we were using source texts that were in three languages, we were having to write three transfer grammars: one for Hebrew, one for Greek, and one for English.

So we eventually decided to develop our own semantic representations using an English based metalanguage augmented with a feature system that was sufficiently enriched to accommodate a very wide variety of languages. Our semantic representations now consist of semantically simple concepts in structurally simple sentences. We're now able to build semantic representations for the Bible and many other texts, and each language only requires one transfer grammar. Building the semantic representations for the Bible and other texts is a very long process, but after we've finished analyzing a book, our software is able to produce high quality translations in a very wide variety of languages.

Question: How does your system translate the Bible? Is it like Google Translate?

No, our system isn't anything like Google Translate. Google Translate uses large bilingual corpora and statistical techniques to produce its translations. But the languages that still need the Bible don’t have large bilingual corpora, and they generally have very little literature of any kind, so a statistical approach isn’t possible for these languages. Instead of a statistical approach, we’ve developed a linguistic approach that consists of two components: 1) semantic representations of the biblical books and other Christian literature, and 2) a linguistically based natural language generator.

1) Developing Semantic Representations of the Bible and other Christian literature:

In order for our software to produce an accurate and easily understandable translation of a text, we must first thoroughly analyze that text and capture its meaning using very simple words and simple sentence structures. Using our limited vocabulary and small set of sentence structures, we rewrite each biblical book by consulting numerous English translations, commentaries, translation helps, and the Greek and Hebrew texts. When the scholars are uncertain about the intended meaning of a passage (e.g., Luke 23:31), we include alternates which capture the major opinions regarding the intended meaning. After ensuring that all of the information in the original text is present in our new version, we add very detailed linguistic information so that our software will have access to the information it needs as it translates that text into a wide variety of languages. Every word, phrase, and clause is marked and thoroughly described by numerous features. For example, every noun is marked for Number, and the possible values are Singular, Dual, Trial, Quadrial, Plural, and Paucal because some languages morphologically distinguish each of those values. The result of this analysis is called a semantic representation, and it serves as the source document that our software uses during the translation process. Building the semantic representation of a biblical book is a very time-consuming process, but we only have to do it once for each book. Then our software will be able to translate that book into a very wide variety of languages.
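To make the idea of a feature-marked noun concrete, here is a minimal Python sketch. It is only an illustration, not ATW's actual data format: the `Noun` class, the `collapse_number` helper, and the fallback to Plural are all assumptions made for this example. Only the six Number values come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Set


class Number(Enum):
    # The six Number values named in the text above.
    SINGULAR = "Singular"
    DUAL = "Dual"
    TRIAL = "Trial"
    QUADRIAL = "Quadrial"
    PLURAL = "Plural"
    PAUCAL = "Paucal"


@dataclass
class Noun:
    concept: str      # a word from the limited vocabulary (hypothetical field)
    number: Number    # the Number feature marked on every noun


def collapse_number(number: Number, distinguished: Set[Number]) -> Number:
    """Map a source Number value onto what a target language marks.

    Assumed fallback for this sketch: Singular stays Singular, and any
    other value the target language does not distinguish becomes Plural.
    """
    if number is Number.SINGULAR or number in distinguished:
        return number
    return Number.PLURAL


two_sons = Noun(concept="son", number=Number.DUAL)

# A language that only marks singular and plural sees a plural noun.
two_way = {Number.SINGULAR, Number.PLURAL}
print(collapse_number(two_sons.number, two_way).value)

# A language with a morphological dual keeps the richer value.
three_way = {Number.SINGULAR, Number.DUAL, Number.PLURAL}
print(collapse_number(two_sons.number, three_way).value)
```

The point of the sketch is that the source document carries the richest distinction, so each target language can use as much or as little of it as its grammar needs.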

2) A Linguistically Based Natural Language Generator

A natural language generator is a computer program that produces translations of texts using the lexicon and grammar of the target language. We’ve developed a natural language generator which enables linguists and missionaries to build computational lexicons and grammars for their languages. After a linguist determines the rules for a particular language, he can enter those rules into our software, and then our software is able to execute those rules. Our software includes rules for adding prefixes, suffixes, infixes, circumfixes, etc. Other rules put all the constituents in their proper order. There are rules to identify where pronouns should be used, and to determine the surface forms of those pronouns. There are also collocation correction rules which change one target word to another target word in specific environments. There are rules for handling the various relativization strategies for relative clauses. So there are numerous types of rules in our grammar. Then our software applies that information to our semantic representations of the Bible, and produces initial draft translations that are easily understandable, grammatically perfect, and convey the same information as the original text. The computer’s draft must then be edited by mother-tongue speakers in order to make it more natural and relevant to the target culture. The result is an easily understandable translation in a fraction of the time required by manual translation.
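The idea of a grammar as a sequence of rules can be sketched in a few lines of Python. This is a toy under stated assumptions, not ATW's software: the rule functions, the clause format, and the invented stems and suffix ("tam", "mosi", "ludo", "-ka") belong to no real language and exist only to show an affix rule followed by an ordering rule.

```python
from typing import Callable, Dict, List

# A clause is a list of constituents. Each constituent is a dict holding a
# role, a target-language form, and optional features (all hypothetical).
Constituent = Dict[str, str]
Rule = Callable[[List[Constituent]], List[Constituent]]


def plural_suffix_rule(clause: List[Constituent]) -> List[Constituent]:
    """Affix rule: add the invented suffix "ka" to plural nouns."""
    result = []
    for constituent in clause:
        constituent = dict(constituent)  # copy so the input is not mutated
        if constituent["role"] != "verb" and constituent.get("number") == "Plural":
            constituent["form"] += "ka"
        result.append(constituent)
    return result


def verb_initial_order_rule(clause: List[Constituent]) -> List[Constituent]:
    """Ordering rule: arrange constituents as verb, subject, object."""
    rank = {"verb": 0, "subject": 1, "object": 2}
    return sorted(clause, key=lambda c: rank[c["role"]])


def generate(clause: List[Constituent], rules: List[Rule]) -> str:
    """Apply each rule in sequence, then join the surface forms."""
    for rule in rules:
        clause = rule(clause)
    return " ".join(c["form"] for c in clause)


clause = [
    {"role": "subject", "form": "tam", "number": "Singular"},
    {"role": "verb", "form": "mosi"},
    {"role": "object", "form": "ludo", "number": "Plural"},
]
grammar = [plural_suffix_rule, verb_initial_order_rule]
print(generate(clause, grammar))  # mosi tam ludoka
```

Because the rules run in order, the output of one rule is the input of the next, which is why building a grammar resembles developing a small software project.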

Question: Is your system able to translate the entire Bible?

Our software is able to translate all of the books that we've built semantic representations for. At the present time we don't yet have semantic representations for the entire Bible, so our software can't yet produce an entire Bible. We currently have semantic representations for Genesis, Ruth, Esther, Daniel, Nahum, Luke, and six Pauline epistles. We're currently developing semantic representations for Exodus, Judges, 1&2 Samuel, and a series of Bible stories. Our ultimate goal is to produce semantic representations for the entire Bible, commentaries, Bible study materials, and classic Christian literature. Then our software will be able to quickly translate all of those materials into a new language.

Question: How does your system handle poetry?

All translators strive to accurately translate the meaning of the source text rather than the form of the source text. Poetry is a form, and it's virtually impossible for even human translators to produce a text in a target language that has the same rhyme and rhythm as the original text.

Hebrew poetry generally uses patterns of stress and meaning rather than rhyme or rhythm. Whether a translation is produced by a human or by our software, it most likely will not have the same pattern of stress as the original text. While unfortunate, this problem occurs for all translations that make it a priority to preserve the meaning, including most English translations produced by scholars.

Providentially, the dominant characteristic of Hebrew poetry is semantic. As Derek Kidner states in his commentary on the Psalms, "this type of poetry loses less than perhaps any other in the process of translation" and "survives transplanting into almost any soil." The semantic patterns, often called "parallelisms," match one thought with another thought. The second thought might echo the first thought, or build to a stronger conclusion, or state an opposite proposition, or fill in details. ATW translations will, in most cases, be able to preserve these semantic patterns.

Another dominant feature of Hebrew poetry is metaphor, when we say that something "is" some other thing. We try to preserve metaphor in our semantic representations unless the metaphor is so specific to Hebrew culture that it is unlikely to be understood by other cultures. In that case, we might replace the metaphor with a simile, saying that something "is like" some other thing. Where we think that the metaphor might or might not be understood, we include an alternate version so that the translator can make a choice for his or her specific language. For instance, in Psalm 23:1, our translation with the metaphor is “Yahweh is my shepherd.” But we include an alternate representation with a simile: “Yahweh is like a shepherd. He cares for me.”
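One way to picture a passage that carries an alternate is the small Python sketch below. The `Passage` class and its `choose` method are hypothetical names invented for this illustration, and real semantic representations are feature-annotated structures rather than plain strings. Only the two Psalm 23:1 renderings come from the text above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Passage:
    reference: str
    primary: str                                          # rendering that keeps the metaphor
    alternates: List[str] = field(default_factory=list)   # e.g. simile versions

    def choose(self, index: int = 0) -> str:
        """Return the primary rendering (0) or the alternate at index - 1."""
        options = [self.primary] + self.alternates
        return options[index]


psalm_23_1 = Passage(
    reference="Psalm 23:1",
    primary="Yahweh is my shepherd.",
    alternates=["Yahweh is like a shepherd. He cares for me."],
)

# A translator whose readers know shepherding keeps the metaphor.
print(psalm_23_1.choose())

# A translator whose readers might not understand it picks the simile.
print(psalm_23_1.choose(1))
```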

Question: Why are you focusing on the Old Testament?

There are thousands of missionaries around the world working to manually translate the New Testament. If a missionary is making good progress, or is close to being finished, he or she doesn't want computational help with the remaining New Testament books. But we've found that nearly all of the translators warmly welcome computational help with the Old Testament. So for the near future, we will focus on the Old Testament so that we can provide assistance to the missionaries who are already working on the New Testament. After we have semantic representations for the entire Old Testament, we'll start building semantic representations for the New Testament, commentaries, and other Christian literature.

Question: How long does it generally take a person to start producing translations using your system, and what skills are necessary?

We'll answer this question based on Tod Allman's experiences with English, Tagalog, and Ayta Mag-Indi, but please keep in mind that there are many factors which can either increase or decrease the amount of time required for a particular language project.

In order to use our system, the first step is to become familiar with our software and semantic representational system. A computational linguist can become familiar with our system in approximately 40 to 50 hours. We have videos and tutorials on our website so that people can learn how to use our system. In order to build a grammar for a language, a person needs to have some experience developing software. Building a grammar for a language is similar to developing a small software project. A grammar consists of a sequence of rules, so in order to build a grammar, a person has to be able to think in terms of sequenced rules, just like we do when developing software. After a person is familiar with our system, he'll be ready to start a language project.

When starting a new language project, we first work through a series of about 300 simple sentences that we call the Grammar Introduction. These are very simple sentences that illustrate the various tenses, aspects, moods, illocutionary forces, noun-noun relationships, pronouns, adjective and adverb degrees, noun proximity values, etc., that occur in our semantic representational system. The Grammar Introduction concludes with a synopsis of the David and Goliath story so that we produce an actual text. Tod worked through the Grammar Introduction in English in about 40 hours, but he knows the software thoroughly. For people who don't know how to use the software well, it will take longer. If the person is a mother-tongue speaker of the language he's working on, the work will certainly progress more quickly. When Tod was developing the lexicon and grammar for Tagalog, the Grammar Introduction required approximately 30 two-hour meetings with a mother-tongue speaker. After each meeting, Tod would typically work for another two or three hours to implement all the rules they had discussed. So the total time for working through the Grammar Introduction in a language you're not familiar with is about 150 hours.

After completing the Grammar Introduction, we generally work through a very short simple story about preventing eye infections. This story is simpler than the texts in the Bible, so it's beneficial to work through this story before starting a biblical text. For Tagalog this story required approximately 10 two-hour meetings, and again Tod worked for another couple of hours after each meeting. So the total time to work through this story in a language you're not familiar with is about 50 hours. For Ayta Mag-Indi, Tod didn't need to work through the Grammar Introduction because the language is structurally very similar to Tagalog. For the story about eye infections, Tod needed 7 two-hour meetings with the missionary working in the language, and he worked for a couple of hours after each meeting to implement the rules. So the total time for that project was about 30 hours.

After completing the story about preventing eye infections, you're ready to begin working on a biblical book. We generally start with Ruth because it's a short, simple historical narrative. For Tagalog we needed approximately 50 two-hour meetings to produce a draft of Ruth. And again Tod worked for several hours after each meeting, so the total time involved was about 200 hours. By the time you've finished Ruth, the translation process is definitely accelerating. While working in the Grammar Introduction, each sentence requires several hours because you have to build so many rules. But when you're finished with Ruth, the grammar is pretty thorough, and you can generally work through several verses in a two-hour meeting. As you work through additional texts such as Luke, Esther, Daniel, Genesis, etc., the pace of the translation work definitely accelerates because the lexicon and grammar are well developed. At that point you'll be able to work through several chapters during a typical two-hour meeting. If we had semantic representations for the entire Bible, then eventually you could click a button, and the computer would work for a couple of days, and you'd have an initial draft translation of the entire Bible. Mother-tongue speakers could then edit that draft into a publishable text in a fraction of the time required for manual translation.
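As a rough tally, the approximate Tagalog figures quoted in this answer can be added up as follows. The numbers are the rounded totals stated above (about 150, 50, and 200 hours), so the sum is only an estimate. The variable names are just labels chosen for this sketch.

```python
# Approximate totals stated in this answer for Tagalog, in hours.
tagalog_hours = {
    "Grammar Introduction": 150,
    "eye infection story": 50,
    "Ruth": 200,
}

# Time spent in meetings alone for the Grammar Introduction:
# approximately 30 meetings of two hours each.
grammar_intro_meeting_hours = 30 * 2

total_hours = sum(tagalog_hours.values())
print(f"Meeting hours for the Grammar Introduction: {grammar_intro_meeting_hours}")
print(f"Approximate total through a draft of Ruth: {total_hours} hours")
```

So, on these figures, reaching a first draft of Ruth in an unfamiliar language takes roughly 400 hours, not counting the 40 to 50 hours needed to learn the software.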

Question: I'd like to receive your newsletters. How can I sign up?

If you'd like to receive our quarterly news and prayer letters, please send an email to "info" at "AllTheWord.org" with "ATW newsletter" in the subject line. We'll be glad to send you our newsletters.

Question: I'd like to become involved in your ministry. Can I talk to someone about the possibilities?

If you'd like to become involved in this ministry, we'd be glad to hear from you. Please send an email to "info" at "AllTheWord.org" and describe your interests. We're always looking for Bible translators, Bible exegetes, software developers, computational linguists, and people with other skills that can contribute to this ministry. So please send us an email describing how you'd like to be involved.

For general information, please contact:

Tod Allman

Email: Tod_Allman at AllTheWord.org

For financial inquiries, please contact:

Richard Denton

386-402-4299

Email: Richard_Denton at AllTheWord.org