Some insight into our Digital Humanities Stream

by Louis Knölker, Student Research Assistant

As part of our research, one of our goals is to visualize the distribution and development of Bible translations and to make the resulting interactive world map available to researchers on a dedicated website. This will make it easier to analyse relationships such as those between colonial expansion and the global distribution of the Bible. We have sourced a number of books that list various Bible translations. These books provide a wealth of dates, names, and places associated with the translation of the Bible into about 1,400 languages. It would have been conceivable to enter all the data of interest from our various sources into a database by hand, but this would have taken a great deal of time that we can save with DH tools. Nevertheless, a lot of work is still required before the data can be fed into our database. In order to extract the selected data, the books first have to be converted into a machine-readable format that the computer can process further. For this we needed an optical character recognition (OCR) tool, and we opted for the open-source program OCR4all from the University of Würzburg. Our process requires the following essential steps:

  1. Preprocessing
    In this step, each page we are interested in is divided into different sections. This is necessary for the subsequent optical character recognition (OCR). In this preprocessing step, for example, a distinction is already made between the year of a particular translation and the number of speakers of a particular language. The AI could hardly tell these different kinds of numbers apart on its own, so the distinction has to be prepared manually.
  2. Text Recognition
    Now the first automated text recognition takes place, as familiar from programs such as Adobe Acrobat. However, this first run is based on a very general model that is not yet tailored to the specifics of the book. Accordingly, there are still many errors, such as confusing the letter O with the digit 0. In this state, the data cannot be used for our purposes.
  3. Correction of the automated text recognition
    Any incorrect results from step 2 must now be corrected manually. This does not mean correcting the entire corpus, however, but only a fraction of it. These pages must be meticulously checked against the original and changed where necessary. This quickly reveals typical errors that the program makes again and again, such as failing to recognize special characters or accents (a small script for pre-flagging such cases is sketched after this list).
  4. Training an adapted text recognition model
    The data resulting from the previous step is now used by the program to train a machine learning model that tailors the text recognition to the corpus. This significantly reduces the error rate.
  5. Application of the improved model
    The optimized model is now applied and the text recognition step is repeated.
  6. Correcting the result again
    Once again, a limited number of pages are corrected manually, thereby increasing the data set for further improvement of the AI model.
  7. Repeat steps 3-6 until the text recognition has been optimized
  8. Finalization
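
A large part of the correction work in steps 3 and 6 consists of hunting down a few recurring error patterns, such as the O/0 confusion mentioned above. The following Python sketch is purely illustrative (it is not part of OCR4all, and the file name is a placeholder); it pre-flags lines of exported OCR text that contain tokens mixing letters and digits, so a human corrector can jump straight to the suspicious spots:

```python
import re
from pathlib import Path

# Tokens that mix letters and digits (e.g. "19O6") are a typical symptom of
# the O/0 confusion described in step 2 and are worth a second look.
SUSPICIOUS = re.compile(r"\b(?=\w*[0-9])(?=\w*[A-Za-z])\w+\b")

def flag_suspicious_lines(page: Path) -> None:
    """Print every line of an OCR'd page that contains a mixed letter/digit token."""
    for lineno, line in enumerate(page.read_text(encoding="utf-8").splitlines(), start=1):
        hits = SUSPICIOUS.findall(line)
        if hits:
            print(f"{page.name}:{lineno}: {', '.join(hits)}")

if __name__ == "__main__":
    # "page_012.txt" stands in for one exported OCR text page.
    flag_suspicious_lines(Path("page_012.txt"))
```

A script like this naturally also flags legitimate tokens such as "2nd", so it only points a human to candidate lines rather than correcting anything itself.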

Theoretically, this should be the process. Unfortunately, not everything always works as intended, and we encountered problems that we had to deal with. Step 4 only worked once, and since then we have had problems with the program. We therefore had to correct many of the passages of the book that were relevant to our regional focus (Arctic, Australia and Oceania, West Africa) entirely by hand.

1. Compiling a list of the relevant regional languages

Using a keyword search, I compiled a list of a few hundred languages from the regions mentioned.
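
Such a keyword search can be reproduced with a few lines of Python. The sketch below rests on assumptions (one plain-text file per OCR'd page in a directory called ocr_output, which is a placeholder name); it simply reports on how many pages each region keyword occurs:

```python
from pathlib import Path

# Region keywords from our regional focus.
KEYWORDS = ["Arctic", "Australia", "Oceania", "West Africa"]

def find_keyword_pages(corpus_dir: str) -> dict[str, list[str]]:
    """Map each keyword to the OCR'd pages (file names) on which it occurs."""
    hits: dict[str, list[str]] = {kw: [] for kw in KEYWORDS}
    for page in sorted(Path(corpus_dir).glob("*.txt")):
        text = page.read_text(encoding="utf-8").lower()
        for kw in KEYWORDS:
            if kw.lower() in text:
                hits[kw].append(page.name)
    return hits

if __name__ == "__main__":
    # "ocr_output" is a placeholder for the directory of exported OCR text pages.
    for keyword, pages in find_keyword_pages("ocr_output").items():
        print(f"{keyword}: found on {len(pages)} pages")
```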

2. Linking with the Glottolog

These languages were also linked to their respective entries in the Glottolog, an open-access website that lists all the world’s languages and links them, among other things, to identification codes. Linking each language to its so-called glottocode will facilitate future research and is planned for the final map. The difficulty in this step lies in the names of the languages: a language often has several names, which may differ between our base sources and the Glottolog database. Some names are also outdated or may today cover only one dialect, meaning that they are no longer considered independent languages.
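
Because the names rarely match exactly, the two lists cannot simply be joined automatically. The following is only a sketch of how candidate matches could be suggested for human review, assuming a local table exported from Glottolog with the (illustrative) columns "name" and "glottocode":

```python
import csv
import difflib

def load_glottolog(csv_path: str) -> dict[str, str]:
    """Read a local table of Glottolog language names and their glottocodes.

    The file and its column names ("name", "glottocode") are assumptions for
    this sketch; any export of Glottolog data can be reshaped accordingly.
    """
    with open(csv_path, encoding="utf-8") as f:
        return {row["name"]: row["glottocode"] for row in csv.DictReader(f)}

def suggest_glottocodes(language: str, glottolog: dict[str, str]) -> list[tuple[str, str]]:
    """Suggest likely Glottolog matches for a language name from our sources.

    Because the books often use variant, outdated, or dialect names, this
    only proposes candidates for a human to confirm instead of linking
    automatically.
    """
    candidates = difflib.get_close_matches(language, list(glottolog), n=3, cutoff=0.8)
    return [(name, glottolog[name]) for name in candidates]

if __name__ == "__main__":
    # "glottolog_languages.csv" is a placeholder file name.
    table = load_glottolog("glottolog_languages.csv")
    print(suggest_glottocodes("Auca", table))
```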

3. Manual correction of the respective sections

Correcting the individual pages is a relatively time-consuming process in which you constantly have to compare the original lines with the generated ones. As a result, you read every line of the corpus twice. This limits the reading pleasure, as does the often keyword-like structure of the text, so you have to make a concerted effort to stay concentrated.
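
Once a page has been corrected, a simple diff between the raw OCR output and the corrected version shows at a glance which errors the model keeps making and whether another training round (steps 4–6) still pays off. Here is a minimal sketch using only the Python standard library; the file names are placeholders:

```python
import difflib
from pathlib import Path

def show_corrections(raw_path: str, corrected_path: str) -> None:
    """Print a unified diff between the raw OCR output and the corrected page.

    Lines starting with "-" show what the model produced, lines starting
    with "+" show the manual correction, which makes recurring errors easy
    to spot.
    """
    raw = Path(raw_path).read_text(encoding="utf-8").splitlines()
    corrected = Path(corrected_path).read_text(encoding="utf-8").splitlines()
    for line in difflib.unified_diff(raw, corrected,
                                     fromfile=raw_path, tofile=corrected_path,
                                     lineterm=""):
        print(line)

if __name__ == "__main__":
    # Placeholder file names for one raw and one hand-corrected page.
    show_corrections("page_012_ocr.txt", "page_012_corrected.txt")
```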

Nevertheless, there are always exciting passages and amazing anecdotes in our sources. I was often impressed by the great personal commitment of people who translated the Bible into languages spoken by only a few hundred people. I would like to briefly present one of the most remarkable and extreme stories here, namely the story of the Bible translation into the Auca language:

Auca is one such micro-language with just 300 speakers, the Warani, who live in the Ecuadorian jungle. As early as the 17th century, there was peaceful contact between the Warani and a Jesuit priest who lived among them for several years. However, his successor was murdered and contact with the tribe was broken off. It was not until around 300 years later, in 1956, that there was another attempt to Christianize the Warani. A group of five missionaries led by Nathaniel Saint set out to convert the Warani, but all five members of the company were also murdered. Amazingly, this terrible event motivated Rachel Saint, Nathaniel Saint’s sister, and Betty Elliott, widow of one of the murdered missionaries, to try again – this time successfully. Thanks to the great efforts of these women, the entire tribe was evangelized ten years later and the Gospel of Mark was translated. In 1965, the children of the deceased Nathaniel Saint were baptized in the Curaray River at the site of their father’s murder. The baptism was performed by one of the men involved in the murder of their father.

From: Nida, Eugene A. (ed.) Book of a Thousand Tongues. 2nd ed., United Bible Societies, 1972, p. 69.
