Online Dictionaries on the Internet: An Overview for the African Languages

Abstract: The main purpose of this research article is rather bold, in that an attempt is made at a comprehensive overview of all currently available African-language Internet dictionaries. Quite surprisingly, a substantial number of such dictionaries is already available, for a large number of languages, with a relatively large number of users. The key characteristics of these dictionaries and various cross-language distributions are expounded on. In a second section the first South African online dictionary interface is introduced. Although compiled by just a small number of scholars, this dictionary contains a world's first in that lexicographic customisation is implemented on various levels in real time on the Internet.

starting from an imaginary line north of the current Democratic Republic of the Congo (DRC) all the way down to the southern tip of the African continent. Roughly speaking, only the languages spoken in the Cape region and north of it (Afrikaans and the Khoesaan languages) do not belong to this family. Guthrie then set out to classify all the languages within this region, a classification based mainly on geographical contiguities, and much less on linguistic features. The result consisted of 16 'zones' covering nearly 80 'groups'. The zones start in the northwest (A), go to the northeast (B, C, D and E), then south (F and G), again from west to east (H, K, L, M, N and P), and once more from west to east (R, S and T). These zones are made up of groups (A10, A20, ...; ...; M10, M20, ...; ...), with each group bringing together so-called related languages (A11, A12, ..., A21, A22, ...; ...; M11, M12, ..., M21, M22, ...; ...). Over the years he extensively revised zones A, B and C (Guthrie 1953), and also, all of a sudden but apparently in response to criticism (Cope 1971: 218), collapsed the Southern African zones S and T into a single zone S. Guthrie's 'final' classification can be found in the third volume, pages 11 to 15, of his magnum opus (Guthrie 1967, 1971, 1970, 1970a).
In Tervuren, Belgium, which soon became the mecca of Central-African language studies, a new zone was introduced around the region of the Great Lakes, zone J, consisting of Guthrie's groups E10, E20 and E30, as well as of sections of D40, D50 and D60. The numbering was simply transferred to J10 up to J60 respectively (Bastin 1978). In order to distinguish between neighbouring languages/dialects, extra letters are sometimes added (e.g. L31a for Cilubà spoken by the Balubà, L31b for Cilubà spoken by the Beena Luluwà, etc.). Since Guthrie, some languages have become extinct, while previously undocumented ones have been documented. Codes for languages not originally in Guthrie's list mostly start with the linguistic group to which the extra language seems to be most affiliated, say E40, to which a third digit is added, e.g. E402 for Ikizu. At least, this is the practice of most scholars, such as Lowe and Schadeberg (1996) or Maho (2003).
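The coding conventions just described (a zone letter, a two-digit code within the zone, an optional third digit for languages added after Guthrie, and an optional letter for neighbouring languages/dialects) can be illustrated with a small parser. This is only a minimal sketch: the names `GuthrieCode` and `parse_code` and the regular expression are assumptions made for illustration, and are not part of any of the cited classifications.

```python
import re
from typing import NamedTuple, Optional

# Guthrie's 16 zone letters (T was later merged into S) plus Tervuren's zone J.
ZONES = "ABCDEFGHJKLMNPRST"

# zone letter, tens digit, units digit, optional third digit, optional letter
CODE_RE = re.compile(rf"([{ZONES}])(\d)(\d)(\d)?([a-z])?")


class GuthrieCode(NamedTuple):
    zone: str                   # e.g. "L"
    group: str                  # e.g. "L30"
    base: str                   # two-digit code, e.g. "L31" (or "E40" in "E402")
    extra_digit: Optional[str]  # third digit for later additions, e.g. "2"
    variety: Optional[str]      # extra letter, e.g. "a"


def parse_code(code: str) -> GuthrieCode:
    """Split a code such as 'L31a' or 'E402' into its components."""
    match = CODE_RE.fullmatch(code.strip())
    if match is None:
        raise ValueError(f"not a recognisable zone/group code: {code!r}")
    zone, tens, units, extra, variety = match.groups()
    return GuthrieCode(zone, f"{zone}{tens}0", f"{zone}{tens}{units}", extra, variety)


print(parse_code("L31a"))
print(parse_code("E402"))
```

Such a parser only checks the shape of a code. It cannot resolve the cases mentioned in the text where the same code covers different languages in different checklists.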
Nonetheless, in both the Guthrie and Tervuren checklists, the same code sometimes covers different languages. Furthermore, not everyone uses the Tervuren zone J. The result of this state of affairs is that there is considerable confusion as to which language has which code, and vice versa. Moreover, many languages have numerous alternate spellings and/or are simply referred to by means of different names. The existence and status of dialects further complicate the issue. The exact number and location of languages is therefore still not known half a century after Guthrie's pioneering work, yet it is generally accepted that there are at least five hundred and fewer than six hundred. Given all this confusion, it is obviously not truly possible to quantify any claims regarding this family of languages. For one, there is not even a fixed upper limit.
Apart from Guthrie's final classification, and Tervuren's latest checklist (Bastin, Coupez and Mann 1999), there is also a third classification that is often consulted, viz. the one found in Ethnologue (Grimes and Grimes 2000). A highly useful comparison of the three classifications was compiled by Maho (2002). In the discussion below, however, certain decisions had to be made in order to provide for a scientific framework. These decisions were as follows: (1) Ethnologue was used as arbiter on language names, (2) the codes for the languages were mainly taken from the Tervuren checklist, (3) wherever Guthrie's data seemed more precise, his 'language name + language code' pair was kept, and (4) where applicable, the current official language names overruled the above.

Internet dictionaries for the South African languages
Now that the term 'African languages' has been delimited for the purposes of this article, one can turn to the concept 'Internet dictionaries'. Such reference works form part of the larger family of human-oriented electronic dictionaries and, within a three-step access dictionary typology, can be characterised as reference works for which 'users worldwide use laptops/desktops to access a dictionary stored on an online server' (De Schryver 2003: 151). Reformulated, these are thus online dictionaries for which the data are stored in databases, no matter where these databases are located, and which can be consulted from a search screen by anyone from anywhere through the Internet. Intranet dictionaries, another type of online electronic dictionary, will thus not be considered. For convenience, however, the terms 'online dictionary' and 'Internet dictionary' are used interchangeably in this article. A comprehensive overview of the features of the various electronic dictionaries, as well as a detailed discussion of their advantages over paper dictionaries, can be found in De Schryver (2003). Suffice it to say here that an electronic dictionary is much more than 'a dictionary in electronic form'. At the very least, the data are stored in a database, to which various (search) indexes are added, with a multitude of links to multimedia, as well as, increasingly, Natural Language Processing (NLP) extensions.
Rather surprisingly, these various aspects already exist for some of the African languages spoken in South Africa, albeit not yet all together in one integrated Internet dictionary package. An online dictionary for Tshivenda (S21), for example, is available from CBOLD. It contains 8 900 lemma signs, all of them searchable from a search screen, yet only with textual output. Sound files were added to various basic travellers' phrases for Sesotho (S33), among others, at TravLang, while full multimedia (i.e. text, audio and computer graphics) can be found at eLanguage for isiZulu (S42). Lastly, an example of an online NLP aspect that has been developed for a South African language is the machine translation (MT) system running between isiXhosa (S41) and English at Xhosa on the Web! (O'Kennon 1996-2003).
As argued by Varantola (2002: 35) and De Schryver (2003: 167, 169-172), multimedia corpora will increasingly become part and parcel of future electronic dictionaries. This NLP aspect does not yet exist for South African languages, but across the border Internet-searchable text corpora are already available for ChiShona (S11-S12-S14) and SiNdebele (or Zimbabwean Ndebele (S44)). These online corpora of respectively 2.2 million and 0.7 million running words were originally assembled with dictionary compilation in mind, and have now been made available to the wider linguistic community (Ridings 2002).
Although most of the online dictionaries for South African languages have been online for quite some years now, it is somewhat disturbing to note that relatively few people know about their existence. Apart from the fact that the full Internet potential is not used within a single integrated package in any one of them, one of the reasons for their shadowy presence could be that none of these existing online dictionaries was made in South Africa, by South Africans, for South Africans. All these aspects are niches that can be filled by prospective lexicographers, besides the fact that such lexicographers can of course also improve on current size, quality and functionality.

A systematic overview of online African-language dictionaries
In this section, a systematic overview will be presented of currently available Internet dictionaries for the African languages. One immediately notices an uneasy balance between the concepts 'currently available' and 'Internet' here. Indeed, the Internet being an organic medium, its contents literally change every single second. One must therefore put a timestamp on the study, with all claims referring to that time frame. The timestamp is 'April 2003', as this is the period during which the Internet was trawled (with the help of search engines such as Google) to trace all available African-language Internet dictionaries. What follows is a summary and a discussion of the main findings, with all claims thus 'valid' for April 2003.
Before the results themselves are presented, it is important to recall that 'Internet dictionaries' in this study are only those online dictionaries that can be accessed from a search screen. This means that one must be able to type in words or sections of words, potentially including wildcards, followed by a mouse click or 'enter', upon which one or more articles are presented following a page reload. Based on this premise, the following two types of dictionaries that can be found en masse on the Internet have not been included in this study: (1) dictionaries in pdf (Portable Document Format), word processor, or any other downloadable text format, such as Odden's (2002) Kikerewe-English Dictionary (J24) in pdf; and (2) dictionaries which are simply plain online HTML (HyperText Markup Language), or HTML-like, files, such as Ikuska Libros's (1997-2003) Diccionario Lingala-Español-Lingala (C36d) in HTML, or dictionaries such as those from the TravLang series mentioned above, which have no search facilities and can only be 'browsed'.
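The inclusion criterion, i.e. a search screen that accepts words or sections of words with optional wildcards and returns one or more articles, can be made concrete with a minimal sketch. The data and the function name `search` below are purely illustrative assumptions (a handful of common Swahili words with English glosses) and are not taken from any of the surveyed dictionaries. Shell-style wildcards (`*` and `?`) are likewise an assumption, since the surveyed sites differ in the patterns they accept.

```python
from fnmatch import fnmatchcase

# Illustrative headword -> gloss data, standing in for a dictionary database.
ENTRIES = {
    "kitabu": "book",
    "kiti": "chair",
    "maji": "water",
    "mtu": "person",
}


def search(query: str, entries: dict = ENTRIES) -> list:
    """Return (headword, gloss) pairs whose headword matches the query.

    '*' matches any run of characters and '?' matches a single character.
    """
    pattern = query.strip().lower()
    return sorted(
        (headword, gloss)
        for headword, gloss in entries.items()
        if fnmatchcase(headword, pattern)
    )


print(search("kit*"))
print(search("m?u"))
```

A plain HTML word list or a pdf offers no such query step, which is why those two types fall outside the scope of the study.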
The following three types, on the other hand, were considered for this study: (1) online dictionaries, i.e. dictionaries stored in databases over the Internet; (2) pop-up dictionaries, i.e. dictionaries with which, once one has downloaded a small piece of software, one can move the mouse over words online, upon which the relevant articles pop up in dedicated screens; and (3) PC dictionaries, i.e. dictionaries for which a piece of software cum one or more lexica are downloaded from the Internet, to be used as offline PC dictionaries. Note that the lexica in (2) can also be downloaded to the hard drive of a PC, at which point they become, in addition, functional as offline pop-up PC dictionaries.
Following the investigation, an impressive number of 182 African-language Internet dictionaries were found: 165 of the 'online' type, 8 of the 'pop-up' type, and 9 of the 'PC' type. All major characteristics of these 182 dictionaries have been tabulated in the Appendix, and as such this appendix (which is sorted by the names of the languages) should be considered the basis of the analysis that is to follow. These 182 dictionaries cover 117 different languages, as well as Common Bantu (CB) and Proto Bantu (PB). PB is the hypothetical language to which all current languages within this family can be traced back, while CB are the c. 2 800 series of comparative forms that were used by Guthrie to reconstruct PB. The distribution of the number of Internet dictionaries per language is as follows: Swahili (G42): 20 x, Chagga (E62): 14 x, Lingala: 5 x, Ganda (J15) and isiZulu: 4 x each, Meru (E61): 3 x, 18 other languages + PB: 2 x each, and 93 other languages + CB: 1 x each. As with many other real-world phenomena, one notices a Zipfian distribution, i.e. the number of Internet dictionaries is extremely high for just a small number of languages, while the frequency for the great majority is very low. That there are relatively many dictionaries for languages such as Swahili, Lingala and isiZulu is understandable; these are the languages that also receive much academic (and other) attention. That a language such as Chagga scores high, however, is out of proportion.
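The per-language distribution can be checked against the stated totals with a few lines of arithmetic. The sketch below only restates the figures given in the text; the variable names are assumptions introduced for illustration.

```python
# Languages named individually in the text, with their dictionary counts.
named = {"Swahili": 20, "Chagga": 14, "Lingala": 5, "Ganda": 4, "isiZulu": 4, "Meru": 3}

languages_with_two = 18  # plus Proto Bantu (PB), also with 2 dictionaries
languages_with_one = 93  # plus Common Bantu (CB), also with 1 dictionary

total_dictionaries = (
    sum(named.values())
    + 2 * (languages_with_two + 1)  # 18 languages + PB
    + 1 * (languages_with_one + 1)  # 93 languages + CB
)
total_languages = len(named) + languages_with_two + languages_with_one
total_by_type = 165 + 8 + 9  # online + pop-up + PC

print(total_dictionaries, total_languages, total_by_type)
```

Both breakdowns add up to 182 dictionaries, and the language counts add up to 117, with CB and PB counted separately from the languages.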
Indeed, there is some serious skewing in the geographical dispersion as a result of one single source that contains over a hundred African-language Internet dictionaries. In the early 1970s Derek Nurse and Gérard Philippson surveyed the languages of Tanzania and neighbouring countries; their study is known as the Tanzania Language Survey (TLS, Nurse and Philippson 1975), and it resulted in 124 parallel c. 1 000-word wordlists. For some of the languages, however, different dialects were recorded: 14 in the case of Chagga, 3 in the case of Meru, etc. In all, there are lexica for 97 different languages, as well as one for PB and one for English. Given this, it is clear that there is a significant bias towards the languages of Tanzania and East Africa. Moreover, the fact that Swahili is mainly spoken in Tanzania pushes the distribution even more into that region of the African continent.
Despite the bias, and despite the small size of the TLS lexica, they are as a whole an interesting application of the hub-and-spoke model (Martin 1996: 209, 214). Indeed, with English/Swahili as hub, all the other 122 lexica are linked to it as spokes, and as a result an online dictionary for each and every language pair, triple, quadruple, etc. can now also be 'created', passing through the hub. The number of permutations, and thus the potential number of different multilingual dictionaries one can generate in this way, is virtually unlimited. The basic hub-and-spoke framework is actually becoming ever more popular online for dictionaries involving the languages used in the European Community (EC). In one set of applications, viz. Ergane and Majstro, Esperanto was chosen as hub with, besides mostly EC languages, Swahili, isiZulu and Setswana (S31) as spokes. From a sound metalexicographic point of view, there are many good reasons to have reservations when it comes to the hub-and-spoke model. Yet choosing an artificial language as hub, thus one where the level of polysemy is virtually non-existent, definitely goes some way to avoid a number of the theoretical problems.
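How a dictionary for an arbitrary language pair can be 'created' by passing through the hub may be sketched as follows. This is a simplified illustration under stated assumptions: each spoke lexicon is modelled as a mapping from a hub item (here an English gloss) to a single word, the three small lexica are illustrative examples rather than TLS, Ergane or Majstro data, and the names `SPOKES`, `derive_pair` and `count_pairs` are hypothetical.

```python
# Each spoke maps a hub item to one word in that language (illustrative data).
SPOKES = {
    "Swahili": {"water": "maji", "person": "mtu"},
    "isiZulu": {"water": "amanzi", "person": "umuntu"},
    "Setswana": {"water": "metsi", "person": "motho"},
}


def derive_pair(source: str, target: str, spokes: dict = SPOKES) -> dict:
    """Derive a source -> target word list by joining two spokes on the hub."""
    src, tgt = spokes[source], spokes[target]
    shared_hub_items = src.keys() & tgt.keys()
    return {src[item]: tgt[item] for item in shared_hub_items}


def count_pairs(n_lexica: int) -> int:
    """Number of unordered language pairs obtainable from n linked lexica."""
    return n_lexica * (n_lexica - 1) // 2


print(derive_pair("Swahili", "isiZulu"))
print(count_pairs(124))
```

With 124 parallel lexica this already gives 7 626 unordered pairs, before triples and larger combinations are considered. The sketch deliberately ignores polysemy: a hub item with several senses would link words that are not equivalents of one another, which is one source of the metalexicographic reservations mentioned above.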
While learners might find it most useful that English was included as one of the parallel lexica of TLS, comparative linguists surely appreciate the fact that Guthrie's PB reconstructions were also added, so that reflexes across the various languages can be directly compared. From the time when Guthrie worked on PB, reconstructions have mainly been drawn up in Tervuren, with Meeussen's (1980, based on a manuscript from 1969) BLR and Coupez, Bastin and Mumba's (1998) BLR 2 the two major releases so far. BLR 2, with 9 reconstructed forms, is the backbone of the ambitious CBOLD project, originally located in Berkeley, now transferred to Lyon. This research team collected a multitude of dictionaries, mostly as downloadable text files only, however, and containing many errors resulting from the use of optical character recognition (OCR) on poor-quality scans. As pointed out above, such dictionaries have not been considered in this study. A total of 22 other dictionaries, as well as BLR 2, can be queried online though. Adding reconstructions to PB for these dictionaries, with BLR index numbers and Guthrie codes, is still ongoing. Note that, at the time of writing, a web site dedicated to BLR 3 is in the making (Bastin et al. 2003).
The CBOLD web site also houses the TLS data, which effectively makes this single site the 'major collection', at least quantity-wise, with 146 online dictionaries for 111 different languages and 2 for PB. In April 2003, the largest Internet dictionary for this language family, however, was located at Yale University, where The Kamusi Project contained 58 038 Swahili and 58 041 English 'articles' (Kamusi 1994-2001). These values were arrived at by simply counting the number of entries, and do not reflect the true sizes since a new entry is used for each new synonym, for each new sense, etc. If the number of truly unique lemma signs is summed, regardless of part of speech (POS), then the Swahili to English side turns out to contain 18 411 items, and the English to Swahili side 26 970 items. This dictionary is a prototypical example of bottom-up lexicography (Carr 1997: 214), which means that it is being compiled by Netizens. The contents should thus be consulted with caution.
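The difference between counting entries and counting unique lemma signs can be illustrated with a short sketch. The rows below are illustrative assumptions (a few Swahili words with one row per sense or synonym), not data from The Kamusi Project, and the function name is hypothetical.

```python
def count_entries_and_lemmas(rows):
    """Return (number of entries, number of unique lemma signs).

    Each row is (headword, part_of_speech, gloss). Lemma signs are counted
    regardless of part of speech, as in the text.
    """
    rows = list(rows)
    unique_lemma_signs = {headword for headword, _pos, _gloss in rows}
    return len(rows), len(unique_lemma_signs)


# One row per sense or synonym, so a headword can occur several times.
ROWS = [
    ("soma", "verb", "read"),
    ("soma", "verb", "study"),
    ("mtu", "noun", "person"),
    ("mtu", "noun", "human being"),
    ("kitabu", "noun", "book"),
]

print(count_entries_and_lemmas(ROWS))
```

Here five entries reduce to three lemma signs, the same kind of reduction that takes the Swahili side from 58 038 'articles' to 18 411 items.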
The second-largest online African-language dictionary, for Lozi (S34), contains 24 000 items. Then follow dictionaries for ChiShona with 15 000 items, for Nyankore (J13) with 12 500 items, etc. At the other end of the spectrum, some of the online dictionaries contain as few as 100 items (for Ganda), items (for Setswana), 300 items (for Lingala), etc. The average number of items in the 182 online African-language dictionaries is 1 978.
It has already been pointed out that the 182 dictionaries cover 117 different languages. Many of these languages are spoken across country borders, such as Chewa (N31b) which is spoken in both Malawi and Botswana, or Fipa (M13) in Tanzania and Malawi, Luyia (J32) in Kenya and Uganda, Yaka (H31) in the DRC and Angola, etc. If one studies the distribution of the number of languages that have online dictionaries per country, the data shown in Table 1 are arrived at. From Table 1 it is clear that the greatest allocation is once more to be found in Tanzania, with as many as 81 languages covered. Neighbouring countries such as Kenya with online dictionaries for 14 languages, and Uganda for 10 languages, also score high. In Southern Africa, countries like Zambia, Zimbabwe, Malawi and Mozambique each cover more languages than South Africa, where there are but 4 languages with Internet dictionaries.
Based on the data found in Ethnologue, the 117 languages are spoken by over 100 million people. The dispersion once more moves between extremes. At one extreme, some languages covered are nearly extinct (Geviya (B30)), or are spoken by only a few (Zalamo (G33)), up to a few thousand people (Mpongwe (B11a), Kahe (E64), etc.). At the other extreme, some languages are spoken as primary language by over 5 million (Swahili, Sukuma (F21) and Gikuyu (E51)), over 6 million (Rundi (J62) and isiXhosa), over 7 million (ChiShona and Rwanda (J61)), up to over 9 million (isiZulu) people. Very roughly speaking, the average number of primary speakers per language for which there is at least one Internet dictionary is 1 million.
If one looks at dictionary typology, one notices that all but one of the 182 dictionaries are bilingual or multilingual. The only monolingual dictionary is the Duramazwi ReChiShona 'General Shona Dictionary' (Chimhundu 1999). Ironically, however, the interface of this monolingual dictionary is entirely in English. A full breakdown of the gloss and/or hub languages is shown in Table 2. As one could have expected, roughly nine out of ten dictionaries use English, and only one out of ten uses French, as the gloss/hub language. Unexpected, however, is the relatively large number of dictionaries that involve Esperanto.
None of the 182 dictionaries is stored on a computer in Africa. Even the electronic version of the Duramazwi ReChiShona was developed by The Norwegian Documentation Project, and is stored on a server in Oslo. Moreover, very few Africans were involved in the computerisation and creation of these online dictionaries. If one studies the various providers, one notices a clear bias towards academic institutions, which are responsible for eight out of every ten dictionaries. Dotcoms provide one out of seven dictionaries, and less than five percent are personal efforts. The exact distribution has been calculated in Table 3. In general, the soundest contents can be found for the Internet dictionaries compiled by academics, while the most versatile and appealing interfaces are those brought together by dotcoms. The average compilation year is 1981, with the distribution per decade as listed in Table 4. The number of users of the current online dictionaries is much higher than anticipated. For Swahili, for example, The Kamusi Project has received over 1.1 million visitors since mid-1995, the Freedict dictionary handles 700 visitors per day, while the Kamusi Kiswahili-Kiesperanto (Vessella 2001) is accessed at least 1 000 times per month. The online pop-up dictionaries for African languages available from Babylon have an average number of 1 400 users each. Lastly, Xhosa on the Web! (O'Kennon 1996-2003) has welcomed nearly 30 000 visitors so far.

The first South African online dictionary interface
From the overview presented above, at least two conclusions can be drawn. On the one hand, African-language lexicographers will have to admit that quite a substantial body of Internet dictionaries is already available. On the other hand, and this primarily from a South African perspective, one cannot deny the fact that the South African languages should and could be better represented as far as Internet dictionaries are concerned. TshwaneDJe, a Human Language Technology (HLT) development team, based in Pretoria and consisting of David Joffe, Gilles-Maurice de Schryver, D.J. Prinsloo and Salmina Nong, therefore decided to bring together all the material for the first South African Internet dictionary.
The choice fell on Sesotho sa Leboa (S32) as the first language for which to compile a dictionary, given that no online dictionaries were found for this language during the course of the Internet study summarised above. The expertise gained would then be applied to the compilation of other African-language Internet dictionaries. The starting point was Prinsloo and De Schryver's (2000) SeDiPro 1.0, a Sesotho sa Leboa to English dictionary available to the team in Microsoft Word format. Joffe wrote a parser to transfer the data to TshwaneLex, a novel and professional South African software application for dictionary compilation (Joffe, De Schryver and Prinsloo 2003, 2003a). TshwaneLex was designed in such a way that it can be used to produce hardcopy, CD-ROM as well as online dictionaries. On 22 April 2003, the first version of an Online Sesotho sa Leboa-English Dictionary was uploaded (De Schryver and Joffe 2003). Two months later, on 20 June 2003, the online dictionary was officially launched at the University of Pretoria.
Between the first upload and the launch, several adaptations were made and numerous extra features were added to the online dictionary. As such this dictionary is a direct implementation of the concept known as Simultaneous Feedback (De Schryver and Prinsloo 2000, 2000a), a methodology whereby especially indirect feedback is near-instantly 'fed back' into the compilation process of a dictionary. The lexicographic contents are currently being updated by Nong.
During the first two months, users primarily learned about the new online dictionary through word of mouth. On the eve of the launch, 366 different users had searched for 3 341 items, or on average 9.12 searches per person. This was equivalent to more than 50 searches by more than 7 different users per day. The first media release appeared two weeks later, on 4 July 2003 (cf. e.g. Mail and Guardian Online 2003). At the end of that day, the number of searches had already reached 5 779 by 802 different users, or an average of 78.09 searches by 12.15 persons per day. The great majority of these searches had been made from hosts in South Africa. This clearly exceeded even the wildest expectations at TshwaneDJe.
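The per-day search figure can be reproduced from the dates given in the text. The sketch below rests on one assumption that the article does not state explicitly, namely that days are counted inclusively from the first upload on 22 April 2003 up to and including 4 July 2003. The variable names are illustrative.

```python
from datetime import date

first_upload = date(2003, 4, 22)
media_release = date(2003, 7, 4)

# Assumption: both the upload day and the release day are counted.
days_online = (media_release - first_upload).days + 1

total_searches = 5779
searches_per_day = round(total_searches / days_online, 2)

print(days_online, searches_per_day)
```

Under this reading the dictionary had been online for 74 days, which yields the reported 78.09 searches per day.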
From a metalexicographic perspective, this online dictionary deserves some extra discussion. Firstly, it is the first African-language Internet dictionary that can be accessed in all languages covered by the dictionary. In this case, this means that all interface pages are available in both Sesotho sa Leboa and English. Primary speakers of Sesotho sa Leboa can thus for the first time consult a dictionary in their own language.
Secondly, although actually only the direction Sesotho sa Leboa to English exists, an English search index (which also includes support for multi-word units) has been added which makes it possible to search the dictionary as if the reverse side were also available. The layout of the output is also a first, as it shows how the senses in one language are spread all over the lexicon in another, and how these then again spread out, etc. With 24 921 items on the Sesotho sa Leboa side and 28 198 in the English index, this online dictionary becomes the largest African-language Internet dictionary.
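The idea of an English search index over a dictionary that exists in one direction only can be sketched as an inverted index. This is a simplified illustration and not a description of how TshwaneLex implements the feature: the three entries are illustrative examples rather than entries quoted from the online dictionary, and the names `ENTRIES` and `build_reverse_index` are hypothetical.

```python
from collections import defaultdict

# Sesotho sa Leboa headword -> English equivalents (illustrative data only).
ENTRIES = {
    "motho": ["person", "human being"],
    "meetse": ["water"],
    "mpša": ["dog"],
}


def build_reverse_index(entries: dict) -> dict:
    """Map each English equivalent, including multi-word units, to its headwords."""
    index = defaultdict(list)
    for headword, equivalents in entries.items():
        for equivalent in equivalents:
            index[equivalent.lower()].append(headword)
    return {key: sorted(headwords) for key, headwords in index.items()}


english_index = build_reverse_index(ENTRIES)
print(english_index["human being"])
print(len(ENTRIES), len(english_index))
```

Because every distinct equivalent becomes an index key, the index can hold more items than the source side, as with the 28 198 English items against 24 921 Sesotho sa Leboa items.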
Thirdly, besides a general-language dictionary, this is also the first online dictionary that includes a dedicated terminology list for an African language. The terminology list that has currently been added is one for linguistics, containing over 300 terms, and more terminology lists are planned.
Fourthly, when consulting the terminology list, users can choose between look-up and browse mode. This is thus an original implementation of Atkins's (1996) innovative view of future electronic dictionaries. According to her, "the user is in search of a specific piece of information" in look-up mode, while "a more relaxed reading takes place" in browse mode (1996: 529). In look-up mode users are furthermore re-routed from (potentially) incorrectly to correctly spelled items for words involving the letters s/š, e/ê and o/ô.
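The re-routing from (potentially) incorrectly to correctly spelled items for s/š, e/ê and o/ô can be approximated by folding both the query and the headwords onto their plain letters before comparison. The sketch below is one possible approach under that assumption, not the actual implementation of the online dictionary. The sample headwords and the names `fold`, `build_spelling_index` and `reroute` are illustrative.

```python
import unicodedata

# Fold the three special letters onto their plain counterparts.
FOLD = str.maketrans({"š": "s", "ê": "e", "ô": "o"})


def fold(word: str) -> str:
    """Lower-case a word and strip the š/ê/ô distinctions."""
    # NFC ensures 's' + combining caron is treated as the single letter 'š'.
    return unicodedata.normalize("NFC", word).lower().translate(FOLD)


def build_spelling_index(headwords) -> dict:
    """Map each folded form to the correctly spelled headwords it stands for."""
    index = {}
    for headword in headwords:
        index.setdefault(fold(headword), []).append(headword)
    return index


def reroute(query: str, index: dict) -> list:
    """Return the correctly spelled headwords matching a possibly plain query."""
    return index.get(fold(query), [])


index = build_spelling_index(["mpša", "šoma", "motho"])
print(reroute("mpsa", index))
print(reroute("Soma", index))
```

A query typed without the special characters thus still reaches the intended item, while correctly spelled queries are unaffected.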
Lastly, and also most importantly, the terminology list contains a world's first for an online dictionary, namely the customisation of the output of part-of-speech (POS) tags, usage labels and cross-references depending on the language chosen. As such, this is the first step towards one concept of the dictionary of the future, viz. Fuzzy SF (De Schryver and Prinsloo 2001). In Fuzzy SF, or Fuzzy Simultaneous Feedback, "log-file based Artificial Intelligence components enable the implicit retrieval of personalised user feedback with which the package customises each user's own and unique dictionary" (De Schryver 2003: 189).

Conclusion
In this article a near-exhaustive overview was presented of the current state-of-the-art of African-language Internet dictionaries. The concepts 'African languages' and 'Internet dictionaries' were first defined for the purposes of this article. All currently available African-language Internet dictionaries were then reviewed, listed and compared to one another. Various statistics were calculated and distributions shown, from which one may conclude that there is a geographic bias towards the languages of East Africa, especially Tanzania. Among the most successful implementations one must count the hub-and-spoke model as used for the presentation of the data from the Tanzania Language Survey, now part of the CBOLD web site.
A surprising number of 182 dictionaries were uncovered, for 117 different languages. The South African share was shown to be small. Although an estimated 100 million people speak the languages covered, just one of the dictionaries is a monolingual one. None of the dictionaries is stored in Africa, and few Africans contributed to the computational creation of these dictionaries. Most dictionaries are the output of academic institutions, are relatively recent, and have a higher-than-expected number of users. The most popular dictionaries are those for Swahili, for which there are as many as 20.
In order to turn the relatively inactive online lexicographic tide for the languages spoken in South Africa, it was indicated how the HLT development team TshwaneDJe decided to produce the first truly South African online dictionary interface. The language embarked upon is Sesotho sa Leboa. Compilation is undertaken within the frameworks of Simultaneous Feedback (SF) and Fuzzy SF, and it was shown how, in less than three months, the number of searches and users had already reached unexpected heights. The dictionary is currently the largest African-language Internet dictionary. Among the novelties of the online Sesotho sa Leboa dictionary, the dual dictionary interface language (including the first in an African language), a layout inherently departing from an African language, the first searchable African-language Internet terminology list, the optional look-up and browse modes, as well as the first steps towards user customisation, were highlighted. As such, South African lexicography is already writing the future.

Table 1: Distribution of the number of African languages with Internet dictionaries per country

Table 2: Breakdown of the gloss and/or hub languages for all African-language Internet dictionaries

Table 3: Providers of African-language Internet dictionaries

Table 4: Number of African-language Internet dictionaries compiled per decade