Compiling a Bidirectional Dictionary Bridging English and the Sotho Languages: A Viability Study

The aim of this article is to investigate the viability of the compilation of a single bidirectional dictionary with a single lemma list for the Sesotho sa Leboa, Setswana and Sesotho → English side and a simultaneous treatment of the three Sotho languages in the articles of the English lemmas in the English → Sesotho sa Leboa, Setswana and Sesotho side of the dictionary. Specific attention will be given to selected macrostructural and microstructural aspects of such a compilation.


Introduction
The aim of this article is to study the viability of a bidirectional dictionary bridging English and the Sotho languages: Sesotho, Setswana and Sesotho sa Leboa. The focus will be on the advantages and disadvantages of such a single dictionary compared to three comparative bidirectional bilinguals: English-Sesotho/Sesotho-English, English-Setswana/Setswana-English and English-Sesotho sa Leboa/Sesotho sa Leboa-English, and the additional value it would have in the absence of bilingual dictionaries bridging African languages with each other. This will be an important achievement since publishers generally do not regard the compilation of separate dictionaries bridging the African languages with each other as economically viable. Compiling a dictionary with a single lemma list for the Sotho languages at this stage can indeed become the forerunner to such an eventual goal, i.e. true bidirectional dictionaries bridging the African languages with each other. It also has the potential to pave the way for an English ↔ Nguni languages dictionary.
The analysis and design of the macrostructure and microstructure will be based on existing bilingual dictionaries bridging English and a Sotho language and will be aimed at the same target users. The viability study will firstly be performed for a combined article for the Sotho languages usable mainly for basic receptive information, i.e. treatment limited to a translation equivalent or two, and secondly for combined articles where a more exhaustive treatment is given. The bilingual dictionaries analysed are The New English-Northern Sotho Dictionary, English-Northern Sotho, Northern Sotho-English (NEN) (Kriel 1976) for Sesotho sa Leboa, Dikišinare ya Setswana English Afrikaans (DS) (Snyman et al. 1990) and Setswana-English-Setswana Dictionary (SESD) (Matumo 1993) for Setswana, and Southern Sotho-English Dictionary (SSED) (Mabille and Dieterlen 1988) for Sesotho.
The results of this study will hopefully enable prospective compilers to decide whether it is worthwhile to compile such a dictionary and to provide guidelines and examples for such a compilation. It is not possible to do a detailed analysis of all relevant lexicographic aspects within the limitation of a journal article and the discussion will therefore be limited to a number of key microstructural and macrostructural aspects.
The compilation of such a dictionary will require the combined skills of mother-tongue speakers of all four languages and corpora for these languages.

Impact and range of application for the Sotho and Nguni languages
A bidirectional English → {Sesotho, Setswana and Sesotho sa Leboa}, {Sesotho, Setswana and Sesotho sa Leboa} → English dictionary is comparable to three bidirectional bilingual dictionaries, English-Sesotho, Sesotho-English, English-Setswana, Setswana-English, and English-Sesotho sa Leboa, Sesotho sa Leboa-English, thus two directions for the envisaged model versus six sides for separately bridging English and a Sotho language. It could be argued that separate bilinguals bridging English and each of the Sotho languages do exist but in most cases they are out of print or in need of revision. The situation for bridging Sotho languages with each other is much less promising. At this stage in the development of South African lexicography publishers' interest is virtually non-existent for bridging African languages with each other, thus little hope for bidirectional Sesotho sa Leboa ↔ Sesotho, Sesotho sa Leboa ↔ Setswana and Sesotho ↔ Setswana dictionaries.

Using the Dutch-Afrikaans dictionary as a design model
The envisaged English ↔ Sotho languages dictionary is a multifunctional dictionary where a dictionary consultation environment is created in which, in terms of Martin and Gouws (2000: 788), 'both differences and similarities become apparent in an efficient and contrastive way'. Reflecting differences and similarities will indeed be the key factor in simultaneous treatment of the Sotho languages.
A second important observation made by Martin and Gouws (2000: 790) is that 'the combinatory data represents the core of the lexicographic presentation'. Compare in this regard the approach of Martin and Gouws (2000: 790), where these principles are honoured for non-contrastive combinations, contrastive combinations and idiomatic expressions in the article of the lemma bril 'spectacles'. Non-contrastive combinations are marked by '-' and only a Dutch example is given. Contrastive combinations are marked with '•' and '♦' marks idiomatic expressions. In this way different search zones (in a fixed order) are clearly marked in a user-friendly way and differences and similarities are clearly illustrated.
The envisaged English ↔ Sotho languages dictionary reflects a striking resemblance to, but also clear differences with, the Dutch-Afrikaans dictionary described by Martin and Gouws (2000). Among the similarities on macrostructural level count the compilation of a single central list, i.e. a single access structure, and consideration of different lemma types. The study for the Sotho languages differs from the model of Martin and Gouws in that the lemma list is for three and not for only two languages and that a full bridging with English is done.

Size and impact of a single lemma list
The prospective compiler of a dictionary with a single lemma list should firstly decide on the size of the lemma lists for both sides of the dictionary in order to compile a dictionary that would cover a reasonable percentage of use of the four languages in question. As a point of departure an assessment was made of the size of the lemma lists of dictionaries bridging English and Sotho languages as well as the top frequencies in English dictionaries such as Collins COBUILD English Dictionary (COBUILD2), Macmillan English Dictionary for Advanced Learners (MED) and Longman Dictionary of Contemporary English (LDOCE).
For English, the data given in COBUILD2 regarding the impact of the frequency bands give useful guidelines to the size and nature of an English lemma list for the envisaged dictionary. From Table 1 it is clear that the top 14 700 lemmas represent an astonishing 95% of the tokens or running words in a given English text.
'The words in the five frequency bands are of immense importance to learners because they make up 95% of all spoken and written English.' (COBUILD2 1995: xiii)
The sizes of the lemma lists for the Sotho languages are reflected in Table 2. The impact of a single lemma list for the Sotho languages will be studied taking the top 10 000 words in the Pretoria corpora for each of the three languages as a point of departure. The number of types given in the final row of Table 3 reflects 100% of the use of the languages given in terms of tokens in the second row. This simply means, for example, that a Sesotho sa Leboa dictionary containing 150 000 lemmas would account for each of the 5.9 million words in the 327 texts that make up this corpus. The same analysis is applicable to Setswana and Sesotho from the figures given in Table 3.
The question is what the impact of a lemma list consisting of only 10 000 lemmas for each of the languages will be in terms of token coverage. The comparable statistics for the Sotho languages are given in Table 4. From Table 4 it is clear that, as in the case of English, lemma lists compiled for the top 10 000 types in each of the Sotho languages represent more than 90% of the use of the language. It can therefore be argued that the selection of 10 000 lemmas for English, Sesotho sa Leboa, Setswana and Sesotho is viable in terms of considerable coverage of all four languages.
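Token coverage of the kind reported for the frequency bands and for the top 10 000 words can be computed for any corpus with a simple frequency count. The following is a minimal sketch in Python and not the procedure actually used for the Pretoria corpora: the function name top_n_coverage and the toy token list are illustrative assumptions, and a real corpus would first have to be tokenised.

```python
from collections import Counter


def top_n_coverage(tokens, n):
    """Return the share of running words (tokens) covered by the n most frequent types."""
    counts = Counter(tokens)          # type -> number of occurrences
    total = sum(counts.values())      # corpus size in tokens
    if total == 0:
        return 0.0
    covered = sum(freq for _, freq in counts.most_common(n))
    return covered / total


# Toy data only (hypothetical); a real run would read a tokenised corpus.
toy_tokens = ["pula", "pula", "pula", "motho", "motho", "nna"]
coverage = top_n_coverage(toy_tokens, 2)
print(f"{coverage:.1%}")
```

In the toy list the two most frequent types account for five of the six tokens. Applied to a full corpus with n set to 10 000, the same calculation would yield a percentage of the kind summarised in Table 4.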

Lexical overlap in the Sotho languages
The second aspect studied on macrostructural level in the consideration of a consolidated lemma list is the percentage of words that the languages have in common, simply referred to as overlap.It stands to reason that the greater the overlap the better the chances of success for such a dictionary will be.
As a point of departure the overlap between Dutch and Afrikaans was studied since, as reported above, that project is regarded as a viable one. A comparison between Dutch and Afrikaans corpora reveals an overlap of 20%. Consider in this regard a selection of such mutual lexical items with high occurrence frequencies per million running words in Table 5. For the English → Sotho languages side of the dictionary a flying start exists (technically speaking a 100% overlap), since the lemma list will consist of only 10 000 English lemmas and not three times 10 000 lemmas as for three separate dictionaries.
For the Sotho languages a comparison of the top 10 000 words in Sesotho, Setswana and Sesotho sa Leboa reveals that the three languages have 1 943 (19.4%) words in common. Sesotho sa Leboa and Setswana share 3 276 (32.7%) words. Sesotho sa Leboa and Sesotho have 2 689 (26.9%) words in common and Setswana and Sesotho share 3 441 (34.4%) words. This results in a single lemma list of 22 537 compared to a 30 000 lemma list in three separate dictionaries, thus a reduction of about 25%. The trilingual overlap, i.e. words that all three languages share, within the top 100 is 32%, i.e. the 32 words in (3).
(6) ntse, nngwe, tswa, ne, neng, teng, ena, utlwa, tsa, tsena, tse, tle, bua, rona
Sesotho sa Leboa has 40 unique words and Setswana and Sesotho 39 and 41 unique words respectively. This renders a single lemma list of 194 lemmas.
For this section it can be concluded that lemma lists based upon the top 10 000 types in English and the Sotho languages will render sufficient coverage of these languages and that the amount of overlap in the Sotho languages and the resulting single lemma list suggest that the compilation of an English ↔ Sotho languages dictionary is viable on macrostructural level. The prospective compiler, however, has to keep in mind that words which have the same orthographic form in the Sotho languages but different grammatical functions will have to be entered as more than one lemma depending on the lexicographic approach. For example, nna in (3) functions as a pronoun of the first person singular in the Sotho languages but also as a verb in Setswana.
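The size of the single lemma list follows from the overlap figures by inclusion-exclusion. The sketch below, in Python, assumes that each reported pairwise figure includes the words shared by all three languages; that assumption is not stated in the text but reproduces the reported total of 22 537. The function name merged_list_size is introduced here for illustration only.

```python
def merged_list_size(size_a, size_b, size_c, ab, ac, bc, abc):
    """Size of the union of three word lists by inclusion-exclusion.

    ab, ac and bc are pairwise overlaps (assumed to include the
    three-way overlap); abc is the overlap shared by all three lists.
    """
    return size_a + size_b + size_c - ab - ac - bc + abc


# Figures for the top 10 000 words as reported in this section.
single_list = merged_list_size(10_000, 10_000, 10_000,
                               3_276, 2_689, 3_441, 1_943)
reduction = 1 - single_list / 30_000

# Top 100: the reported list of 194 lemmas consists of 32 words shared
# by all three languages, 40 + 39 + 41 unique words, and the remainder
# shared by exactly two languages.
shared_by_two_only = 194 - 32 - (40 + 39 + 41)

print(single_list, f"{reduction:.1%}", shared_by_two_only)
```

Under the stated assumption the calculation returns 22 537 lemmas, a saving of roughly a quarter against 30 000, and leaves 42 words in the top 100 that are shared by exactly two of the three languages.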

The microstructure
On microstructural level preliminary tests indicate that the average article length in the envisaged English → Sotho languages side of the dictionary would vary between one-third and two-thirds of the combined article length of the English → Sesotho/Setswana/Sesotho sa Leboa sides of three separate dictionaries, thus a 30%-60% reduction. Compare the following randomly selected lemmas where unmarked forms such as pula 'rain' and motho 'a person' reflect a complete overlap between the three Sotho languages while double subscripts, e.g. gagwe 'his/her' and phela 'live', mark similarities between two languages and single subscripts, such as for hae 'his/her' and jang 'how', uniqueness in one language only.
The real challenge lies in the successful compilation of the Sotho articles in the English → Sotho languages side of the dictionary. Failure to do so will simply result in articles reflecting the mere stacking of translation equivalents of English lemmas in Sesotho sa Leboa, Setswana and Sesotho, without consideration of crucial aspects of differences, similarities and combinatory data as highlighted in terms of Martin and Gouws (2000) above. There will thus be no gain in reduction and comparison. Consider Table 6 as an extract from the Concise Multilingual Dictionary (CMD) as a case in point. Firstly, translation equivalents are simply chronologically stacked for each language with considerable repetition in both the Sotho and Nguni languages without any attempt towards reduction. Secondly, a complete lack of communicative equivalence poses a great risk that the user will use the equivalent(s) incorrectly. For example, tswalela has a limited range of application in Sesotho sa Leboa and cannot be used in all contexts as an equivalent of 'close', thus misleading the user. He/She is further misled by gross inconsistencies, e.g. in the final row where the compilers failed to add the word for water in isiXhosa, Sesotho sa Leboa and Setswana. The user would conclude that monate means 'pretty water' in Setswana while it only means 'nice, pretty'.
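The difference between stacking equivalents and treating them simultaneously can be illustrated with a small sketch in Python. It is not a proposal for the actual compilation software: the function merge_equivalents, the labels SsL, Set and Ses, and the bracket notation standing in for the subscripts are assumptions made for illustration. The forms pula, gore and hore are taken from this article.

```python
from collections import defaultdict

# Hypothetical labels: Sesotho sa Leboa, Setswana, Sesotho.
LANGS = ("SsL", "Set", "Ses")


def merge_equivalents(per_language):
    """Collapse identical forms and label each form with the languages using it.

    Forms shared by all three languages are left unmarked; other forms
    carry a label such as [SsL/Set] or [Ses].
    """
    by_form = defaultdict(list)
    for lang in LANGS:
        for form in per_language.get(lang, []):
            if lang not in by_form[form]:
                by_form[form].append(lang)
    parts = []
    for form, langs in by_form.items():
        if len(langs) == len(LANGS):
            parts.append(form)                        # complete overlap
        else:
            parts.append(f"{form} [{'/'.join(langs)}]")
    return ", ".join(parts)


rain = merge_equivalents({"SsL": ["pula"], "Set": ["pula"], "Ses": ["pula"]})
that = merge_equivalents({"SsL": ["gore"], "Set": ["gore"], "Ses": ["hore"]})
print(rain)
print(that)
```

A stacked article would list three separate equivalents for each lemma, whereas the merged form gives one unmarked equivalent for 'rain' and two labelled equivalents for 'that'. Such a merge handles only the orthographic forms; differences in range of application, as noted for tswalela, still require the judgement of the lexicographer.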
In spite of its shortcomings, it could be argued that an English → Sotho languages/Nguni languages/Afrikaans dictionary of this magnitude is a useful contribution in the complete absence of dictionaries bridging African languages with each other and could be improved by simultaneous treatment of the target languages. Consider the following attempt to improve CMD's articles for the lemmas coffee, page and verb in Table 7 versus example (8). Sensible reduction in these examples is achieved in terms of, among others, tonal indication, grammatical information, and translation equivalents. They should, however, be submitted to target users and the feedback obtained should be carefully studied.
In the Sotho languages → English side, the mediostructure (system of cross-referencing) can be fruitfully utilised to link the lemma with its equivalent in the other Sotho language(s), thus further strengthening the aspect of bridging Sesotho sa Leboa, Setswana and Sesotho with each other.
(10) hore [Ses] conj. that, in order that … cf. gore [SsL/Set]
Selecting suitable dictionary conventions will be crucial in order to present user-friendly search zones. In the examples above, subscripts were used to mark the distinctions between Sesotho, Setswana and Sesotho sa Leboa. Similar layouts should be tested utilising a combination of different colours and standard conventions such as bold, underline and italics, as in Figure 2, where the different languages are marked in colour and coloured shadings (which can unfortunately not be reproduced in this article) with the aid of running footers.

Conclusion
In this article macrostructural and microstructural aspects were analysed in terms of the viability of an English ↔ Sotho languages dictionary with a single lemma list for the Sotho languages and simultaneous treatment of the Sotho languages in the English → Sotho languages side. It can be concluded that such a compilation will be successful (a) for lemma lists of a reasonable size taken from all four languages as a point of departure, (b) because substantial lexical overlap exists between Sesotho sa Leboa, Setswana and Sesotho, (c) because treatment in terms of similarities, differences and combinatory data is possible, and (d) because user-friendly articles comprehensive enough for receptive and limited productive use can be compiled.

Figure 2: Using colour and colour shading in Sotho languages articles of English lemmas

Table 1: Summary of frequency band values in COBUILD2

Table 3: Sources, types and tokens in the Pretoria Sotho languages corpora

Table 4: Percentage of tokens represented by the top 10 000 types

Table 5: Afrikaans compared to Dutch: mutual lexical items, with frequencies per million running words (Gouws et al. 2004: 798)

Table 7: Concise Multilingual Dictionary

The aim of this viability study, however, is to compile more comprehensive articles, at least of the same size as existing bilingual dictionaries such as NEN, DS, SESD and SSED. The articles compiled in (9) for the English → Sotho languages and Sotho languages → English sides of the dictionary are still, relatively speaking, restricted to receptive use but they do go some way towards productive use and communicative equivalence.