Lexicographic Treatment of Kinship Terms in an English / Sepedi – Setswana – Sesotho Dictionary with an Amalgamated Lemmalist

This article describes the lemmatisation and treatment of kinship terms in a proposed English–Sotho, Sotho–English dictionary with an amalgamated lemmalist. The first requirement is to build a list of kinship terminology for the Sotho languages. Secondly, given space restrictions, it is necessary to determine the most frequently used forms to be lemmatised in such a dictionary. Thirdly, the macrostructure and microstructure of the dictionary should be planned in terms of an amalgamated approach. A short explanation of the amalgamated model is presented and a schematic illustration of the paternal family tree structure in the Sotho languages is given in the appendix. Specific attention is given to the compilation of the amalgamated lemmalist, focusing on absolute cognates and absolute cognates with a difference in form. Finally, where the reduction of huge quantities of terms, e.g. all derived forms of a specific term in all three Sotho languages, is at stake, a lexicographic convention is suggested to sensibly reduce the number of lemmas and to combat redundancy.


Introduction
The aim of this article is to describe the treatment of kinship terms in an English-Sotho, Sotho-English dictionary with an amalgamated lemmalist. The kinship system in the Sotho languages is complicated (see appendix) and was selected as an object of study in order to test the viability of the amalgamated approach for such complex structures. Prinsloo (2012) distinguishes three categories of kinship terms for Sepedi, i.e. underived single words such as malome 'uncle', rakgadi 'aunt' and tate 'father', derived words such as malomeagwe 'his uncle', morwediake 'my daughter' and bomalomeago 'your uncles', and phrases such as the possessive construction mogatša wa mokgotse wa ka 'my brother-in-law's wife'. He suggested specific lemmatisation strategies to cater for the large number of kinship terms in these categories in Sepedi, including a specific dictionary convention. An attempt to handle kinship terminology for three languages simultaneously is an even greater challenge, since quantity-wise the number of kinship terms to be lemmatised is threefold and new challenges on macrostructural as well as on microstructural levels come to the fore. The question is whether it is possible to do justice to all three languages in terms of similarities versus differences, following an amalgamated approach.
The first requirement for the lexicographer is to build a list of kinship terminology for the Sotho languages. It is also necessary in terms of space restriction to determine the most frequently used forms to be lemmatised in such a dictionary. In this article an attempt will be made to collect a number of kinship terms for the Sotho languages, i.e. Sepedi, Setswana and Sesotho. Secondly, the frequency of use of Sotho kinship terms in corpora for these languages will be determined. Thirdly, the treatment of Sotho kinship terms in separate English Sepedi/Setswana/Sesotho dictionaries will be studied in order to establish the viability of such an amalgamated approach and in order to suggest model dictionary articles.
The collection of Sepedi kinship terminology is mainly based on Prinsloo and Van Wyk (1992), Setswana on Van Wyk and Haasbroek (1990) and Sesotho on Molalapata (2004), supplemented by terms found in dictionaries and corpora of the Sotho languages. By way of introduction, a short explanation of the amalgamated model will be presented, followed by a schematic illustration of the paternal family tree structure in the Sotho languages. Finally, the formulation of model entries with an amalgamated approach will be presented.

The amalgamated model
Martin and Gouws (2000) are credited with introducing the concept of amalgamated dictionaries and with compiling the first such dictionary, for Afrikaans and Dutch, Groot Woordeboek Afrikaans en Nederlands (ANNA).
The ANNA-approach is to provide treatment for what the amalgamated languages have in common first (A|N, A=Afrikaans, N=Nederlands (Dutch)) followed by the treatment of aspects applicable to the specific languages. Consider the article of ouderwets 'old fashioned' from ANNA. The article consists of three sections, viz. A|N, N and A. Similarities and differences are indicated throughout by the symbols "=" 'equal' and "≠" 'differ' respectively.

ouderwets bnw., ouderwets b.nw.
A|N (v.vroeger) ouderwets
= ouderwetse kleren ouderwetse klere; een ouderwetse stoomtrein 'n ouderwetse stoomtrein; ouderwetse opvattingen ouderwetse opvattings; hopeloos ouderwets hopeloos ouderwets
≠ stewige ouderwetse meubels oerdegelijk meubilair
N (net als vroeger) outyds, ouwêrelds
= ouderwetse degelijkheid outydse deeglikheid; een ouderwetse winter 'n outydse winter
≠ het was weer ouderwets gezellig dit was weer gesellig soos in die ou tyd
A (oulik; slim) bijdehand
= 'n ouderwetse kind een bijdehand kind

Detailed discussions of the amalgamated approach and of ANNA in particular can be found in Martin (2012a and 2012b), Martin and Gouws (2000), Marais (2011), Bosman (2013) and in the user's guide of ANNA. Martin's intention with the amalgamated model was also to pave the way for other closely related languages:

the aim was not only to produce a contrastive dictionary Afrikaans-Dutch, but also to lay the foundation for an exportable model, one that could be used for other closely related languages, such as the 'black' languages in South Africa: Xhosa and Zulu, and North-Sotho, South Sotho and Tswana etc. (Martin 2012b: 413)

Amalgamated dictionaries would employ a single lemmalist for closely related languages such as Afrikaans/Dutch, Sepedi/Setswana/Sesotho, isiZulu/isiXhosa/Siswati/isiNdebele and have a unique microstructural architecture in their treatment of the languages in question.
The first requirement for an amalgamated approach is that the languages to be treated should be closely related, i.e. that they should have a substantial number of words in common.

it can only be applied to closely related languages … both the 'form' of the words (spelling) needs to be the 'same' and at least one of the meanings. … there has to be a sufficient critical mass. (Martin 2012b: 414)

Martin (2012b: 415) puts the overlap between Afrikaans and Dutch as 2/3, i.e. 66.7%.

http://lexikos.journals.ac.za

For the Sotho languages Prinsloo compared the 10,000 most frequently used words in Sesotho, Setswana and Sepedi corpora and came to the conclusion that the vocabulary of these languages overlaps to a large extent. The three languages have 19,4% words in common, Sepedi and Setswana share 32,7%, Sepedi and Sesotho 26,9% and Setswana and Sesotho 34,4%. This degree of overlap would result in a single amalgamated lemmalist of 22,537 in contrast to a list of 30,000 lemmas (10,000 each for Sepedi, Setswana and Sesotho) if three separate dictionaries were compiled, thus a saving of roughly 25%. For the English-Sotho section the space saving stands at 67% compared to three English sections in three separate dictionaries, i.e. English-Sepedi, English-Setswana and English-Sesotho.

Martin (ANNA: 25) distinguishes five different types of words relevant to an amalgamated approach, i.e. (a) absolute cognates: words in the related languages which are identical in form and meaning, (b) absolute cognates with difference in form, (c) partial cognates: words that differ in at least one sense, (d) non-cognates: words with the same meaning but clear difference in form and (e) false friends: words identical in form but which differ in meaning.
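The space savings discussed in this section follow from simple arithmetic on the lemma counts. The sketch below, in which the function name `saving` is merely illustrative, recomputes them from the totals quoted in the text.

```python
def saving(amalgamated: int, separate: int) -> float:
    """Return the fraction of entries saved by amalgamating the lists."""
    return 1 - amalgamated / separate


# Sotho side: one merged lemmalist of 22,537 lemmas versus
# three separate lists of 10,000 lemmas each.
sotho_side = saving(22_537, 3 * 10_000)

# English side: one English lemmalist instead of three identical ones.
english_side = saving(1, 3)

print(f"Saving on the Sotho lemmalist:   {sotho_side:.1%}")
print(f"Saving on the English lemmalist: {english_side:.1%}")
```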
In this article the focus will be on absolute cognates, and cognates with a difference in form.
Absolute cognates are most beneficial to an amalgamated approach because a single lemma represents all of the related languages. For example, the translation equivalents for woman, love and neck in Sepedi, Setswana and Sesotho are identical.

woman n. mosadi
love v. rata
neck n. molala

Likewise, efe in all three Sotho languages can be translated with a single equivalent:

efe enum. which (one)?

Absolute cognates are discussed in more detail in paragraph 4 below.
Absolute cognates with difference in form, e.g. kgaetšedi (Sepedi), kgaitsadi (Setswana) and kgaitsedi (Sesotho), 'sister/brother', also fit within an amalgamated approach but with some consequences for user-friendliness, cross-referencing and redundancy which will be discussed in more detail below. Partial cognates find their place in an amalgamated approach but require separate treatment for senses where they differ. Non-cognates do not bring much gain in an amalgamated approach since they have to be lemmatised and treated separately. See Prinsloo (2013) for a detailed discussion.
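The five word types can be made concrete with a small decision procedure. The following is only a sketch under stated assumptions, not the procedure used in ANNA: the function name `classify`, the representation of senses as sets of English glosses, and the similarity threshold of 0.6 on the standard library's `difflib` ratio are all illustrative choices. The glosses for ouderwets are approximate renderings of the senses shown in the ANNA article.

```python
from difflib import SequenceMatcher


def classify(form_a, senses_a, form_b, senses_b, threshold=0.6):
    """Classify a word pair from two related languages.

    Senses are sets of glosses. The threshold for 'recognisably
    similar' forms is an assumption made for this sketch.
    """
    if form_a == form_b:
        if senses_a == senses_b:
            return "absolute cognate"
        if senses_a & senses_b:
            return "partial cognate"
        return "false friend"
    if senses_a == senses_b:
        ratio = SequenceMatcher(None, form_a, form_b).ratio()
        if ratio >= threshold:
            return "absolute cognate with difference in form"
        return "non-cognate"
    return "unclassified"


print(classify("mosadi", {"woman"}, "mosadi", {"woman"}))
print(classify("kgaetšedi", {"sister/brother"},
               "kgaitsadi", {"sister/brother"}))
print(classify("ouderwets", {"old-fashioned", "as in the old days"},
               "ouderwets", {"old-fashioned", "precocious"}))
```

A fixed numeric threshold cannot replace the lexicographer's judgement of what target users will recognise; it only marks where such a judgement enters the procedure.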

Kinship terms in the Sotho languages
An attempt was made to capture single-word kinship terms for Sepedi, Setswana and Sesotho from Prinsloo and Van Wyk (1992), Van Wyk and Haasbroek (1990) and Molalapata (2004) respectively. Their occurrence in the respective corpora was subsequently determined. Finally, a randomly selected number of dictionaries for each of these languages were studied in terms of their lemmatisation and treatment of kinship terms occurring more than once in these corpora.

Sepedi kinship terms
Single-word kinship terms from Prinsloo and Van Wyk (1992) that occur in the Pretoria Sepedi Corpus (PSC) and in one or more of five randomly selected Sepedi dictionaries are given in table 1 with their frequency counts and inclusion versus omission from the dictionaries marked as "√" and "x" respectively.
* frequency counts include homonyms which are not kinship terms

Setswana kinship terms
Single-word kinship terms from Van Wyk and Haasbroek (1990) that occur in the Pretoria Setswana Corpus (PSETC) and in one or more of five randomly selected Setswana dictionaries are given in table 2.

In all three Sotho languages ntate, rangwane and malome, respectively, have the same meanings, e.g. ntate 'father' in all three languages. Thus in the Sotho to English section space is saved in comparison to three separate dictionaries.
Ramogolo, rangwane and malome are lemmatised once instead of three times, each in a Sepedi-English, Setswana-English and Sesotho-English dictionary. In the English-Sotho section uncle is lemmatised only once instead of three times in three separate dictionaries (English-Sepedi, English-Setswana and English-Sesotho).

uncle ramogolo/rremogolo (father's older brother), rangwane (father's younger brother), malome (mother's brother)

Ideally a single term for uncle in all three of the Sotho languages would have resulted in additional space saving, as in the case of ntate '(my) father'. In the case of uncle, semantic divergence does not lie on the level of differences between the Sotho languages: ramogolo 'father's older brother', rangwane 'father's younger brother' and malome 'mother's brother' have the same meanings respectively in all three Sotho languages. They, however, refer to different relations in terms of the age of the related person and his position in the family tree; consider extracts from the family tree from Prinsloo and Van Wyk (1992) given in the appendix: 3) is malome. Ramogolo, rangwane and malome are lexicalised terms that, for lack of equivalents, can at best be described in English by means of a paraphrase: "a man's father's elder brother", "father's younger brother" and "mother's brother".
In the case of absolute cognates with difference in form, the first consideration is the presumed knowledge of the target users. The more knowledgeable they are of one or more of the Sotho languages, (a) the more user-friendly an amalgamated lemmalist will be to them, (b) the less problematic it will be for the lexicographer to compile such a lemmalist and (c) the less reliant the compilation of the lemmalist will be on cross-referencing as a lexicographic device to combat the decontextualisation brought about by strict alphabetical ordering.
The key consideration, however, for the compilation of an amalgamated lemmalist for this type of cognates is the degree/extent of the difference in form.Martin (2012b: 14) categorises such words as items with a small, systematic spelling or morphological difference or items with a bigger, non-systematic difference but which are still recognizably similar in form.He gives Dutch pompoen 'pumpkin' and pinguïn 'penguin' versus Afrikaans pampoen and pikkewyn respectively as examples.There is only a minor difference between pompoen and pampoen but a substantial difference between pinguïn and pikkewyn.For the Sotho languages a closer look at the degrees of similarity/difference in spelling is required and will be attempted in a hierarchical order from "very similar" to "more substantial" differences.
The first instance pertains to words which differ only in terms of a diacritic sign, e.g. s versus š. Setswana and Sesotho use the same word ngwetsi 'daughter-in-law' versus Sepedi ngwetši. Here the same letter (s) occurs; the only difference is s with or without the inverted circumflex (ˇ), and there is no need to lemmatise ngwetsi and ngwetši as separate lemmas with ngwetši as a main lemma directly following ngwetsi in the vertical layout of alphabetical ordering. The lemma paradigm could simply consist of a presentation indicating the names of all three languages, ngwetsi[Set, Ses], ngwetši[Sep], or of an unmarked ngwetsi followed by a marked occurrence for ngwetši, i.e. ngwetsi, ngwetši[Sep].
There is also no need for cross-referencing.
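Pairs of this first type can be detected mechanically. A minimal sketch, assuming that a diacritic is any Unicode combining mark exposed by canonical decomposition (the helper names are illustrative):

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining marks, so that 'š' is reduced to 's'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def differ_only_in_diacritics(a: str, b: str) -> bool:
    """True if the forms are distinct but identical once marks are removed."""
    return a != b and strip_diacritics(a) == strip_diacritics(b)


print(differ_only_in_diacritics("ngwetsi", "ngwetši"))
print(differ_only_in_diacritics("morwa", "mora"))
```

Forms for which the check succeeds are candidates for a single lemma paradigm without cross-referencing.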
The second type of typical examples are words which differ in terms of a single letter. This single letter could be (a) different, or (b) added/omitted. Sepedi and Setswana have ntatemogolo 'grandfather' compared to Sesotho ntatemoholo, with "g" versus "h" as the only difference. As with mogatsaka versus mogatšaka, such examples are less problematic; a single lemma paradigm will suffice, e.g. ntatemogolo[Sep, Set], ntatemoholo[Ses]. As for (b), consider Sepedi and Setswana morwa versus Sesotho mora 'son'. The lexicographer, with the abilities of the target users in mind, must decide whether a user looking for mora will find it under morwa. In the case of 'daughter' in the Sotho languages both (a) and (b) apply, i.e. Sepedi morwedi differs from Setswana morwadi in respect of one letter, and Sesotho moradi differs from Setswana morwadi in respect of one omitted letter. The difference between Sepedi and Sesotho, however, comprises both (a) and (b), and the question is whether the user looking for the lemma moradi will find it under morwedi. Given the alphabetical remoteness of "a" in moradi from "w" in morwedi (almost at opposite ends in an alphabetical stretch in the dictionary), it could be argued that they should both be lemmatised with cross-reference from the untreated lemma to the treated lemma(s).
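The distinction between (a) and (b) can be sketched as a simple comparison of two forms. The function name and the returned labels are illustrative only.

```python
def single_letter_difference(a: str, b: str):
    """Return 'substitution' for type (a), 'insertion/omission' for
    type (b), or None if the forms differ in more than one letter."""
    if len(a) == len(b):
        mismatches = sum(x != y for x, y in zip(a, b))
        return "substitution" if mismatches == 1 else None
    if abs(len(a) - len(b)) == 1:
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        # Try omitting each letter of the longer form in turn.
        for i in range(len(longer)):
            if longer[:i] + longer[i + 1:] == shorter:
                return "insertion/omission"
    return None


print(single_letter_difference("ntatemogolo", "ntatemoholo"))
print(single_letter_difference("morwa", "mora"))
print(single_letter_difference("morwedi", "moradi"))
```

The pair morwedi and moradi yields None because it combines a substitution with an omission, which is precisely the case in which separate lemmatisation with cross-reference has to be considered.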
Since an alphabetical ordering is followed, the degree of similarity or likelihood of recognition as cognates is influenced by the position inside the word where the differences occur, i.e. at the beginning, middle or end of the word. Spelling differences at the end or even in the middle of words are less problematic, e.g. ntatemogolo versus ntatemoholo, but differences in the first few letters pose a greater risk of the user not finding the lemma, e.g. moradi versus morwedi, where the difference lies within the first four letters. Sepedi ramogolo and Setswana rremogolo also only differ in one instance of (a) and of (b), but although ramogolo and rremogolo are relatively easily recognisable as cognates when seen together, the user who wants to look up rremogolo will probably not see ramogolo because ra- is alphabetically remote from rre-.
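The position at which two forms start to diverge is easily computed. In the sketch below the function name is illustrative and positions are counted from zero.

```python
def first_difference(a: str, b: str):
    """Return the zero-based index of the first differing letter,
    or None if the two forms are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One form is a prefix of the other, or they are identical.
    return None if len(a) == len(b) else min(len(a), len(b))


for pair in [("ntatemogolo", "ntatemoholo"),
             ("moradi", "morwedi"),
             ("ramogolo", "rremogolo")]:
    print(pair, first_difference(*pair))
```

The lower the index, the further apart the two forms are likely to be in an alphabetically ordered lemmalist.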
For the lexicographer the extent of utilisation of cross-references is in the first place a measurable one. The norm followed in ANNA (Martin 2012b: 419) is that only members of a specific lemma paradigm which are alphabetically more than seven positions away from the lemma paradigm where treatment is given must be cross-referred to the lemma paradigm. The number of such cross-references represents a redundancy factor against the success of the amalgamated approach because additional dictionary space is utilised for such lemmas. Formulated differently, the more lemmas required to be entered separately from their lemma paradigms, the less successful the amalgamated approach will be, because the ideal is to have a single lemma paradigm for each term for all three languages.
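The seven-position norm lends itself to a mechanical check. The sketch below rests on an assumption: "positions" is read as the distance between two entries in the alphabetically sorted lemmalist, which may not match the exact counting used in ANNA. The function name and the placeholder lemmas are hypothetical.

```python
def needs_cross_reference(variant, treated, lemmalist, max_distance=7):
    """True if `variant` lies more than `max_distance` positions away
    from `treated` in the sorted lemmalist (assumed reading of the norm)."""
    ordered = sorted(set(lemmalist) | {variant, treated})
    distance = abs(ordered.index(variant) - ordered.index(treated))
    return distance > max_distance


# Placeholder lemmas standing in for the entries between ra- and rre-.
fillers = [f"rb{i:02d}" for i in range(10)]

print(needs_cross_reference("rremogolo", "ramogolo", fillers))
print(needs_cross_reference("ntatemoholo", "ntatemogolo", fillers))
```

Python's default string ordering is by code point, so letters with diacritics such as š sort after z; a real lemmalist would need the dictionary's own alphabetisation rules.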
Cross-referencing is, however, also intuitive in the sense that it depends on the presumed user's ability to find the lemma. In the case of ntatemogolo versus ntatemoholo it can be assumed that even the less knowledgeable user will be able to find the lemma, but in cases such as ramogolo versus rremogolo the less sophisticated user should be assisted by including rremogolo as a lemma with a cross-reference to ramogolo.

Microstructural considerations
For short articles, e.g. those consisting of little more than translation equivalents, such as for mme, ntate and malome given above, the success of an amalgamated approach is obvious. The question, however, is whether an amalgamated approach is still viable for longer articles.
Returning to the paradigm for kgaetšedi compiled above, consider the articles given for Sepedi (GNSW), Setswana (SESD) and Sesotho (SSED).

The lemma paradigm in this example as well as its relatively short article has a high information density which can be paraphrased as follows. First, in terms of comment on form, kgaetšedi, kgaitšedi [Sep]; kgaitsadi [Set]; kgaitsedi [Ses] account for, compare and contrast the most frequently used terms for all three of the Sotho languages. This is indicated by the clear, functional and space-saving convention [Sep], [Set] and [Ses] in subscript. Secondly, noun class indication is given by a compact but clear convention. The boldfaced number indicates the class to which the lemmas belong, and 2b and 10 the classes in which the plural forms occur. As for comment on semantics, the fact that the lemma can refer to a brother or a sister depending on the gender of the speaker is important and it is neatly explained by the brief contextual guidance given in brackets. Finally, the proverb given as an example of usage is well-selected because it is used in all three languages. The fact that the example is given in only one of the Sotho languages will not be problematic to the target user in this case because the forms are very similar in the other two languages. There is thus no need to indicate the languages or to attempt to give an example for each of the languages, which saves dictionary space, also in terms of examples.

Consider also the suggested articles for great grandfather and rakgolokhukhu:

The treatment of the lemma great grandfather indicates that Sepedi uses the term rakgolokhukhu while Setswana uses rremogolo and ntatemogolo and that Sesotho also has the latter term with minor spelling variation, i.e. ntatemoholo. This is an example where one of the three Sotho languages employs a unique term for a specific relationship while the other two use different terms. The user wants to find the meaning of rakgolokhukhu and looks it up under R in the dictionary. The treatment indicates that it is a Sepedi word in class 1a with plural form in class 2b and that the English translation equivalent is great grandfather. It also informs him/her of the alternative mmelega rakgolo and its literal meaning. Finally, in the spirit of the amalgamated approach, i.e. to highlight similarities and differences, an explicit cross-reference by means of a reference marker is given to the reference addresses for the Setswana and Sesotho terms rremogolo/ntatemogolo, ntatemoholo in the dictionary where more information can be found.

Using the convention for lemmatisation of kinship terms in an amalgamated approach
Prinsloo (2012) adapted the original ga/sa/se convention (Prinsloo and Gouws 1996) for the reduction of lemma paradigms for kinship terms in Sepedi. He indicated how a complicated set of derivations of malome 'uncle' such as malomeago 'your uncle', malomeagwe 'his/her uncle', bomalomeabona 'their uncles', etc. as well as a set of phrases involving malome could be reduced to a single lexicographic convention, i.e. bo/mma/mogatša ~ ago/agwe. In an amalgamated approach for the Sotho languages the question is whether three sets of complex derivations totalling more than 50 options could still be handled by a single convention, taking the equivalents for brother/sister as a case in point. Such a convention requires detailed explanation in the user's guide of the dictionary, as has been done for the original ga/sa/se convention in POP.
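How a compact convention stands for a whole set of derived forms can be illustrated by expanding it mechanically. The sketch below is a simplification and not Prinsloo's convention itself: it covers only the optional plural prefix bo- and the possessive endings ago and agwe, it leaves out the mma/mogatša elements, and the function name `expand` is hypothetical. Mechanical expansion may also generate forms that are not attested in the corpora.

```python
from itertools import product


def expand(stem, prefixes=("", "bo"), suffixes=("ago", "agwe")):
    """Generate derived forms by combining each optional prefix
    with each possessive ending around the stem."""
    return [prefix + stem + suffix
            for prefix, suffix in product(prefixes, suffixes)]


for form in expand("malome"):
    print(form)
```

Each additional prefix or ending multiplies the number of forms covered, which is why the convention saves space but also why it must stay small enough to remain user-friendly.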

Conclusion
The lemmatisation and treatment of kinship terms for a bi-directional dictionary bridging English and the Sotho languages in an amalgamated approach pose great challenges to the lexicographer on both the macrostructural and microstructural levels.
On the macrostructural level the first step will be to gather all single-word basic terms, derived terms and phrases expressing kinship relations for all three Sotho languages and for English. The aim should be to compile a user-friendly amalgamated lemmalist, and that requires, among other things, insight into and consideration of the presumed knowledge and dictionary-using skills of the target user. Against this background of the user perspective the lexicographer should find a sound balance between the compilation of a lemma paradigm covering all three Sotho languages versus separate lemmas, and utilisation of the mediostructure. It is a matter of combating redundancy, i.e. of using less dictionary space for the lemmalist as long as user-friendliness in terms of the skills of the target user is not compromised. It has been argued in detail that the key consideration for the compilation of an amalgamated lemmalist is the degree/extent of the difference in form.
On the microstructural level the aim should be to achieve a high text density which is still user-friendly and which clearly brings out differences and similarities between the amalgamated languages. Depending on the size of the dictionary, more or less comment on form and semantics could be given, i.e. longer or shorter articles, as long as the information is well-balanced between the languages.
Where the reduction of huge quantities of terms, e.g. all derived forms of a specific term in all three Sotho languages, is at stake, a lexicographic convention such as the adapted ga/sa/se convention could be used to combat redundancy and to resolve the impossibility of lemmatising all the relevant forms. Care should, however, be taken that the convention remains user-friendly, i.e. that it does not attempt to include too many derivations.
The compilation of amalgamated dictionaries has great potential for African languages and the foundation laid by Martin's design and the publication of ANNA is a source of inspiration to apply the model to closely related languages such as the Sotho and Nguni languages.
that occur in the Pretoria Setswana Corpus (PSETC) and in one or more of five randomly selected Setswana dictionaries are given in table 2. Setswana and English-Sesotho dictionaries. Once again a 67% reduction is possible because the English lemmalist of kinship terms is presented only once. The challenge, however, is the compilation of the lemmalist on the Sotho side where the model requires amalgamation of the three separate lemmalists for Sepedi, Setswana and Sesotho into a single lemmalist. The five types of cognates identified in ANNA have different implications for the model and the first two are considered here. As briefly stated above, absolute cognates are the most beneficial because a single lemma can represent all three languages. Consider also the following frequently used and identical absolute cognates in the three Sotho languages.