A Critical Evaluation of the Paradigm Approach in Sepedi Lemmatisation — The Groot Noord-Sotho Woordeboek as a Case in Point

This article gives a critical evaluation of the paradigm approach of the Groot Noord-Sotho Woordeboek to the lemmatisation of verbs and nouns derived from verbs. The verb stem -roba 'break', with its complicated system of derivations, is taken as a case in point. The paradigm presented for -roba is evaluated in terms of structure, occurrence in Sepedi corpora and dictionaries, actual use by mother-tongue speakers, user-friendliness, contextualisation versus decontextualisation in relation to the cross-referencing system, and space utilisation. Bringing together and lexicographically treating all these forms for a single verb is surely a lexicographic achievement. The questions, however, are to what extent such an approach is useful in respect of forms likely to be looked up by dictionary users, whether all of these forms actually exist, how user-friendly the approach and presentation are, whether comment on semantics is sufficient and consistent, and whether such a lumping approach actually saves space compared with entering derivations as main lemmas in a splitting approach.

Opsomming (translated from the Afrikaans): This article gives a critical evaluation of the paradigm approach of the Groot Noord-Sotho Woordeboek to the lemmatisation of verbs and nouns derived from verbs. The verb stem -roba 'break', with its complex system of derivations, is taken as an example. The paradigm presented for -roba is evaluated in terms of structure, actual use by mother-tongue speakers, occurrence in Sepedi corpora, user-friendliness, contextualisation versus decontextualisation in respect of the cross-referencing system, and space utilisation. Bringing together and lexicographically treating all these forms for a single …

1. Introduction

The aim of this article is to give a critical evaluation of the paradigm approach to the lemmatisation of verbs, and of nouns derived from verbs (deverbatives), in the Groot Noord-Sotho Woordeboek/Comprehensive Northern Sotho Dictionary/Pukuntšu ya Sesotho sa Leboa (Ziervogel and Mokgokong 1975). The complicated verb stem -roba 'break' will be taken as a case in point. The paradigm presented for -roba will be evaluated in terms of (a) structure, (b) real-life use as reflected by occurrence in Sepedi (also referred to as Northern Sotho or Sesotho sa Leboa) corpora and dictionaries, as well as actual use by mother-tongue speakers, (c) user-friendliness of the paradigm approach in respect of lumping versus splitting, (d) contextualisation versus decontextualisation in relation to the cross-referencing system and (e) space utilisation. As a prerequisite, a brief theoretical background on lemmatisation approaches, traditions and strategies will be presented, with special emphasis on the paradigm approach.

2. A brief theoretical background on lemmatisation approaches, traditions and strategies

Prinsloo (2009) distinguishes five aspects of importance for lemmatisation in African languages, given in table 1. These aspects are discussed in more detail for nouns in Prinsloo and De Schryver (1999) and for verbs in Prinsloo (1994).
The Sepedi lexicographer has to deal with all of the aspects and subcategories in A to E in table 1. As far as A is concerned, the traditional way to compile dictionaries, especially in the pre-corpus era, was for the lexicographer to select lemmas on the basis of intuition or introspection. The advent of corpora enabled lexicographers to use frequency counts of words in a corpus as a major criterion for the inclusion or omission of lemmas. The paradigm approach can be described as an attempt to physically include all derivations, especially of verbs, in the dictionary. This is the approach of the Groot Noord-Sotho Woordeboek (GNSW), which will be outlined and evaluated in detail in this article. The rule-orientated approach stands in contrast to the paradigm approach in that it strives to reduce the number of lemmas presented for a specific paradigm to the absolute minimum. So, for example, only singular forms of nouns are lemmatised, and only stem forms of verbs (i.e. root + -a), without any extensions, are included as lemmas.

The orthography of the language (B) plays an important role in the choice of the lexicographic tradition (C). A disjunctively written language such as Sepedi writes the phrase "I love you" as four orthographic words, i.e. ke a go rata, whereas a conjunctively written language such as isiZulu writes it as a single orthographic word, i.e. ngiyakuthanda. Both have exactly the same structure, i.e. subject concord + present tense marker + object concord + verb stem.
Disjunctively written languages such as Sepedi favour the word tradition, i.e. lemmatising nouns with their prefixes, while the stem tradition is mostly chosen by lexicographers for conjunctively written languages. So, for example, monna 'man' is lemmatised on its full form under M according to the word tradition, but on its stem -nna under N in a stem dictionary such as GNSW.
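The difference between the two traditions can be sketched as a choice of lookup key. The following is a minimal illustration only, not a description of how any of the dictionaries discussed here were compiled: the Noun class and the two key functions are hypothetical names, and the division of monna into prefix and stem is supplied by hand rather than computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Noun:
    prefix: str  # class prefix, supplied by hand (hypothetical field)
    stem: str    # noun stem without the prefix

    @property
    def full_form(self) -> str:
        return self.prefix + self.stem


def word_tradition_key(noun: Noun) -> str:
    """Word tradition: the lemma is the full orthographic word."""
    return noun.full_form


def stem_tradition_key(noun: Noun) -> str:
    """Stem tradition: the lemma is the stem; the prefix is ignored."""
    return noun.stem


monna = Noun(prefix="mo", stem="nna")
# Alphabetical stretch in which the user has to look under each tradition.
print(word_tradition_key(monna)[0].upper())  # M
print(stem_tradition_key(monna)[0].upper())  # N
```

Under the stem tradition the user must be able to strip the prefix before looking the word up, which is the step that the word tradition spares the user.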

3. The paradigm approach
In the lexicographic literature the terms lumping and splitting are mostly used in relation to the presentation of the different senses of a word. In this article lumping versus splitting will be used in a grammatical sense, i.e. grouping the different derivations of a specific verbal stem under a single lemma versus presenting each of the derivations as a main lemma. This contrasts the so-called paradigm approach, following the stem lemmatisation of GNSW, with the traditional word lemmatisation approach of the Pukuntšu dictionaries (PUKU1 and PUKU2). GNSW lemmatises a verb under its stem form, and all derivations of the verb, including deverbatives, are then lumped together. The different forms are also lemmatised separately as untreated lemmas with an implicit cross-reference to the main verb stem.
In the paradigm approach in GNSW the basic micro-architecture of an article is designed in terms of a modular layout aimed at bringing together all derivations of e.g. a verb stem.So, for example, the article of the lemma ROBA in GNSW consists of 32 modules distinguished on the basis of derived forms by suffixes and combinations of suffixes.
In module 1 the lemma is the basic stem (root -rob- plus the terminative -a) without any suffixes. The stem is repeated, followed by the perfect, passive and perfect plus passive forms. Prinsloo and De Schryver (1999) refer to the latter three as "standard modifications". Modules 2-32 give the root plus a suffix or combination of suffixes, with the standard modifications. For example, ROBELANA in module 21 consists of the root plus the applicative suffix (-el-) plus the reciprocal (-an-) plus the verbal ending, followed by the perfect form -rôbêlane, the passive -rôbêlanwa, and the perfect plus passive -rôbêlanwe.
The module layout includes comments on form and on semantics, mainly giving translation equivalents in Afrikaans and English as well as examples of usage and deverbatives.
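The composition of a module's sublemma from root, extensions and verbal ending can be sketched as plain concatenation. This is an illustration of the description above and nothing more: the function name and its parameters are assumptions, only the segmentation of ROBA and ROBELANA is taken from the text, and the standard modifications (perfect, passive) are deliberately not generated, since they involve sound changes that concatenation does not capture.

```python
def build_stem(root, extensions=(), ending="a"):
    """Join a verb root, zero or more extension suffixes and the verbal ending."""
    return root + "".join(extensions) + ending


# Module 1: basic stem, root -rob- plus the terminative -a.
print(build_stem("rob"))                # roba
# Module 21: root + applicative -el- + reciprocal -an- + verbal ending.
print(build_stem("rob", ("el", "an")))  # robelana
```

A rule of this kind produces a form for any combination of suffixes it is given, whether or not that form is attested in the language, which is the concern taken up in the next section.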

4. Usage versus presumed usage and existence of words in the language

Ziervogel (1965) says that the basic meaning of a word lies in its root, e.g. for -roba in -ROB-, and that by adding a series of prefixes and suffixes the root can obtain a variety of senses that have to do with the basic meaning.

"Although the root seldom has an independent use in the language, it always denotes a concept … By adding a series of prefixes and/or suffixes the root can acquire a variety of concepts that have to do with the basic meaning. The prefix and suffix do have semantic content, but not necessarily an equivalent meaning in Afrikaans." (Ziervogel 1965: 47, translated from the Afrikaans)

For -roba this means that -ROB- is the ideal point of departure for building a paradigm of derivations by means of affixes in order to reflect the variety of different meanings. In Ziervogel's view this also means that stem lemmatisation is the ideal lemmatisation strategy, e.g. for a systematised representation of word formation. Ziervogel (1965: 45) claims that:

"Entries must be arranged under their stems with cross-references where necessary. This method is scientifically sound. A systematized survey of word formation in the languages is given; it shows word and lexical relationship and prevents repetition." (Van Wyk 1995: 85)

Van Wyk's severe and detailed criticism of the GNSW's approach focuses on the deficiencies of employing a stem lemmatisation strategy instead of a word lemmatisation strategy for a disjunctively written language such as Sepedi. He rejects Ziervogel's claims that stem lemmatisation is scientifically more sound than word lemmatisation and that it prevents repetition. Of special importance to this article is Van Wyk's statement that it is the task of a grammar book, not of a dictionary, to give a systematic survey of word formation.
In this article the focus is on presumed aspects of user-unfriendliness in relation to problematic aspects of the presentation and especially the selection of lemmas.
As for the selection of lemmas, Ziervogel acknowledges that inclusion versus omission of lemmas is important and problematic, and suggests that the written language should be the point of departure for an effort to include all written forms.

"The problem of what should be included in a dictionary is rather important … I believe one must begin with the written language and include all written words." (Ziervogel 1965: 50, translated from the Afrikaans)

He continues that for a comprehensive dictionary it is important to document the derivations, but that the question is to what extent reduplications (repetition of a word with added affixes) and reflexives (i.e. doing something to oneself) should be included.

"For a reasonably complete dictionary it is nevertheless important to record which derivations can be made. The question, of course, is to what extent derivations such as reduplication and reflexives should be included." (Ziervogel 1965: 52, translated from the Afrikaans)

This brings us to the core of the issue, i.e. what the duty of the lexicographer is in terms of what to include in and what to omit from the dictionary. Gouws and Prinsloo (2005: 86) state that in a general dictionary (with a text reception function) "the user should be able to find the words encountered in the day to day general language usage …". The lexicographer should include a selection from the lexical stock of the language. It should not be limited to words found in written texts but should also include words from the spoken language. The question, however, is whether the lexicographer should invent words, e.g. derivations that are theoretically possible in terms of the grammatical rules of the language. In the case of -roba one would have to ask whether all of the derivations given in GNSW are really in use in the language, and how likely they are to actually be looked up by the target users of the dictionary.
In order to determine how likely the different derivations of -roba are to be looked up, as well as their actual use in the language, (a) their occurrence in the Pretoria Sepedi Corpus (PSC) was determined, (b) the treatment of -roba in Sepedi dictionaries was studied and (c) two mother-tongue speakers of Sepedi were requested to indicate which of these forms they know.
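Step (a) amounts to a frequency lookup of each candidate form in a tokenised corpus. A minimal sketch follows, under clearly stated assumptions: the token list is an invented placeholder and not data from the PSC, the function name is hypothetical, and the candidate forms are simply forms mentioned in this article.

```python
from collections import Counter


def attestation(candidates, corpus_tokens):
    """Return {form: frequency} for each candidate form in the corpus."""
    counts = Counter(token.lower() for token in corpus_tokens)
    return {form: counts[form] for form in candidates}


# Invented placeholder tokens, NOT drawn from the Pretoria Sepedi Corpus.
toy_tokens = ["roba", "Roba", "robagana", "seroba", "roba"]
candidates = ["roba", "robagana", "robelana", "robišana"]

freq = attestation(candidates, toy_tokens)
unattested = [form for form, n in freq.items() if n == 0]
print(freq)
print(unattested)
```

A frequency of zero in one corpus does not prove that a form does not exist, which is presumably why the corpus evidence is combined with dictionary treatment and speaker judgements in (b) and (c).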
The question is thus whether most of the words given by GNSW for -roba actually exist in the language, or whether the compilers mainly focused on completing morphological or grammatical paradigms. Does the task of the lexicographer go beyond the lemmatisation and treatment of words actually used in a language to those that can potentially exist, e.g. because they are possible through morphological reduplication rules or might come into use as new inventions? As for the latter, it is true that one can never claim that a specific reduplication or combination will never come into use. For example, the author once objected to the inclusion of deurgans as a noun 'door goose' in a spelling checker lexicon for Afrikaans, only to find that goose door stops exist in English, cf. images at http://vintagepatterns.wikia.com/wiki/Patch_Press_379. It is therefore possible that deurgans could become a common word in Afrikaans. As a second example, a few years ago it would have been unlikely for a nominal like(s) to be included as a lemma in a dictionary, but today it is commonly used on websites, e.g. 34 likes.
"The English language is notoriously fast in adapting to the changing world. New words enter English from every area of life where they represent and describe the changes and developments that take place from day to day. Here are some words and expressions that have been coined in recent years. Some can be found in official dictionaries; others may never make their way there, but new words will continue to appear as the English language adapts to innovations and trends." (http://www.learn-english-today.com/new-words/new-words-in-english.html)

Consider the following examples listed there: breadcrumbing (a navigation technique which helps users by displaying a list of links to the pages they have visited when exploring a website), copyleft (opposite of copyright … allows freedom of use for all), crowdfunding (raising money for a project by getting a large number of people to make a small financial contribution), cyberbully (a person who uses the Internet to harm another person), textspeak (language used in text messages), etc.

Be that as it may, it is not the task of the lexicographer to provide in a dictionary for the possible future use or existence of words. Lemma selection should not be influenced by words that the lexicographer would like to see as part of the language, or to become part of it. In terms of Wells (1973), Hartmann (1983) and Gove (1961), the duty of the lexicographer is to record language and to include in the dictionary words which are actually used by the speakers of the language.

"The responsibility of a dictionary is to record the language, not set its style … The only area in which the truth may be found is actual usage. In fine, the function of a dictionary is to reflect the facts of usage as they exist. A dictionary neither permits nor prevents." (Wells 1973: 84)

"Lexemes become entries in a dictionary only when they are socialised, that is when they are used by a sufficient number of speakers." (Hartmann 1983: 71)

"The basic aim is nothing less than coverage of the current vocabulary of standard spoken and written English." (Gove 1961: 4a)

The lexicographer's attention should be limited to the treatment of existing words in the lexicon, especially given that it is hardly possible to cover the existing words even in a comprehensive multivolume dictionary. Currently available corpora, which reflect the actual use of words and indicate their frequency of occurrence, are the ideal sources to guide the lexicographer in the selection of lemmas.

5. User-friendliness of the paradigm approach

Gouws and Prinsloo (2005: 39) emphasize the importance of the user-perspective:

"The user-perspective, so prevalent in modern-day metalexicography, compels lexicographers to compile their dictionaries according to the needs and research skills of well-defined target user groups. The dominant role of the user has had a definite effect on the compilation of dictionaries as well as on the evaluation of their quality. Good dictionaries do not only display a linguistically sound treatment of a specific selection of lexical items. Good dictionaries are products that can be used as linguistic instruments by their respective target user groups. The better they can be used, the better dictionaries they are."
Bothma and Prinsloo (2013) emphasize that the user may not want to read or browse through a long article containing much information that is irrelevant to his/her specific information need at a given time. In most cases (s)he only requires the information needed to solve the current information need. The lexicographer should therefore guard against offering excessive information and rather guide the user more directly to the required information. Haas' remark of five decades ago still holds true:

"A good dictionary is one in which you can find the information you are looking for – preferably in the very first place you look." (Haas 1962: 48)

Consider in this regard Prinsloo et al. (2011), where users are guided through decision trees directly to the required information.
An approach to lump information together in long dictionary articles as for ROBA in GNSW runs against the desire to quickly and directly find the information that the user is looking for at a given time.

Evaluation of GNSW in terms of user-friendliness
The GNSW is generally regarded as user-unfriendly. Prinsloo and De Schryver (1999: 258) state that the user-perspective was not seriously considered in the compilation. Their main criticism in this regard is directed against the phonemic sorting of lemmas and against stem lemmatisation. As for the sorting order, the compilers of GNSW deviate from an ordinary alphabetical sorting of the entries and utilize a phonemic one, namely: A, B, BJ, D, E, F, FS, FŠ, G, H, HL, I, J, K, KG, KH, L, M, N, NG, NX, NY, O, P, PH, etc., because this is in their opinion 'more scientific'. To the user it is nothing more than sheer frustration to eventually find, for example, a word commencing on bj alphabetically after bu in the dictionary (Prinsloo and De Schryver 1999: 261).
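The effect of such a phonemic order on lookup can be made concrete with a small sort key. This is a sketch under stated assumptions, not a reconstruction of GNSW's actual collation: the unit list is only the portion quoted above (the source ends in "etc."), longest-match segmentation and the fallback for unlisted letters are assumptions, the function name is hypothetical, and the three words are used purely as illustrative strings.

```python
# Sorting units in the order quoted above; the quoted list stops at PH.
GNSW_UNITS = ["a", "b", "bj", "d", "e", "f", "fs", "fš", "g", "h", "hl",
              "i", "j", "k", "kg", "kh", "l", "m", "n", "ng", "nx", "ny",
              "o", "p", "ph"]
RANK = {unit: i for i, unit in enumerate(GNSW_UNITS)}
MAX_LEN = max(len(unit) for unit in GNSW_UNITS)


def phonemic_key(word):
    """Segment a word into sorting units (longest match first) and rank them."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        for size in range(MAX_LEN, 0, -1):
            chunk = word[i:i + size]
            if chunk in RANK:
                key.append((0, RANK[chunk]))
                i += len(chunk)
                break
        else:
            # Assumption: letters missing from the quoted list sort after
            # all listed units, by code point.
            key.append((1, ord(word[i])))
            i += 1
    return key


words = ["bjala", "bula", "bala"]
print(sorted(words))                    # ['bala', 'bjala', 'bula']
print(sorted(words, key=phonemic_key))  # ['bala', 'bula', 'bjala']
```

Because BJ is treated as a unit that follows every word beginning with plain B, a user who searches alphabetically between ba and bu will not find the bj words there.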
The layout of the complex article of -roba as given in the appendix is user-unfriendly in many ways. First, it is very long. Secondly, although the derivations are alphabetically ordered as sublemmas, they are presented in a run-on layout, which makes it difficult to detect them as the starting points of most of the 32 modules. Thirdly, the use of capital letters to mark them is compromised by the use of the same convention to indicate the derivation from which a specific sublemma was derived. This way of indicating the source of derivation in a run-on layout obscures the capitalised starting points of the modules, making it more difficult for the user to find the sublemma. Starting each of the 32 modules on a new line would have substantially increased the user-friendliness of the layout.

Inadequate comment on semantics
The predicament of the user, however, does not end with the difficulty of locating the specific derivation whose meaning (s)he wants to find. In most cases (s)he will find the specific sublemma with its presumed standard modifications neatly spelled out, but without any comment on semantics. Actual comment on semantics in the article of ROBA is very limited, especially in relation to the length of the article. So, for example, no comment on semantics is given for the entire stretch of modules 24-32, i.e. ROBIŠANA to ROBOKANYETŠANA. This reflects a serious imbalance between comment on form and comment on semantics, which is detrimental to the main reason for looking up words in a dictionary, i.e. to find their meaning. Gouws and Prinsloo (2005: 48) refer to the "main assignment" of linguistic dictionaries, "i.e. to give an explanation of the meaning of the lemma in monolingual dictionaries and to provide target language translation equivalents for a source language lemma in bilingual and multilingual dictionaries". It could be argued that the compilers of GNSW were so intent on including all possible derived forms that comment on semantics was neglected. Prinsloo and De Schryver (1999: 261) call it an 'enter-them-all syndrome'.
In the article of phefa the compilers apparently concentrated so hard on completing the modular paradigms that they 'forgot' to give any translation equivalents in Afrikaans and English for the entire article.

Efficiency of the medio-structure
The lumping approach in GNSW also reduces the effectiveness of the mediostructure (the system of cross-referencing), which is crucial in a lumping approach, i.e. to guide users from a reference position outside the article, where the derivation was lemmatised in the alphabetical stretch, to the reference address inside the main article where the derivation in question is treated. Gouws and Prinsloo (2005: 181) state that one of the important functions of the mediostructure of a dictionary is to combat the decontextualisation brought about by alphabetical ordering. Put simply, alphabetical ordering of lemmas in a dictionary has the detrimental effect of decontextualising words that belong together. By way of comparison, words indicating fruit such as apple, pear, banana and orange belong together but are scattered over the dictionary because they belong to different alphabetical stretches. Dictionaries consequently attempt to combat such decontextualisation, e.g. by means of a colour plate for fruit given in the back matter or at another reference address in the dictionary. In principle the same holds true for what could be termed grammatical decontextualisation, in the sense that different derivations of e.g. -roba, such as ithoba, seroba and diroba, will be scattered alphabetically over the dictionary. For -roba this would mean lemmatisation of all derivations in their appropriate alphabetical positions (reference positions), thus decontextualised, to be contextualised by cross-reference to the main lemma -roba and its treatment.
The lumping approach certainly brings all these derivations together so that they can be treated together and studied as a grammatical set. Contextualisation is further supported by GNSW lemmatising derived forms separately with implicit reference to ROBA. The article of ROBA in GNSW is followed by no fewer than 51 derivations entered as untreated lemmas cross-referenced to -roba:

robagana v. ROBA
robaganedi, se-/di- v. ROBA
robaganelo bo- v. ROBA
(See the appendix for the complete list.)

The value of such cross-references in a stem dictionary, for derived forms in which all the affixes are suffixes, is questionable, because they all end up alphabetically directly after the article of the lemmatised and treated stem, i.e. ROBA in this case. It does not help the user much if (s)he looks up, for example, robagana only to be referred to the article of ROBA directly above, where (s)he has to work down through the entire user-unfriendly article layout anyway. A more precise reference address within the article of ROBA, i.e. robagana, would have been more helpful.
In terms of space utilisation, and especially Ziervogel's (1965: 45) claim that the paradigm lumping approach prevents repetition, Van Wyk (1995: 88, 91-2) has shown in a critical review of this dictionary that in following this approach the compilers did not manage to avoid repetition. In his view they introduced redundancy by having to resort to unnecessary cross-referencing.

"This brings no gain in economy compared with word dictionaries. The number of entries is the same for both types, the only difference being the structure and the alphabetic classification of the entries." (Van Wyk 1995: 88)

It also results in overuse of the mediostructure.
Should the lexicographer really wish to include entire paradigms of verbal derivations, a splitting approach would be more user-friendly: modules 1-32 would be given as main lemmas, each with its own treatment, and would naturally be grouped together alphabetically anyway. There would thus be no loss in economy, and because the lemmas would be in close alphabetical proximity, morphological relations would to a large extent remain visible and cross-referencing would be limited.
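That suffixal derivations stay together under ordinary alphabetical sorting, while prefixed forms are scattered, can be checked with a short sketch. The word list is a small hand-picked sample of forms mentioned in this article, plain code-point sorting stands in for alphabetical order, and the helper name is hypothetical.

```python
def is_contiguous(sorted_words, members):
    """True if all members occupy adjacent positions in sorted_words."""
    positions = sorted(sorted_words.index(word) for word in members)
    first = positions[0]
    return positions == list(range(first, first + len(positions)))


lemmas = sorted(["seroba", "robelana", "monna", "roba", "robagana", "diroba"])
suffixal = ["roba", "robagana", "robelana"]

print(lemmas)
print(is_contiguous(lemmas, suffixal))               # True
print(is_contiguous(lemmas, suffixal + ["diroba"]))  # False
```

In this sample the stem and its suffixal derivations form one block without any cross-referencing, whereas the prefixed form diroba lands in a different alphabetical stretch and is the kind of lemma for which a cross-reference would still be needed.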

Conclusion
The GNSW is the most comprehensive dictionary ever compiled for Sepedi and as such remains an invaluable reference source even after four decades; it is a monument for the language. The GNSW scores high marks as a grammar reference source.
Viewed from many other angles, however, GNSW is less effective as a dictionary, especially in respect of lemma selection, user-friendliness and comment on semantics. Initial criticism by sources such as Van Wyk (1995) and Prinsloo and De Schryver (1999) was aimed at detrimental aspects of the alphabetical ordering and the lemmatisation approach. They concluded, among other things, that stem lemmatisation is the wrong option for a disjunctively written language and that a phonemic ordering is highly problematic from a user perspective.
In this article the selection and presentation of the lemmas were critically evaluated. It is highly unlikely that most of the lemmas will be looked up by target users. The lexicographer should not be creative in the sense of inventing words. (S)he remains a recorder of the language and, in the words of Philip Gove (1961), should not attempt to set its style. (S)he should reflect what is real, the real language as used in print and speech, not that which is possible. Precious dictionary space should rather be used to include more words from the living language than artificially created possible reduplications.
The compilers focused on the completion of grammatical modular paradigms to the extent that the actual existence of most lemmas is questionable, as supported by a limited user study, corpus evidence and treatment in other Sepedi dictionaries. Comment on semantics, the most important information …
Treatment in this module includes four nouns which are derived from -robagana, i.e. morobagani, barobagani, morobagano and merobagano. The entire article of ROBA consists of 265 nominal and verbal forms of roba:

Table 2: Derivations of -roba in Sepedi dictionaries