A Critical Analysis of the Lemmatisation of Nouns and Verbs in isiZulu

This article is a critical evaluation of lemmatisation strategies for nouns and verbs in isiZulu with specific attention to the problem of stem identification. The presumed target users of dictionaries compiled according to these lemmatisation strategies are non-mother-tongue learners of isiZulu. The advantages and disadvantages of lemmatising verbal and nominal stems, verbal and nominal stems without suffixes, and nominal words will be considered mainly in terms of the entire paradigm containing the verbal root -sebenz- from an isiZulu corpus. The conclusion reached is that word lemmatisation is preferred over both stem lemmatisation and lemmatisation of stems without suffixes. It will be argued that the problem of stem identification can only be solved in electronic dictionaries, and the electronic dictionary isiZulu.net will be analysed in this regard.


Introduction
The publication of the first dictionary for isiZulu using a word strategy, instead of the traditional device of stem lemmatisation, reopens the debate on stem versus word lemmatisation in African languages. In particular, the question is whether the problem of stem identification, which proved to be the major stumbling block for learners trying to find lemmas in isiZulu dictionaries, has been solved. To date, most publications on lemmatisation in the African languages contrast disjunctively written languages (e.g., Sepedi, Setswana and Sesotho) with those with a conjunctive orthography (e.g., isiZulu, Siswati and isiXhosa) in order to indicate the advantages and disadvantages of stem as opposed to word lemmatisation. The main argument has been that stem lemmatisation is an accepted, or even the best, strategy for conjunctively written languages, but that word lemmatisation is a better option for disjunctively written languages. The principal reason for this is that stem lemmatisation introduces unnecessary problems for the user of a dictionary of a disjunctively written language, especially with regard to the identification of nominal stems. Nevertheless, the stem tradition, supported by certain assumptions, such as being the more scientific option, gained such momentum that a number of stem dictionaries were compiled for the Sotho languages as well. Word lemmatisation for conjunctively written languages was considered by Van Wyk (1995), and preliminary experiments were conducted at some of the National Lexicography Units in South Africa on the feasibility and possible advantages of word lemmatisation for conjunctively written languages. However, it was only in 2010, with the publication of the Oxford Bilingual School Dictionary: Zulu and English (OZSD), that the almost sacred stem tradition of lemmatisation for an Nguni language was broken by using word lemmatisation for an isiZulu dictionary.
The focus of this article differs from earlier research in three respects. First, the issue of stem identification takes centre stage. Secondly, the advantages and disadvantages of stem versus word lemmatisation are not described in terms of conjunctively versus disjunctively written languages but rather in terms of the advantages and shortcomings of these approaches for the conjunctively written Nguni languages, isiZulu being a case in point. Thirdly, although a selection of examples is offered, the analysis of examples focuses on a paradigm of approximately 2 500 occurrences of different words containing the root -sebenz- 'work' in the Pretoria isiZulu Corpus (PZC).
Thus the main aim of this article is to critically evaluate lemmatisation strategies for nouns and verbs in isiZulu with specific attention to the problem of stem identification. The prime objective is to evaluate lemmatisation strategies rather than isiZulu dictionaries per se. For critical reviews of the two prominent isiZulu dictionaries Isichazamazwi sesiZulu (ISZ) and the OZSD, see Masubelele (2007) and Prinsloo (2010), respectively. It should be borne in mind, however, that the choice of lemmatisation strategy may depend on the type of dictionary being compiled and the probable linguistic ability of its intended users.
A consolidation of the most prominent views on stem versus word lemmatisation, which lie scattered over a number of publications, is also attempted. Finally, the success or potential of electronic dictionaries to solve stem identification problems that cannot be solved in printed dictionaries, irrespective of the lemmatisation strategy, is evaluated.

Word forms of -sebenz- in the PZC and a brief explanation of key terms and concepts
One of the objectives of this article is to study the success of the different lemmatisation strategies on an entire paradigm for a randomly selected word and not only, as has traditionally been done in similar discussions on lemmatisation, by quoting examples in a haphazard way as they fit the author's viewpoint. By taking an entire paradigm of real language use of a word and its derivations as input to the study, strengths and especially weaknesses in the different lemmatisation strategies, which may have been overlooked by an idiosyncratic selection of examples, come to the fore. The paradigm of the verb root -sebenz- has been selected on the basis of its high frequency as a verb (-sebenza plus verbal prefixes 4 907 times, ukusebenza 548 times in the PZC); its frequent occurrence with suffixes, e.g., -sebenzisa (3 373); and also the high frequency of nominal derivations (deverbatives) of -sebenza, e.g., umsebenzi (5 883), emsebenzini (1 456), imisebenzi (1 009), isisebenzi (81) and abasebenzi (174). See the appendix for a list of the forms occurring five times or more in the PZC. The PZC is a raw corpus of approximately six million tokens.
Detailed discussions of the morphological system of isiZulu can be found in grammar books, such as Doke (1945) and in the mini-grammars of dictionaries, such as the Zulu-English Dictionary (ZED) and the OZSD.
In their Zulu-English Dictionary (ZED), Doke and Vilakazi (1948: xxiv-xxv) define stem as "that part of a word depleted of all prefixal inflexions" and root as "the irreducible element of a word; the primitive radical form without prefix, suffix or other inflexion, and not admitting of analysis".
In conjunctively written languages, such as the Nguni languages isiZulu, isiNdebele, Siswati and isiXhosa, most word forms (tokens) contain verbal or nominal roots with affixes (prefixes or suffixes, or both) and they are written as one orthographic word. Consider the examples in (1).
As a prerequisite to subsequent discussion, a brief but more corpus-based analysis is given with the focus on the chosen paradigm of -sebenz-. The conjunctive way of writing consequently results in very long words; the average word length of isiZulu words (tokens) in the PZC is 6.93 characters, cf. (2a). In disjunctively written languages, such as Sepedi, Setswana, Sesotho, Tshivenda and Xitsonga, nouns, verbs, concords, etc. are written as separate orthographic words, e.g., as in (2b). By contrast, the average word length for Sepedi words in the Pretoria Sepedi Corpus (PSC) is a mere 3.88 characters.
(2) a. Angifuni ukusebenza (isiZulu): 2 linguistic words, 2 orthographic words
b. Ga ke rate go šoma (Sepedi): 2 linguistic words, 5 orthographic words
'I do not want to work'
A popular definition of lemmatisation is the selection of a canonical form to represent a specific paradigm. A clear though simplified example is that walk is chosen as lemma to represent the paradigm walk, walks, walked, walking.
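The contrast in orthographic word length illustrated in (2) can be made concrete with a small calculation. The following is only a sketch: the function name is hypothetical, it is run on the two example sentences in (2) rather than on the PZC or PSC, and it simply splits on whitespace, which is an assumption about how tokens are counted.

```python
import unicodedata


def average_word_length(sentence: str) -> float:
    """Mean number of characters per whitespace-separated orthographic word."""
    # Normalise so that a letter such as 's' with caron counts as one character.
    words = unicodedata.normalize("NFC", sentence).split()
    return sum(len(word) for word in words) / len(words)


zulu = "Angifuni ukusebenza"      # example (2a), conjunctive orthography
sepedi = "Ga ke rate go šoma"     # example (2b), disjunctive orthography

print(average_word_length(zulu))
print(average_word_length(sepedi))
```

On these two sentences alone the isiZulu average is far higher than the Sepedi one, which mirrors the direction (not the exact values) of the corpus averages quoted above.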
Stem lemmatisation refers to the selection of the verbal stem -sebenza from verbal forms, such as ukusebenza 'to work, working', usebenza 'he/she works', wawusebenza 'it was working', for instance, as the canonical form for lemmatisation purposes. For nominal forms, the nominal stem -sebenzi is selected for umsebenzi 'work, worker', emsebenzini 'at work', nomsebenzi 'and the work', ngomsebenzi 'with work', etc. Within stem lemmatisation, a distinction is drawn between stem lemmatisation proper and left-expanded stem lemmatisation; both stand in contrast to word lemmatisation.
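The idea of one canonical form standing for a whole paradigm can be pictured as a lookup table. The sketch below is purely illustrative: the table name and helper function are hypothetical, and the table contains only the word forms quoted in this section, not a real dictionary.

```python
# Word form -> stem lemma, using only the forms quoted in the text.
STEM_LEMMAS = {
    "ukusebenza": "-sebenza",
    "usebenza": "-sebenza",
    "wawusebenza": "-sebenza",
    "umsebenzi": "-sebenzi",
    "emsebenzini": "-sebenzi",
    "nomsebenzi": "-sebenzi",
    "ngomsebenzi": "-sebenzi",
}


def paradigm(lemma: str) -> list[str]:
    """Return, in alphabetical order, the listed word forms represented by a lemma."""
    return sorted(form for form, stem in STEM_LEMMAS.items() if stem == lemma)


print(paradigm("-sebenza"))
print(paradigm("-sebenzi"))
```

The table makes the user's task visible: the dictionary stores only the values, while the user starts from one of the keys and must work out which value it belongs to.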

The user perspective
As echoed in many publications, e.g., Hartmann (1989), Gouws and Prinsloo (2005a) and (2005b), contemporary lexicography is dominated by a user-driven approach. Consequently, all arguments in this article depart from the users' needs, and/or their reference skills and ability to find lemmas in isiZulu dictionaries. The target users in mind for this discussion on lemmatisation strategies in isiZulu dictionaries are learners of isiZulu with text production needs as well as the need for text reception of the prescribed books for isiZulu readers.
It may be stated at the outset that the inability of users to identify nominal and verbal stems can impede successful word searches or even result in the total failure to look up words in isiZulu dictionaries successfully. The situation is aggravated by the reality in Africa that users generally lack a dictionary culture and dictionary using skills (cf. Gouws and Prinsloo 2005a: 42).

The stem versus the word tradition in lemmatisation
Bennett (1986), as quoted by De Schryver (2010: 163), rightfully points to the complexity of nouns and verbs in African languages and asserts that stem identification can be problematic.
There has been debate as to the proper arrangement of the Bantu lexicon, and the question is far from settled. The inflection of nominals and verbals by means of prefixes, and the complex and productive derivational system, both characteristic of Bantu languages, pose difficulties [...] If items are alphabetized by prefix [...] a verb will be listed far from its nominal derivations, however transparent these may be. [...] A competing school arranges the lexicon by stem or root; this usefully groups related items, and saves on cross-referencing. Unfortunately, in such a system the user must be able to identify the stem, which given the sometimes complex morphophonemics of Bantu languages may not be easy.
Bennett (1986: 3-4)
Van Wyk (1995) puts the issue of stem versus word lemmatisation in perspective in relation to disjunctively versus conjunctively written languages. Van Wyk (1995: 82) notes that two lexical traditions exist in the African languages in South Africa, i.e., the word tradition and the stem tradition:
According to the word tradition, lemmas are based on complete written words, and there is a one-to-one correspondence between written words and lemmas. According to the stem tradition, lemmas are based on the stems of written words without their prefixes.
Subsequent publications dealing with problematic aspects of lemmatisation in African languages, such as Prinsloo (1994), Prinsloo and Gouws (1996), Prinsloo and De Schryver (1999) and Prinsloo (2009), have departed from Van Wyk's pioneering analysis. The fundamental issues raised in Van Wyk's 1995 study are not discussed in any detail here; only aspects that are relevant for this article are briefly outlined as a basis for the discussion that follows.
Van Wyk (1995) begins by dismissing the claim that the stem tradition is in any way superior to the word tradition. He states:
[...] many lexicographers have come to the erroneous conclusion that only the stem tradition is linguistically justified. Ziervogel [...], for example, claims that it is scientifically sound, and Ziervogel and Mokgokong [...] state categorically that it is the only scientific method. (Van Wyk 1995: 84)
Then, he refutes the claim that stem lemmatisation is more economical, in terms of dictionary space, than word lemmatisation. Thirdly, he highlights the misconception that verbal affixation and nominal affixation are equally productive and therefore necessitate stem lemmatisation for nouns. Finally, he highlights the problems in respect of stem lemmatisation, especially in the case of some class 9 nouns where neither the lexicographer nor the user can identify the stem. For example, stem identification is very problematic in cases such as intaba 'mountain', intombi 'girl', inkosi 'king' and inkabi 'ox'. In order to look these terms up, the uninitiated user would have to know that the stem form of intaba is -ntaba, but that for intombi it is -thombi, for inkosi, -khosi and for inkabi, -nkabi. Mpungose (1998: 65) agrees by saying that the process of lemmatising nouns in classes 9 and 10 is problematic and he refers to the traditional method as lemmatising the "lexical noun by etymological noun-stem". Mtuze (1992: 17), in reference to nominals of Class 9 and Class 10 in isiXhosa, bluntly states:
You never knew how these nominals were lemmatised [...] In some cases, you had to struggle trying to look up words such as ingulube (the wild pig) as the entry could either be under g or under n.
There is, fortunately, no dispute regarding stem versus word lemmatisation in the case of verbs. Followers of both traditions agree that verbs should be lemmatised on their stems. Sources debating the issue, such as Prinsloo (2009), consequently focus on nominal stem identification as the problematic area. However, it should be emphasized that the fact that both traditions agree on stem lemmatisation for verbs does not alleviate the problem of stem identification. It is argued here that stem identification in Nguni languages is as problematic for verbs as it is for nouns: the identification of -sebenza from the numerous verbal forms, or -sebenzi from the nominal forms in the paradigm of -sebenz-, is proof thereof. Of the 31 orthographic forms in the appendix occurring more than 100 times in the PZC, 15 are verbs and 16 are nouns. The challenge to identify the stem is exactly the same for nouns and verbs.
(3) ZED
An advantage of stem lemmatisation is that it is the undisputed option for the lemmatisation of verbs, not only for conjunctively written languages but also for the disjunctively written ones. Van Wyk (1995: 85) states clearly that "except for the notational device of a hyphen [...] the entry for 'see' will [...] be found as bona in word dictionaries and as -bona in stem dictionaries".
For nouns, stem dictionaries normally provide the stem and the possible nominal prefixes in brackets, e.g., -sebenzi (um-, aba-, imi-), and the user can rightfully conclude that the forms are umsebenzi, abasebenzi and imisebenzi. The lexicographer must of course make sure that the possible combinations suggested by this notation, i.e., umsebenzi, abasebenzi and imisebenzi, are correct. An example where this is not the case is the lemmatisation of inkosi/amakhosi 'king(s)' in Woordeboek Afrikaans-Zoeloe, Zoeloe-Afrikaans (WAZ) in (4).
(4) WAZ
-khosi, (in-, ama-), b; 1. koning, regent, hoofman
The user's conclusion is: *inkhosi 'king', amakhosi 'kings', of which the plural form is correct but the singular form is incorrect; it must be inkosi. This is a very serious mistake, since the dictionary should never guide the user to such incorrect conclusions.
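The way a user reads the bracket notation can be sketched as plain concatenation of each prefix with the stem. The function name below is hypothetical, and the set of attested forms contains only the two forms the text itself confirms; the sketch shows why the WAZ entry leads to the wrong singular, not how any dictionary is actually compiled.

```python
def expand_entry(stem: str, prefixes: list[str]) -> list[str]:
    """Join each bracketed prefix to the stem, as a dictionary user would."""
    bare_stem = stem.lstrip("-")
    return [prefix.rstrip("-") + bare_stem for prefix in prefixes]


# Entry where naive concatenation gives the right forms.
print(expand_entry("-sebenzi", ["um-", "aba-", "imi-"]))

# The WAZ entry in (4): naive concatenation yields *inkhosi.
waz_forms = expand_entry("-khosi", ["in-", "ama-"])
attested = {"inkosi", "amakhosi"}  # forms given as correct in the text
misleading = [form for form in waz_forms if form not in attested]
print(misleading)
```

A lexicographer could run a check of this kind over every stem entry against a corpus word list to catch combinations that the notation suggests but that do not occur.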
Van Wyk (1995), however, has fundamental problems with stem lemmatisation for nouns. In his view, the first disadvantage stems from the misconception that nominal and verbal stems are equally productive in combining with prefixes. Van Wyk (1995: 87) quite correctly shows that verb stems may, for instance, productively combine with all the subject concords, object concords, negative morphemes and modal morphemes, 18 × 19 × 6 × 2, which comes to 4 104 possible combinations. Noun stems can only be used with a small number of class prefixes. In the case of nominalizations of -sebenz-, nouns occur in classes 1, 2, 3, 4, 7 and 8 as in (5a). The starred forms in (5a) indicate ungrammatical combinations in terms of the class prefix paradigm for isiZulu. For other nouns, the number of possible combinations can be even smaller, as indicated in (5b).
(5)
-ntu (umu-, aba-, isi-, ubu-, u(lu)-): umuntu 'a human', abantu 'humans', isintu 'Bantu culture', ubuntu 'humaneness', untu 'common people'
-khosi (in- (inkosi), ama-): inkosi 'a king', amakhosi 'kings'
-khaya (i(li)-, ama-, um-): ikhaya 'home', amakhaya 'homes', umkhaya 'members of the family'
-tho (isi-, izi-, in-, izin-, u(lu)-): isitho 'limb', izitho 'limbs', into 'thing', izinto 'things', utho 'something'
-daba (in-, izin-, u(lu)-): indaba 'story', izindaba 'stories', udaba 'a serious affair'
Thus Van Wyk (1995) concludes that there is no linguistic justification for treating nouns and verbs in the same way in terms of stem lemmatisation. Van Wyk's criticism is valid if the view is restricted to the consideration of concords in terms of verb stems and class prefixes in relation to noun stems. If, however, the complex orthographic forms of nouns and verbs are considered, e.g., as for -sebenz- in the appendix, then noun stems and verb stems are on a par in terms of productive combination with affixes, such as the conjunctive na in nomsebenzi (na+umsebenzi) 'and the work', nga in ngomsebenzi (nga+umsebenzi) 'with work' and the possessive sa in somsebenzi (sa+umsebenzi) 'of the work', or with combinations of affixes. The distinction between verbal stem identification and nominal stem identification therefore effectively falls away for the target users. The question could therefore be asked: if the user has to deal with affixation on such a massive scale anyway in his/her effort to find the lemmas for nouns and verbs in the dictionary, why not depart from the stem in all instances for nouns and verbs?
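The figure of 4 104 attributed to Van Wyk above is simply the product of the four slot sizes, and it can be set against the handful of class prefixes listed for the noun stems in (5). The sketch below assumes nothing beyond those numbers; the variable names are hypothetical, and the text does not spell out which factor belongs to which morpheme type, so the factors are left unlabelled.

```python
from math import prod

# The four factors quoted from Van Wyk (1995: 87); their assignment to
# subject concords, object concords, negatives and modals is not stated here.
slot_sizes = [18, 19, 6, 2]
verb_combinations = prod(slot_sizes)

# Class prefixes listed for each noun stem in (5).
noun_prefixes = {
    "-ntu": ["umu-", "aba-", "isi-", "ubu-", "u(lu)-"],
    "-khosi": ["in-", "ama-"],
    "-khaya": ["i(li)-", "ama-", "um-"],
    "-tho": ["isi-", "izi-", "in-", "izin-", "u(lu)-"],
    "-daba": ["in-", "izin-", "u(lu)-"],
}
largest_noun_set = max(len(prefixes) for prefixes in noun_prefixes.values())

print(verb_combinations)
print(largest_noun_set)
```

The gap between the two numbers is Van Wyk's point; the counter-argument in the paragraph above is that it narrows once conjunctive, possessive and other affixes on nouns are counted as well.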
Identifying the stem remains the underlying challenge for the discussion on the following two lemmatisation strategies: stems lemmatised with their prefixes, and word lemmatisation.
An issue of special relevance for a critical analysis of stem and word lemmatisation in the Nguni languages is the lemmatisation of infinitives. In all of the lemmatisation strategies for isiZulu, i.e., stem, left-expanded and even word lemmatisation, verbs are by default lemmatised as stems. Linguists agree that the infinitive has characteristics of both nouns and verbs. Infinitives such as ukuhamba 'to walk, a/the walking', ukusebenza 'work, a/the working' and ukukhuluma 'speak, a/the speaking' are nouns (of class 15) and verbs at the same time. In traditional grammars the infinitive is therefore positioned and formally described within the two major categories of verb and noun. Consider (6a) in contrast to (6b), where the verbal versus nominal meanings of the infinitive are foregrounded.
(6) a. Angithandi ukuhamba ngezinyawo 'I do not like to walk on foot'
b. Ukuhamba kuyakhathaza 'Walking is exhausting'
Unlike the other noun classes, the stem of the infinitive noun is not a nominal, but a verbal stem and, unlike verbs, infinitives contain a noun class prefix (the class prefix of Class 15). In dictionaries following a stem lemmatisation strategy, such as the ZED, all infinitives are lemmatised under their stem forms, e.g., the lemmas -hamba, -sebenza and -khuluma. No effort towards the lemmatisation of the nominal forms ukuhamba, ukusebenza or ukukhuluma is made in the ZED and no effort towards treating nominal meanings in the articles of -hamba, -sebenza and -khuluma has been made. Ironically, the advocates of stem lemmatisation are forced to lemmatise full words with uku-, e.g., in the case of the ZED for ukuthi (conjunctive) 'so that', ukufa (interjection) 'how magnificent!', ukuba (conjunctive) 'that (after verbs of knowing, etc.)', ukuphela 'only that', etc. These words belong to different parts of speech and the lemmatisation strategy could be justified. The problem lies with the fact that users are conditioned to ignore infinitive prefixes in the process of determining the lemma, i.e., not to consider the uku-, and therefore they will look up -thi, -fa, -ba and -phela and indeed find such lemmas with treatment in the ZED without any cross-reference to the lemmatised full forms ukuthi, ukufa, ukuba and ukuphela.
Ukuthi, as a conjunctive or connective, is the most frequently used word in isiZulu and the lack of a cross-reference could simply mean that users will not find the meaning of ukuthi 'so that, in order that', which represents 90% of its use. IsiZulu dictionaries, such as the Compact Zulu Dictionary (CZD) and the English and Zulu Dictionary (EZD), do not handle this issue satisfactorily, either not lemmatising the conjunctive or not giving a cross-reference.
(foll. by subjunct.) so that, in order that.

The strengths and shortcomings of stems lemmatised with their prefixes
This is the lemmatisation strategy employed by dictionaries, such as the ISZ, the Concise SiSwati Dictionary (CSD), the Dictionary of the Tebele & Shuna languages (DTS) (the latter as quoted by Gauton (forthcoming)), and even for a disjunctively written language, Sesotho, in the Southern Sotho-English Dictionary (SSED), where stems are lemmatised but the full form of the word is given. In the ISZ, for verbs, it entails lemmatising the stem and presenting it in boldface but adding the infinitive prefix uku- in italics, e.g., ukusebenza will be lemmatised as ukusebenza in the alphabetical stretch S of the dictionary. Likewise, the noun umsebenzi will be lemmatised as umsebenzi under S by giving its full form with the stem in bold and the class prefix in italics as in (8).
(8)
Adding the prefix has certain advantages, e.g., the reassuring factor (in cases of a 1-1 match): the user wants to look up isidaka, looks for -daka and finds all the different full nominal forms including isidaka. So (s)he knows that the process of information retrieval has been successful. The ISZ goes even further by implicitly giving morphological information about the prefix. Examples, such as ú(lu)sinsi† 'hair growing low on the forehead', í(lí)fasíkoti* 'apron' and í(lí)bhoklólo° 'a brave, confident male', contain additional information, i.e., that the full forms of the prefixes are, respectively, ulu- and ili-. Presenting the full word also enables the indication of tone. The symbols '†', '*' and '°' following the lemma indicate that no plural form exists for ú(lu)sinsi; that í(lí)fasíkoti is a loan word; and that í(lí)bhoklólo is a neologism. Apart from giving the infinitive class prefix with verbal stems, verbal suffixes are given in brackets following the stem, i.e., not as separate lemmas, e.g., úkúlobola ... [-an-, -el-, -ek-, -is-, -w-]. Indicating a number of frequently used verbal suffixes in this way does enhance the comment on form but does not contribute to the comment on semantics. For the user, it means that stems containing these suffixes have to be looked up under the basic stem, and the meanings conveyed by these suffixes then have to be added on. It also gives insufficient guidance in cases where sound changes occur as a result of affixation. As Masubelele (2007: 460) rightfully remarks, "variants of words which are the result of phonological processes, such as the passive construction have been omitted, e.g., úkúlobola which changes to úkúlotsholwa in the passive". The same holds true for the passive form of -sebenza, where the inexperienced user is unlikely to link the passive form -setshenzwa with -sebenza+w. It would be better to lemmatise and treat derivations containing these suffixes as in (3). Gouws and Prinsloo (2005b: 29) refer to the lemmatisation strategy where stems are lemmatised, but the full forms of the words are given, as left-expanded article structures:
[...] a left-expanded procedure [...] can [...] accommodate the prefixal element in a slot preceding the stem. This phasing out of the prefixal element to the article-initial position does not, however, change the status of the lemma sign as guiding element of the article because the lemmatization is still done according to an initial-alphabetical ordering in which the stem is the alphabetical point of reference.
The verb stems -hamba, -hambela, -hambelana, -hambisa and -hambisana are lemmatised with the infinitive class prefix ku-. Full nouns, i.e., sihambi, umhambi and luhambo, are given, but they are lemmatised on their stem forms -hambi and -hambo. As in the case of the verbs, the alphabetization is done on H, the first letter of the stem, as indicated by the arrows in (9a). Gouws and Prinsloo (2005b) suggest that the first letters of the stems should be vertically aligned to visually strengthen the alphabetical alignment on H in (9b). In the ISZ, vertical alignment on the first letter of the stem has also not been done, as in (8), but indentation equal to approximately three characters and the contrast between italics for prefixes and boldface for lemmas provide for a user-friendly layout and alleviate the need for vertical alignment.
Left-expanded article structures can in principle be extended to go beyond left-expansion of class prefixes to other types of prefixes and prefixal combinations, such as conjunctives and concords. The lexicographer could for example decide to lemmatise words that occur with a high frequency in the corpus, such as nokusebenza (121) 'and to work', ngokusebenza (56) 'by working', wayesebenza (57) 'he/she/it was working' and ukuyosebenza (49) 'to go and work' in (10).
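The difference between alphabetising on the stem and alphabetising on the full word can be sketched with the forms just listed. The split of each form into a prefix and a stem is taken from the text; the tuple representation and variable names are hypothetical, and plain string comparison stands in for whatever collation a real dictionary would use.

```python
# (prefix, stem) pairs for the forms quoted in the text.
entries = [
    ("ku", "hamba"), ("ku", "hambela"), ("ku", "hambelana"),
    ("ku", "hambisa"), ("ku", "hambisana"),
    ("si", "hambi"), ("um", "hambi"), ("lu", "hambo"),
]

# Left-expanded ordering: sort on the stem, prefix only breaks ties.
by_stem = sorted(entries, key=lambda entry: (entry[1], entry[0]))

# Word ordering: sort on the full orthographic form.
by_word = sorted(entries, key=lambda entry: entry[0] + entry[1])

for prefix, stem in by_stem:
    # Pad the prefix so that the first letter of every stem lines up,
    # in the spirit of the vertical alignment suggested for (9b).
    print(f"{prefix:>2}{stem}")
```

Under stem ordering all eight forms stay together in the stretch H, whereas word ordering scatters them over K, L, S and U.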
(10)
On the one hand, the lemmatisation strategy in the ISZ and the CSD shows characteristics of stem lemmatisation in the sense that alphabetical ordering runs on the first letter of the stem, thus ignoring the different nominal and verbal prefixes. On the other hand, it resembles a word dictionary since full words, i.e., the full infinitive form of, e.g., the verb, kuhamba 'to go', and the full form of the noun sihambi 'visitor/tourist', are lemmatised. An even closer resemblance to word dictionaries is found in the ISZ's layout, e.g., by not putting the infinitive prefix in brackets or separating the prefix from the stem by means of a hyphen, as was done in the CSD, but using the normal orthography. Masubelele (2007: 459) quotes the following paradigm: í(lí)daka 'dry cattle dung', ísídaka 'black soil', ú(lú)daka 'mud', úkúdaka 'to become drunk' and úm(u)daka 'heavy, brown bracelet, bestowed as royal honour'. Advocates of the word tradition for lemmatisation will be quick to point out that no gain in terms of space saving is achieved in such cases. Giving the prefix has certain advantages. First, the reassuring factor should not be underestimated. It is of special value in the case of those class 9 nouns quoted by Van Wyk (1995: 90) where lexicographer and user have difficulty in identifying the stem form. Secondly, utilizing the opportunity to give additional morphological and tonological information is a positive aspect, provided that the user is familiar with the tonal markers. The convention used to indicate the full form of the prefix, however, carries the risk of misinterpretation that, e.g., both usinsi and *ulusinsi; ifasikoti and *ilifasikoti; and ibhoklolo and *ilibhoklolo are grammatical, because this convention normally suggests the part in brackets as being an alternative. Thirdly, the use of the symbol '†' to indicate that the noun does not have a plural form, or '‡' in the case of certain plurals not having a singular form, is a positive aspect. This convention has, however, to be weighed against the convention '(x/y)' as, for example, used in the OZSD, where both singular and plural class numbers are given and related and the appropriate class to which the word belongs is indicated by boldface as in (14). Masubelele (2007: 459) regards the fact that no plural indication is given in the ISZ as a problem, "since only a singular noun prefix is given with each stem, this might be problematic, especially to users who are not mother-tongue speakers, because they might not know what the plural form of the specific word is". The question, however, is whether the user who wants to find the meaning of umsebenzi 'worker' will be interested to know what the plural form is. The problem is rather the amount of knowledge required from the user to look up singular or plural forms in dictionaries employing a left-expanded strategy, and whether this strategy contributes in any way to resolving the problem of stem identification. The answer is no: although this lemmatisation strategy provides for user-friendly elements, such as additional information and reassurance, stem identification still has to be done.
Returning to the issue of infinitives: in dictionaries following a left-expanded lemmatisation strategy, the lemmas will (for example) be ukuhamba, ukusebenza and ukukhuluma, respectively. In the case of ukufa 'die, death; how magnificent!' and ukuthi 'say, a/the saying; so that', all three semantic distinctions in each case are accounted for and accommodated together in, say, two subsequent lemmas as in (11) and (12).
(11)

Advantages and disadvantages of word lemmatisation
The title of De Schryver's (2010) text, Revolutionizing Bantu Lexicography – A Zulu Case Study, suggests that word lemmatisation has fundamentally transformed the lexicography of the African languages. Word lemmatisation for nouns, in word dictionaries where both singular and plural forms are lemmatised, means that the full singular and plural forms will be lemmatised with alphabetical sorting on the first letter of the word, e.g., umsebenzi 'work, worker', abasebenzi 'workers' and isisebenzi 'employee' can be looked up directly under u, a and i, respectively.
(13) OZSD
The OZSD goes beyond the lemmatisation of the basic nouns umsebenzi, abasebenzi, imisebenzi, isisebenzi and izisebenzi, cf. (5a) above, and offers articles for derived forms, such as ekusebenzeni 'in the working', ekusebenziseni 'in using', ekusetshenzisweni, emisebenzini 'at work' and emsebenzini 'at work'. Word lemmatisation also solves the difficulties mentioned by Mtuze (1992), Van Wyk (1995) and Mpungose (1998) for those words where it is difficult to identify the stem, e.g., intaba, intombi, inkosi and inkabi. In a dictionary where full nouns are lemmatised, the problem is avoided by the lemmatisation of these forms exactly as they are. So, if the user is given the nominal form, access to the lemma is straightforward and easy.
A typical argument against the lemmatisation of the full forms of nouns, as echoed by Van Wyk (1995: 95), is that the alphabetical stretches into which nouns fall, especially U, A and I, will be very large, because nouns in classes 1, 3, 11 and 14 begin with u-, classes 2 and 6 with a-, and classes 4, 5, 7, 8, 9 and 10 with i-. Van Wyk's estimate for isiZulu is U: 18%, I: 20% and A: 5%. In the OZSD, i- takes up 62 pages, representing 23.5%, i.e., almost a quarter of the dictionary, u- 40 pages (15.2%) and a- 14 pages (5.3%). However, users are unlikely to find this at all disturbing, as can be judged by looking at a typical example taken from the alphabetical stretch isi- in (14). In the Collins COBUILD English Dictionary (COBUILD), the alphabetical stretch CON- is almost 30 pages long and to the best of our knowledge no complaints have been voiced in this regard.
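The page counts and percentages quoted for the OZSD can be cross-checked with a few lines of arithmetic. The total number of pages in the relevant section is not stated in the text; the value of 264 used below is an assumption, inferred only because it reproduces all three quoted percentages, and should be replaced by the real figure where known.

```python
# Pages per alphabetical stretch, as quoted in the text.
pages = {"i": 62, "u": 40, "a": 14}

# ASSUMPTION: total page count inferred from the quoted percentages,
# not given in the source.
ASSUMED_TOTAL_PAGES = 264

shares = {
    letter: round(100 * count / ASSUMED_TOTAL_PAGES, 1)
    for letter, count in pages.items()
}
combined_pages = sum(pages.values())

print(shares)
print(combined_pages)
```

Under this assumed total, the three vowel stretches together account for 116 pages, so well under half of the section is taken up by the letters that the objection singles out.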

(14) OZSD
A second argument against the lemmatisation of the full forms of nouns pertains to the lemmatisation of plural forms of nouns, first, in terms of the additional space in the dictionary taken up by these lemmas and, secondly, that it results in overuse of the mediostructure (cross-referencing system), because all such lemmas function as cross-references to the singular forms.It cannot be denied that lemmatising plural forms takes up a great deal of additional space.However, in terms of the reassuring aspect mentioned above as well as the amount of information carried by these skeleton dictionary articles as in ( 15), their inclusion in the macrostructure could be justified.First, the user is reassured that (s)he is dealing with the correct lemma; secondly, information on the frequency of use is indicated (by means of e.g.*, ** and ***); thirdly, noun class information is provided; and finally a cross-reference is given to the singular form where full treatment is offered.
(15) OZSD Consider now the presumed or likely dictionary needs of a learner of isiZulu who wants to use the dictionary for text production.First, a typical situation in class is considered where the learner is instructed to find the meaning of a number of isiZulu words, say nouns, or to write an essay on abasebenzi 'workers'.In the latter case, the users simply take the dictionary, look for the lemma abasebenzi and the worst that can happen is of (s)he having to follow up a crossreference to the singular form umsebenzi where appropriate treatment is offered.No problematic stem identification is required as in the case of stem or left-expanded lemmatisation discussed in the previous paragraphs.However, a substantial part of learners' needs is to find the meanings of words used in their prescribed books, especially isiZulu literary works, such as novels, poetry and prose.
This means that learners are from the outset confronted by the full/complex orthographic forms, of which there can be more than 2 000, e.g., in the case of -sebenza. In order to find the meaning of nomsebenzi (309), ngomsebenzi (253), emsebenzini (1 456), somsebenzi (112), etc., the learner has to identify the noun. This has to be done principally by stripping off affixes. Even more problematic are cases where (s)he has to add characters to the word in order to reconstruct the full noun before looking it up, e.g., umsebenzi for msebenzi (470). Adding or stripping affixes in order to find the word form for the word search is as challenging to the user as is stem identification. This unfortunately means that the problem of stem identification is simply replaced by the challenge of identifying word forms. Advocates of the stem tradition could argue that, if identification of the lemma entails the selection of a section of the complex orthographic word anyway, why not then also cut the noun prefixes, which brings one back to stem lemmatisation? It is not possible to lemmatise the entire paradigms of all isiZulu words in printed dictionaries. Prinsloo (2010) tries to make a case for selection on the basis of frequency in this regard by saying that the lexicographer should ensure that the frequently used forms are included. The OZSD indeed lemmatises quite a number of frequently used derivations of -sebenz-, i.e., abasebenzi, ekusebenzeni, ekusebenziseni, ekusetshenzisweni, emisebenzini, emsebenzini, imisebenzi, isisebenzi, izisebenzi, -sebenzela, -sebenzisa, -sebenzisana, -setshenziswa, -setshenzwa, ukusebenza, ukusetshenziswa, umsebenzi. This is useful, but for the learner reading an isiZulu novel, the low-frequency words also need to be decoded for him/her to understand the specific utterance.
Returning to the infinitive, dictionaries following word lemmatisation will, by default, also at least have the lemmas -hamba, -sebenza and -khuluma, honouring the non-disputed stem lemmatisation approach for verbs, but will lemmatise infinitives as nouns according to the default word lemmatisation strategy for nouns, i.e., on the first letter of full forms. The OZSD accordingly lemmatises the full forms of a number of frequently used infinitive nouns, such as ukudla 'food', ukuhamba 'departure' and ukukhuluma 'a/the talking' in the alphabetical stretch uku- and treats them appropriately for their nominal meanings. These infinitive nouns, however, stand in contrast with the infinitive verbs ukudla 'to eat', ukuhamba 'to walk/go' and ukukhuluma 'to speak' in isiZulu. Ukudla, ukuhamba and ukukhuluma have, therefore, also been lemmatised in the OZSD under their stem forms -dla, -hamba and -khuluma, with applicable treatment for their verbal meanings. However, as argued above, dictionary users become used to looking up infinitive verbs under their stem forms. When looking up ukudla, ukuhamba and ukukhuluma, the user is unlikely to consider the possibility that (s)he should also check under uku- for the possible existence of an infinitive noun with the same stem. As in the case of lemmatising stems, a cross-reference from the articles of the verb stems to the full nouns is imperative in such cases. In many instances the nominal and verbal meanings are closely related, e.g., ukukhuluma 'to talk; a/the talking', ukuhamba 'to travel; a/the travelling', ukusebenza 'to work; a/the working', but in cases such as the infinitive noun ukujula 'depth', the infinitive verb stem, i.e., -jula, means 'consider carefully'. A cross-reference from -jula to ukujula is imperative to avoid misguiding the user. Inserting such cross-references would of course require additional space in the dictionary. In addition, users should be alerted in the user's guide to the dictionary to check for possible nominal forms when looking up infinitives under their verbal stems and vice versa.
There should be no doubt that word lemmatisation contributes substantially to reducing the problems that stem lemmatisation causes to users. However, the problem of stem identification or word identification is still not solved, and probably never will be solved in printed dictionaries, which moves the focus to electronic dictionaries.

Electronic dictionaries for isiZulu - the final frontiers?
In the early nineties, the electronic era was met with great enthusiasm, and high expectations were expressed in relation to electronic dictionaries (EDs) and their enormous potential to supersede printed or paper dictionaries in imaginative ways. As the title Lexicographers' Dreams in the Electronic-Dictionary Age of De Schryver (2003) suggests, early publications on EDs were dreams about the potential of the new medium and the expected revolution it would bring along, such as antiquating the paper dictionary in a decade or two. These publications list dozens of advantages of EDs, such as accessibility, user-friendliness and especially the availability of space and processing speed. Many of these issues are discussed in detail by Dodd (1989), Bolinger (1990), Atkins (1996), Nesi (1999), Geeraerts (2000), Harley (2000) and Prinsloo (2001), to name but a few. Meijs (1990) even predicted the end of the paper dictionary by 2000. Prinsloo (2005) believes that the potential of electronic dictionaries lies in the utilization of what he calls true electronic features, such as pop-up access, bringing together of related items, new routes to the data, less dependency on alphabetical order, fuzzy spelling, intelligent extrapolation of characters keyed in and audible pronunciation. For the purposes of this article, the question is what the status of currently available isiZulu dictionaries is in terms of lemmatisation and solving the issue of stem or word identification that dominates the discussion in the previous sections of this article.
In principle, catering for all of the approximately 2 500 occurrences of -sebenz- in the PZC is not a problem in electronic dictionaries, given the almost unlimited available space and the speed of information retrieval, cf. Prinsloo (2001) and De Schryver (2003). The question, however, is whether this goal has been achieved.
A number of electronic dictionaries and word lists are available for isiZulu, such as the Webster's Online Dictionary, Freelang.net and Dicts.info. However, the most sophisticated online dictionary is the isiZulu.net. The major strong points of this dictionary are that it is extensive; there is no need for stem search; and it automatically gives a morphological analysis of the stem plus affixes. The isiZulu.net offers some promising features in solving the most problematic cases discussed in terms of Van Wyk (1995) above where stem identification is problematic. To illustrate: it offers two access routes to impilo, i.e., impilo and mpilo, and both intombi and ntombi for intombi. Plural forms of these nouns can also be directly looked up by typing their full forms, izimpilo and izintombi, respectively. In addition, this dictionary is useful in cases where the learner finds it difficult to isolate stems/words.
From the examples given in (16), it is clear that the stem identification problem has at last been resolved. The inexperienced learner can simply type in the word or part of it and is (re)routed to the appropriate lemma. Moreover, the quality of the treatment is good.

(16)

The question, however, is how comprehensive the isiZulu.net is in terms of coverage of entire paradigms of words, such as the paradigm for -sebenz- as given in the appendix. Formulated differently, can all orthographic forms of nouns and verbs in isiZulu be looked up in the isiZulu.net? To answer these questions, the isiZulu.net was subjected to a number of random tests in terms of the paradigm for -sebenz-, as well as to random selections from a number of published isiZulu dictionaries.
For the inexperienced user, the automatic guidance from msebenzi to umsebenzi, abasebenzi and imisebenzi is excellent because, as mentioned above, no addition of characters to the word is required to look it up.

(17)

In the case of the successful automatic retrieval of umsebenzi from lomsebenzi (lo+umsebenzi), nomsebenzi (na+umsebenzi) and ngomsebenzi (nga+umsebenzi), the results are equally satisfying, because the search was successful and morpho-phonological processes are reversed by the dictionary and presented to the user in a clear and user-friendly way.
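How such a reversal of morpho-phonological processes might work can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not a description of how the isiZulu.net is actually implemented: the three prefix rules are taken only from the examples quoted above, while the toy lexicon, the rule table and the function name find_headwords are hypothetical.

```python
# Toy lexicon of full-form headwords (illustrative only).
LEXICON = {"umsebenzi", "abasebenzi", "imisebenzi"}

# Each rule: (surface prefix, underlying formative, initial vowel to restore).
# Based only on the three examples in the text:
# lo+umsebenzi, na+umsebenzi, nga+umsebenzi.
RULES = [
    ("ngo", "nga", "u"),
    ("no", "na", "u"),
    ("lo", "lo", "u"),
]


def find_headwords(form, lexicon=LEXICON):
    """Return (formative, headword) pairs that could underlie a typed form."""
    form = form.lower()
    if form in lexicon:
        # The typed form is itself a headword; nothing to undo.
        return [(None, form)]
    results = []
    # Undo a fused prefix by restoring the noun's initial vowel.
    for surface, formative, vowel in RULES:
        if form.startswith(surface):
            candidate = vowel + form[len(surface):]
            if candidate in lexicon:
                results.append((formative, candidate))
    # Handle forms typed without their initial vowel, e.g. msebenzi.
    for vowel in "uia":
        candidate = vowel + form
        if candidate in lexicon:
            results.append((None, candidate))
    return results


for typed in ["nomsebenzi", "ngomsebenzi", "lomsebenzi", "msebenzi"]:
    print(typed, "->", find_headwords(typed))
```

Each of the four forms is routed to umsebenzi, with the formative reported where one was stripped. A form such as emsebenzini, which also carries a locative suffix, returns no result with these three rules, which illustrates how quickly the rule set has to grow to cover an entire paradigm.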
For the second test, ten words that occur five times in the PZC were selected from the paradigm for -sebenz-, as given in Table 1. Even though these words occur with a low frequency, six were found in the isiZulu.net. Example (18) reflects the quality of treatment for ngingasebenzi in the isiZulu.net.

(18)

First, decomposition of ngingasebenzi to -sebenza gives a useful morphological breakdown into stem with prefixes. The user learns that (s)he is dealing with a derivation of -sebenza. Secondly, a translation equivalent of the full word ngingasebenzi, 'I not work' (even though the latter is a very direct translation), is given. So, for this example one could say that the problem of stem/word identification has been resolved and the user finds sufficient comment on form as well as comment on semantics of the full word.
For the third test, the first lemma on every 25th page of the WAZ (nine lemmas) was studied in terms of its inclusion in or omission from the isiZulu.net and its presence in the PZC. It was found that five of the nine lemmas occurred in the PZC. Only three were lemmatised and treated in the isiZulu.net. A similar selection of the first lemma on every 50th page of the ZED, checked for inclusion in or omission from the isiZulu.net and the PZC, revealed that of the 19 lemmas in question seven occurred in the PZC and three in the isiZulu.net. From these three tests, it is clear that the isiZulu.net electronic dictionary performs well on the more frequently used words, but substantial enlargement will be required to cover less frequently used words as well.
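The counts reported for the second and third tests can be expressed as coverage rates. The short sketch below simply recomputes percentages from the figures given in the text; the sample labels and the function name coverage are illustrative, and with samples of 9 to 19 items the percentages should be read as indicative only.

```python
def coverage(found, total):
    """Percentage of sampled items that were found, rounded to one decimal."""
    return round(100 * found / total, 1)


# (sample, sample size, found in isiZulu.net), as reported in the text.
isizulu_net = [
    ("-sebenz- forms occurring five times in the PZC", 10, 6),
    ("WAZ: first lemma on every 25th page", 9, 3),
    ("ZED: first lemma on every 50th page", 19, 3),
]

# (sample, sample size, found in the PZC), as reported in the text.
pzc = [
    ("WAZ: first lemma on every 25th page", 9, 5),
    ("ZED: first lemma on every 50th page", 19, 7),
]

for label, total, found in isizulu_net:
    print(f"isiZulu.net, {label}: {found}/{total} = {coverage(found, total)}%")
for label, total, found in pzc:
    print(f"PZC, {label}: {found}/{total} = {coverage(found, total)}%")
```

The isiZulu.net thus covers 60% of the low-frequency -sebenz- forms sampled, but only about a third of the WAZ sample and roughly 16% of the ZED sample, fewer in both cases than occur in the PZC.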
The least that electronic dictionaries could do is to link paradigms, such as those in the appendix, to the stem/word, i.e., -sebenza in this case.
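A minimal sketch of such linking is a reverse index from every orthographic form in a paradigm to its lemma, so that any listed form typed by the user leads to the treatment of the stem. The forms below are taken from those cited in this article, while the data layout and the names PARADIGMS, build_index and lemmas_for are assumptions made for illustration only.

```python
# A paradigm lists the orthographic forms that belong to one lemma.
# Only a handful of the forms cited in this article are included here.
PARADIGMS = {
    "-sebenza": [
        "umsebenzi", "abasebenzi", "imisebenzi", "emsebenzini",
        "nomsebenzi", "ngomsebenzi", "somsebenzi",
        "ngingasebenzi", "ukusebenza",
    ],
}


def build_index(paradigms):
    """Map each orthographic form to the set of lemmas it belongs to."""
    index = {}
    for lemma, forms in paradigms.items():
        for form in forms:
            index.setdefault(form, set()).add(lemma)
    return index


INDEX = build_index(PARADIGMS)


def lemmas_for(form):
    """Return the lemmas under which a typed form is treated, if any."""
    return sorted(INDEX.get(form.lower(), set()))


print(lemmas_for("ngingasebenzi"))
print(lemmas_for("Emsebenzini"))
```

Because the index is built once from the paradigm lists, adding the full appendix for -sebenz- would require no change to the lookup itself, only more data.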

Conclusion
The weakest option for lemmatising nouns and verbs in isiZulu is to lemmatise verbal stems without suffixes and, in the case of nouns, noun stems without their prefixes and without the augmentative and diminutive nominal suffixes. This lemmatisation strategy is not user-friendly; stem identification is a major obstacle; a vast amount of knowledge of morphophonetics is presupposed; and the user is often in doubt whether (s)he has successfully retrieved information. Even if users do manage to identify the stem and to look it up, all the additional information conveyed by the affixes has to be 'added back on' and the user will not know for sure whether (s)he has come to the right conclusion. Lemmatising verb stems represents a slight improvement. At least the meanings of the suffixes need not be artificially added on as in the case of lemmatising stems without their suffixes.
Lemmatising stems with their prefixes merely added on (left-expanded) is a better option, because the user has the advantage of seeing the full form of infinitive verbs and the full forms of nouns with additional information, such as tonal indication.This strategy is more user-friendly, but stem identification remains problematic and a substantial amount of knowledge of morphophonetics is still presupposed.
Word lemmatisation applicable to nouns is by far the better strategy, because nouns can be looked up under the first letter. For given non-derived nominal forms, the problem of stem identification is solved for all nouns. This strategy is especially beneficial for those nouns where stem identification is problematic. The strategy is user-friendly and no knowledge of the grammar is presupposed. However, for nominal and verbal derivations, especially those where nominal and verbal stems occur with multiple prefixes, the problem of stem/word identification remains unsolved.
The problem of word/stem identification, which is present in all of the lemmatisation strategies employed for isiZulu, can only be solved in electronic dictionaries. Most electronic dictionaries are mere translated word lists and are not of much use to the target users, especially for their productive needs. A clear exception is the isiZulu.net online dictionary, where the problem of stem/word identification has been solved for most of the frequently used words in isiZulu, but more comprehensive electronic isiZulu dictionaries are required to alleviate the need for stem/word identification for less frequently used words as well.

Table 1: A random selection of derivations of -sebenz- occurring 5 times in the PZC and their presence or absence in isiZulu.net