The Lexicographic Treatment of Quantitative Pronouns in Zulu

Abstract: In Zulu, there are three kinds of quantitatives: inclusive, exclusive and numeral. For the lemmatization of these, even existing traditional dictionaries felt the need to move away from a pure 'stem' approach towards a 'word' approach. In a new Zulu–English dictionary project, this is not only confirmed, but is taken one step further with particular attention to the microstructure.

Keywords: LEXICOGRAPHY, DICTIONARY, BILINGUAL, CORPUS, LEMMATIZATION, FREQUENCY, ZULU (ISIZULU), ENGLISH, QUANTITATIVE PRONOUN, INCLUSIVE QUANTITATIVE PRONOUN, EXCLUSIVE QUANTITATIVE PRONOUN, (INCLUSIVE) NUMERAL QUANTITATIVE PRONOUN, USER-FRIENDLY

[...] lemmas, together with their lemmatized corpus frequencies, constitute the backbone of the dictionary's Zulu macrostructure. As it turns out, each Zulu lemma with a lemmatized corpus frequency of at least 50 needs to be considered for inclusion in the dictionary.
The three largest categories of Zulu parts of speech are, unsurprisingly, the different types of nouns, verbs and adverbs. Together, and in a dictionary that contains the top 5 000 lemmas only, these three POS categories cater for about 80% of the Zulu lexicon. We refer to these three as 'Group 1'. The other POS categories can be divided into two further groups based on the number of members in each POS category. 'Group 2' consists of those POS categories with around 100 or (slightly) more members: the relatives, adjectives, conjunctions, possessive pronouns, and ideophones. 'Group 3' consists of all the rest, thus POS categories with (much) fewer than 100 members each; these include the interjections, enumeratives, demonstrative pronouns, quantitative pronouns, relativized possessive pronouns, locative demonstrative copulatives, absolute pronouns, etc.
A detailed study of the top three POS categories, thus Group 1, will be undertaken in future contributions. Given that two POS categories from Group 2 (viz. adjectives, with 126 members, and possessive pronouns, with 99 members) have already been looked into, it is now appropriate to briefly engage with one of the smaller POS categories from Group 3, in casu the quantitative pronouns, with just 33 members. In dictionary terms, the three categories discussed so far amount to just 5.16% of the planned total of 5 000 dictionary articles, as illustrated in Figure 1. The possessive pronouns amount to 1.98% of the total, the adjectives to 2.52%, and the quantitative pronouns to 0.66%. There is no reason to believe that there is a correlation between the size of a particular POS category and the lexicographic difficulty of that category. Each POS category deserves a discussion in its own right, and once all categories have been covered, cross-POS discussions will surely be required for it all to make even more sense. Nor is there a reason to believe that there is a strong correlation between the size of a particular POS category and the summed frequency of its members. The final say on these aspects will only be possible near the end of the project, however. This said, the quantitative pronouns seem not to pose too many lexicographic problems. Yet neither are they trivial.
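The percentages quoted above follow directly from the member counts and the planned total of 5 000 articles. The short sketch below merely reproduces that arithmetic; the names `PLANNED_ARTICLES`, `pos_members` and `share` are illustrative and not part of the project's tooling.

```python
# Member counts per POS category, as reported in the text.
PLANNED_ARTICLES = 5000
pos_members = {
    "adjectives": 126,
    "possessive pronouns": 99,
    "quantitative pronouns": 33,
}


def share(count, total=PLANNED_ARTICLES):
    """Return count as a percentage of the planned articles (2 decimals)."""
    return round(100 * count / total, 2)


# Share of each category, and of the three categories taken together.
shares = {pos: share(n) for pos, n in pos_members.items()}
covered = share(sum(pos_members.values()))

for pos, pct in shares.items():
    print(f"{pos}: {pct}%")
print(f"covered so far: {covered}%")
```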

Zulu quantitative pronouns: A brief linguistic perspective
There are three types of quantitative pronouns in Zulu, all used to express quantity, viz. the inclusive quantitative pronouns, the exclusive quantitative pronouns, and the numeral quantitative pronouns. The inclusive quantitative stem is -nke, which means 'the whole' when referring to singular nouns and 'all' when referring to plural nouns. The exclusive quantitative stem is -dwa, for which the basic meaning is 'alone; only'. For the numeral quantitatives, any of the following adjective stems may be used: -bili 'two', -thathu 'three', -ne 'four', and -hlanu 'five'. The quantitative pronouns are formed as shown in Table 1.
In Table 1, one sees that subject concords (SCs) consisting of a vowel only change to their semivowel (u- > w-; i- > y-), while a- is dropped. The vowel of the other SCs is elided. The 1st person singular of the inclusive quantitative takes the form of class 1, while class 1 as well as the 1st and 2nd persons singular of the exclusive quantitative are irregular. These four forms, which do not follow the pattern, have been marked in bold. The formation of only one of the numeral quantitatives is illustrated in Table 1, namely with the adjective stem -bili 'two', the meaning of which becomes 'both'. The formation and meaning of the other three numeral quantitatives are similar. For classes 8 and 10, morphophonological rules apply: the nasal N is m before b (applies to -bili) and n elsewhere; the combination n + th becomes nt (applies to -thathu).
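The regular part of this formation can be sketched as a few string rules. The sketch below rests on several assumptions: the subject-concord inventory is taken from standard descriptions of Zulu rather than from Table 1, the pronominal root is taken to be -o- (as inferred from forms such as zonke and lodwa), and n + n is assumed to simplify to n because the forms cited further on are zone and zozine. The irregular forms (class 1 and the 1st and 2nd persons) are deliberately left out, and all function names are illustrative.

```python
# ASSUMPTION: subject concords per noun class, from standard Zulu grammars.
SUBJECT_CONCORDS = {
    "2": "ba", "3": "u", "4": "i", "5": "li", "6": "a", "7": "si",
    "8": "zi", "9": "i", "10": "zi", "11": "lu", "14": "bu", "15": "ku",
}
PRONOMINAL_ROOT = "o"  # ASSUMPTION: inferred from forms such as 'zonke'
SEMIVOWEL = {"u": "w", "i": "y"}


def reduce_concord(sc):
    """Apply the SC rules: a- is dropped, u-/i- become semivowels,
    and the vowel of any other SC is elided."""
    if sc == "a":
        return ""
    if sc in SEMIVOWEL:
        return SEMIVOWEL[sc]
    return sc[:-1]


def quantitative(noun_class, stem):
    """Form a regular inclusive ('nke') or exclusive ('dwa') quantitative."""
    return reduce_concord(SUBJECT_CONCORDS[noun_class]) + PRONOMINAL_ROOT + stem


def nasalize(adjective_stem):
    """Prefix the nasal N to an adjective stem (classes 8 and 10)."""
    if adjective_stem.startswith("b"):
        return "m" + adjective_stem            # N is m before b
    if adjective_stem.startswith("th"):
        return "nt" + adjective_stem[2:]       # n + th becomes nt
    if adjective_stem.startswith("n"):
        return adjective_stem                  # ASSUMPTION: n + n becomes n
    return "n" + adjective_stem                # n elsewhere


print(quantitative("2", "nke"), quantitative("6", "nke"), quantitative("5", "dwa"))
print("zo" + nasalize("bili"), "zo" + nasalize("thathu"))
```

Note that the sketch only generates the grammarians' basic forms; the class 6 variants wonke and wodwa discussed below are not produced by these rules.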
The information presented so far is what one typically finds in textbooks and linguistic analyses of Zulu (cf. e.g. Taljaard and Bosch (1993: 83-85), or Poulos and Msimang (1998: 124-129)). Such sources will also list a few more features, some of which will be encountered below. Conversely, corpus evidence reveals other features which none of the existing sources mention. Before these can be discussed, we need to turn from linguistics to lexicography.

Moving from the 'stem pole' to the 'word pole' in lemmatizing quantitative pronouns
As is well known, the lemmatization policy adopted in all existing Zulu dictionaries is to group the lexicon around word stems. So, in a dictionary such as Doke and Vilakazi's (1953) Zulu-English Dictionary, a user is able to look up the six stems -nke, -dwa, -bili, -thathu, -ne and -hlanu. Under both the inclusive and exclusive quantitative stems, this user is even given all the full forms listed in Table 1. Surprisingly, the full forms themselves have also been lemmatized. When it comes to the adjective stems -bili, -thathu, -ne and -hlanu, while not all full quantitative forms are listed within the articles of the stems, a cross-reference to the first (few) of the series is given. Here too, the full forms themselves have been lemmatized as well. Clearly, then, this is a hybrid approach, one in which Doke and Vilakazi is simultaneously acting as a stem and word dictionary. If the stem and word approaches are viewed as two poles on a continuum (De Schryver 2008a: 86-87), then one could say that Doke and Vilakazi physically move about on this continuum in their dictionary. No doubt, this hybrid approach was followed for reasons of retrievability, and thus user-friendliness. Interestingly, in Dent and Nyembezi's (1995) Scholar's Zulu Dictionary, a dictionary in which an attempt is made to make it easier to find words, the inclusive stem has not been lemmatized, while the exclusive stem, just like the adjective stems, has been. No list of all the forms may be found under the exclusive stem, however, while no further guidance at all with regard to the quantitative pronouns is given under the adjective stems. Yet again, the full forms (or at least, some of them, cf. the Addenda) have been lemmatized. It seems that, in an attempt to lower the threshold, the compilers of this dictionary moved even further away from the 'stem pole'.
Indeed, in a truly user-friendly dictionary there is little point in listing the six stems used in forming quantitative pronouns. This is even true for the adjective stems when used as adjectives (De Schryver 2008a). In a user-friendly dictionary, one thus moves radically away from the 'stem pole', towards the 'word pole'. Given that one is dealing with a continuum, the next obvious question is: 'Where to make the cut?' In other words, which formatives and/or prefixes does one keep for lemmatization? Here overall corpus frequencies quickly reveal that the forms as shown in Table 1 are also the ones that are best lemmatized. This will become clear in the discussion below.

Inclusive quantitative pronouns
Of the six quantitative pronouns, the inclusive quantitative pronoun is by far the most frequent, as may be seen from the data in Table 2. The actual breakdown of the inclusive quantitative pronoun has been tabulated in Addendum 1, where the left side of the table summarizes the corpus statistics and links these to the user-friendly Zulu-English dictionary under construction, while the right side shows the data as seen in Dent and Nyembezi's dictionary.
Clearly, all forms are frequent enough to be included in any user-friendly dictionary, which was consequently also done in both dictionaries. However, given that Dent and Nyembezi do not indicate for which classes certain translation equivalents apply, there is considerable room for confusion. For sonke, for example, their equivalents are 'all of us; all of it'. A more user-friendly approach is (1).
In (1), sense numbers are used to present each different class in its own right, while corpus examples illustrate each class. The same applies, mutatis mutandis, to all other classes. Poulos and Msimang (1998: 126) point out that 'it is … not uncommon to hear people say' wonke rather than onke (and likewise for the other class 6 quantitatives). The use of a corpus enables one (a) to see whether or not this is also reflected in the orthography, (b) if it is, to see how (un)common it really is, and (c) to use the results during corpus-driven dictionary compilation.
The orthographic form wonke occurs a staggering 6 414 times in 8.5 million words, so it is obviously not feasible to read through all concordance lines. What one can easily do is to sample, and this is what is done in Figure 2, where the software (WordSmith Tools, Scott 2008) has been instructed to randomly select every one-hundredth instance only. Rather surprisingly, and extrapolating from the sample, not only does wonke indeed occur in class 6, its estimated frequency there is as high as 1 489. The distribution across the different subcorpora (cf. the last column in Figure 2) is also even, with instances in short stories, dramas, newspapers, religious texts, etc. Furthermore, although grammars claim that the basic form for class 6 is onke, given that the frequency of onke is 1 659, while that of wonke in this class is 1 489, it is clear that both forms are simply used interchangeably. This new information may be embedded into the respective dictionary articles. Compare (2) and (3).

(2) onke ** inclusive quantitative pronoun cl.6 Compare wonke ► all ♦ Onke amehlo aphenduka abheka le moto. • All eyes turned around and looked at this car.

(3) Note: For class 6, the pronoun 'wonke' also has the variant form 'onke', which is only slightly more frequent in this class.
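The sample-and-extrapolate step described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a fixed every-hundredth thinning rather than the random selection performed by WordSmith Tools, the helper names `every_nth` and `extrapolate` are hypothetical, and the hit count in the usage lines is an invented placeholder, not the count underlying Figure 2.

```python
def every_nth(lines, n):
    """Thin a list of concordance lines by keeping every n-th line."""
    return lines[n - 1::n]


def extrapolate(sample_hits, sample_size, population_size):
    """Scale the number of hits in a sample up to the full population."""
    return round(population_size * sample_hits / sample_size)


# 6 414 concordance lines for 'wonke', represented here by their indices.
concordance = list(range(1, 6415))
sample = every_nth(concordance, 100)
print(len(sample))  # number of lines a lexicographer would actually read

# HYPOTHETICAL: suppose 15 of the sampled lines were judged to be class 6.
print(extrapolate(15, len(sample), len(concordance)))
```

Thinning to roughly 64 lines keeps manual inspection feasible, at the cost of an estimate that carries sampling error; the figure should therefore be read as an order of magnitude rather than an exact count.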
Note how a cross-reference and a usage note have been used in (2) and (3) respectively to bring all the information together. Further observe that frequent combinations (wonke uwonke) as well as derivations (esewonke) may all be treated under a single lemma such as wonke.
For the inclusive quantitatives, the latter is the exception rather than the norm, as for four of the nine inclusive quantitatives, the lemma sign is the only member of the paradigm.For the other five, (4) shows all the corpus forms that were brought together -during lemmatization -to obtain the lemma.
(4) Lemma signs with members other than the lemma sign itself

The forms without the final vowel are mostly found in poetry (in written Zulu; they are frequent in everyday speech), while those with the diminutive suffix -ana are used for extra emphasis. This leaves the instances in (5) to analyze.
(5) Analysis of some of the forms from (4) (with SC = subject concord)
esewonke = SC6 in situative mood e- + auxiliary verb -se + SC6 in situative mood e- + pronoun wonke = 'if they are now all together' (i.e. the sum/total, e.g. in exam papers) [cf. derivation under (3)]
uwonke = SC3 u- + pronoun wonke = 'everyone' [cf. combination under (3)]
isiyonke = SC4or9 i- + auxiliary verb -se + SC4or9 in situative mood i- + pronoun yonke = 'it is now all/complete'
sezizonke = auxiliary verb -se + SC8or10 in situative mood zi- + pronoun zonke = 'they are now all/complete'
kuzozonke = locative prefix ku- + short form of absolute pronoun 'zona' zo + pronoun zonke = 'at/to/... all of them'

If one keeps a perspective on the various frequencies as seen in (4), however, then it is clear that the most productive way to lemmatize the inclusive quantitative pronouns is indeed under their basic forms. Low-frequency members of some of the paradigms, then, should only be illustrated when their meanings are lexicalized, as was the case in (3).

Exclusive quantitative pronouns
The corpus statistics and Dent and Nyembezi's treatment for the exclusive quantitative pronouns have been tabulated in Addendum 2. In addition to the data seen there, Dent and Nyembezi also lemmatized two diminutives: yedwana (corpus frequency = 109) and yodwana (43). These are indeed the two most frequent diminutive exclusive quantitative pronouns, but assigning them lemma-sign status does not seem warranted. Not tying certain translation equivalents to particular classes is again problematic as well. What they missed outright, and what even Doke and Vilakazi failed to cover explicitly, is an extra meaning which corpus data clearly reveal for all singular classes (1, 3, 5, 7, 9 and 11). Compare (6). When a singular exclusive quantitative is preceded by a relative concord (RC), the meaning becomes '(only) one; (only) a single', as may be seen from the article for lodwa in (6). Lodwa is actually an extreme example, as the frequencies of both elilodwa and olulodwa are higher than that of the lemma sign itself; see (7).
(7) lodwa <2270>: elilodwa <1206>, olulodwa <479>, lodwa <355>, nelilodwa <168>, lilodwa <62>

Clearly, then, it is absolutely crucial to use this information during the compilation of articles such as lodwa. During the project, the lexicographers are in the fortunate position of having the data shown in (7) at their disposal. Indeed, for each and every lemma and linked lemmatized corpus frequency, all the members of each paradigm (together with their individual frequencies) are available right where they are needed in TshwaneLex (Joffe et al. 2008), the dictionary writing system used.
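The relation between a lemmatized frequency and the frequencies of the paradigm members in (7) is a simple sum, which the following sketch makes explicit. The variable names are illustrative and do not reflect how TshwaneLex stores these data.

```python
# Paradigm members of the lemma 'lodwa' with their corpus frequencies, from (7).
paradigm = {
    "elilodwa": 1206,
    "olulodwa": 479,
    "lodwa": 355,
    "nelilodwa": 168,
    "lilodwa": 62,
}

# The lemmatized frequency is the summed frequency of all members.
lemma_frequency = sum(paradigm.values())

# Rank the members from most to least frequent.
ranked = sorted(paradigm, key=paradigm.get, reverse=True)

print(lemma_frequency)
print(ranked)
```

The ranking shows the lemma sign lodwa in third position only, behind the two forms preceded by a relative concord, which is why the '(only) one' sense cannot be left out of the article.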
For class 6, corpus data indicate that the 'variant form' wodwa (175) is actually slightly more frequent than what grammarians consider to be the basic form, odwa (163). The dictionary articles for odwa and wodwa may therefore be treated in a similar way to that seen in (2) and (3).
Finally, if one extracts the various structures from each and every exclusive paradigm such as (7), one obtains all the possibilities listed in (8).
(8)
SC in indicative or situative mood + excl. pronoun
RC + excl. pronoun
excl. pronoun + diminutive -ana
negative morpheme in indicative mood a- + negative SC1 in indicative mood ka- + excl. pronoun
locative formative ku- + RC + excl. pronoun
associative formative na- + RC + excl. pronoun
instrumental formative nga- + excl. pronoun (+ diminutive -ana)

If one now considers the frequencies of each of the structures listed in (8) compared to the frequencies of the basic exclusive quantitative pronouns, then the statistics indicate that the latter are more frequent overall. This, then, is also why lemmatization was undertaken around the basic forms. In addition, for the two low-frequency exclusive quantitative pronouns, viz. odwa (163) and nodwa (73), the basic forms are also the only ones in the paradigm.

Numeral quantitative pronouns
The corpus and dictionary facts for the four numeral quantitative pronouns have been summarized in Addenda 3 to 6. Note the dramatic decrease in overall frequency going from 'both' to 'all five', viz. 3 931 → 451 → 174 → 34.
Clearly, and right away, none of the forms listed in Addendum 6 ('all five') qualifies to be lemmatized within the top 5 000 Zulu lemmas, given that the minimum lemmatized frequency is 50. Looking at the breakdown in Addendum 5 ('all four'), one concludes that none of these forms qualifies either. This leaves us with only 'both' and 'all three'. For these two numeral quantitative pronouns, one immediately notices that Dent and Nyembezi failed to lemmatize the most frequent form in each case! These are bobabili 'both (of them)' (cl.2) and bobathathu 'all three (of them)' (cl.2). This once again confirms why one needs a corpus rather than intuition in order to decide on what to include in and what to omit from a dictionary. As another example, womabili 'both (of them)' (cl.6) has not been lemmatized, while the infrequent yombili has been. Two more points must be considered. Firstly, the frequency of womabili is 375, higher than that of omabili, which has a frequency of 214. Compare in this regard the full 'variant' status of the other class 6 quantitatives (wonke/onke and wodwa/odwa) discussed above. Secondly, yombili, with a frequency of just 18, is 'suspect'. In the corpus, it appears once in a textbook, once in the Bible, three times in novels, and 13 times in newspaper articles. The textbook example is exactly that: a textbook example. It was taken from Doke and Vilakazi's dictionary, imfe yombili 'both pieces of sweet corn', which thus quantifies a class 9 noun, a singular, whereas all numeral quantitative pronouns are by definition supposed to quantify only items in the plural classes. Newspaper text is always suspect, and when in a huge text like the Bible only a hapax appears, one again has reason to doubt the status of that particular form. This leaves just three occurrences in novels, too few to make any linguistic claims, and far too few anyway to describe in a dictionary. (For completeness, the forms yomthathu, yomne and yomhlanu (luckily) do not occur in the corpus.) Three of the four class 14 numeral quantitative pronouns do not occur at all, namely those for 'all three', 'all four', and 'all five', while 'both' in class 14 has the lowest frequency of all forms in Addendum 3. Not treating any of these is thus the proper procedure. Corpus statistics further reveal that the so-called variant forms for classes 8 and 10 are actually more frequent than their basic forms: zozimbili (74) vs. zombili (943); zozintathu (14) vs. zontathu (61); zozine (30) vs. zone (30); zozinhlanu (2) vs. zonhlanu (7). This has direct implications for dictionary making, as the cross-reference must go from the less frequent to the more frequent form. Compare (9) and (10). Perhaps a note on the examples is necessary at this stage. Just as was the case for all other dictionary articles, whenever two different classes need to be exemplified in the main section of a dictionary article, an example for each has been selected. This was done so that the dictionary also serves as a didactic tool, which conveys information both explicitly and implicitly. Further observe that the Zulu word for 'bilabial' (obviously) also includes the adjective stem -bili 'two' (undebembili (Cl.1a/2a) < izindebe (Cl.11/10) 'lips' + mbili (Cl.10) 'two'). The frequency of undebembili in the general language is, however, too low for it to be lemmatized.
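The two decisions discussed above, whether a form reaches the minimum lemmatized frequency of 50 and in which direction a cross-reference should point, can be sketched as follows. Only frequencies cited in the text are used; the function names and the treatment of a tie (no preferred direction) are assumptions made for the purpose of illustration.

```python
MIN_LEMMA_FREQUENCY = 50  # threshold for the top 5 000 lemmas, as stated

# Competing forms with their corpus frequencies, as cited in the text.
pairs = [
    (("zozimbili", 74), ("zombili", 943)),
    (("zozintathu", 14), ("zontathu", 61)),
    (("zozine", 30), ("zone", 30)),
    (("zozinhlanu", 2), ("zonhlanu", 7)),
    (("omabili", 214), ("womabili", 375)),
]


def qualifies(frequency):
    """A form is a candidate lemma if it reaches the minimum frequency."""
    return frequency >= MIN_LEMMA_FREQUENCY


def cross_reference(pair):
    """Return (source, target): from the less to the more frequent form.
    ASSUMPTION: a tie yields None, i.e. no preferred direction."""
    (form_a, freq_a), (form_b, freq_b) = pair
    if freq_a == freq_b:
        return None
    return (form_a, form_b) if freq_a < freq_b else (form_b, form_a)


candidates = sorted(
    form for pair in pairs for form, freq in pair if qualifies(freq)
)
print(candidates)
for pair in pairs:
    print(cross_reference(pair))
```

Because each numeral quantitative is the only member of its paradigm, the frequency of the form equals the lemmatized frequency, so the threshold can be applied to the forms directly.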
Further note that the POS in (9) and (10) is 'inclusive numeral pronoun', as the numeral quantitatives may indeed be seen as being derived from the inclusive quantitatives (for which the inclusive stem has been dropped) followed by the adjectives.
Lastly, the reason why the numeral quantitatives (or thus the inclusive numeral pronouns) have been lemmatized under their basic forms is simply that each of these forms is also the only member of its paradigm. Nothing else is prefixed or suffixed here.

Sinclairian lexicography
Within a Sinclairian, corpus-driven approach to dictionary making, sound lexicographic decisions accompany every step of the compilation, from the use of a corpus for the construction of the macrostructure, which includes decisions on how to lemmatize each and every part of speech, all the way to a detailed analysis of meaning and the presentation thereof in a dictionary. This has been exemplified for the Zulu quantitative pronouns in this contribution.

Figure 1: Zulu POS categories studied from a lexicographic point of view

Table 1: The formation of the inclusive, exclusive and numeral quantitative pronouns in Zulu (with Cl. = noun class number and 1st and 2nd persons; SC = subject concord; PR = pronominal root; QStem = quantitative stem; AP = adjective prefix; AStem = adjective stem (illustrated for -bili only); N = nasal, i.e. n or m)

Table 2: Distribution of the quantitative pronouns (with Freq. = the summed frequency of all (lemmatized) forms)

Lokhu kudalela umndeni wonke inkinga. Kuzokwenziwa njani manje? • This caused a problem for the whole family. What is going to happen now?
♦ Wonke umuntu owayelapho wamangala kabi. • Each person who was there was very surprised.
2 cl.6 ► all ♦ Amandla wonke asemahlombe kaMnuz Bamba Ndwandwe. • All the authority rests on the shoulders of Mr. Bamba Ndwandwe.
3 1p sg ► the whole of me ♦ Sengiyibonile mina wonke. • The whole of me has now seen it.
4 2p sg ► the whole of you ♦ Abakithi bangilethele wena wonke ngogqoko. • My friends brought the whole of you to me on a meat tray.