A Computational Approach to Zulu Verb Morphology within the Context of Lexical Semantics

The central research question that is addressed in this article is: How can ZulMorph, a finite state morphological analyser for Zulu, be employed to add value to Zulu lexical semantics with specific reference to Zulu verbs? The verb is the most complex word category in Zulu. Due to the agglutinative nature of Zulu morphology, limited information can be computationally extracted from running Zulu text without the support of sufficiently reliable computational morphological analysis by means of which the essential meanings of, amongst others, verbs can be exposed. In this article we describe a corpus-based approach to adding the English meaning to Zulu extended verb roots, thereby enhancing ZulMorph as a lexical knowledge base.


Introduction
The integral role of the Internet and the world-wide web in facilitating the production and consumption of enormous amounts of information in digital space depends on the ability of computers to perform a wide variety of tasks involving human language. This requires, amongst others, computational approaches to representing and understanding world knowledge on the one hand, and knowledge about human language in machine-processable form on the other hand. Central to this endeavour is the notion of meaning or semantics, and more specifically lexical semantics, generally defined as the linguistic study of the meaning of individual words, and the meaning-related connections between words. Moreover, contemporary research in lexical semantics as such also relies on natural language processing (NLP) for a wide range of computational approaches and on large electronic corpora that have "revolutionized the possibilities of investigating usage patterns in real language across genres and cultures and further develop probabilistic usage-based ideas" (Paradis 2012). Typical computational lexical semantics tasks include word sense disambiguation in context, computing word similarity and word relatedness, as well as other relations between words, and semantic role labelling (Jurafsky and Martin 2009). In turn, NLP applications such as machine translation, question answering, information retrieval, information extraction, text classification and multilingual conversational agents, to name but a few, rely on these basic tasks in realising a digital space in which the users of diverse languages can participate in cross-lingual knowledge production and consumption. Performing computational lexical semantics tasks across languages brings the added complexity of requiring access to NLP support in multiple languages. For under-resourced languages it has become common practice to use a well-resourced language such as English as a type of pivot language for providing word meaning and cross-lingual lexical semantics.
Lexical semantic knowledge has up to now been captured mainly through two approaches. "The first is the knowledge-based approach, in which human linguistic knowledge is encoded directly in a structured form, resulting in various types of lexical knowledge bases. The second is the corpus-based approach, in which lexical semantic knowledge is learnt from corpora and then represented in either explicit or implicit manners" (Gurevych et al. 2016: xiii). Broadly speaking, lexical knowledge bases are knowledge bases that provide lexical information about words of a particular language.
In this article the focus is on Zulu, an official language of South Africa, which is, amongst others, characterised by its rich agglutinative morphology in which the verb is the most complex word category. In spite of its official status, Zulu is considered an under-resourced language. When dealing with under-resourced languages, it is common practice to use as much of the available language data and resources as possible. For this reason, both kinds of approaches to lexical semantic knowledge are employed: hand-crafted expert linguistic/lexical knowledge in machine-processable form as well as growing volumes of electronic Zulu text corpora. There is also a deeper linguistic justification for employing these two complementary approaches: The first exploits the regularity of linguistic structure, in our case the basic morphological structure and the so-called predictable meanings associated with morphemes, in this case the verb extensions. The second caters for the irregularities, the idiosyncrasies that occur in all languages, and for the "unpredictable" lexicalised meaning of extended verb roots. More specifically, we show how ZulMorph, a comprehensive hand-crafted finite state morphological analyser for Zulu, and the South African Constitution (SAC), a small electronically available parallel English-Zulu corpus which is an official document of the highest order, translated into all official languages, can contribute to Zulu lexical semantics with English as pivot language.

Basic approach
A lexical knowledge base (LKB) is a digital knowledge base "that provides lexical information about words" (Gurevych et al. 2016). Conceptually the most basic unit or entry in a lexical knowledge base is the so-called (lemma, meaning) pair. While our ultimate aim is to construct such pairs for all the words of Zulu, nouns and verbs are specifically important since they play a central role in knowledge representation: nouns usually name concepts about which information is represented and verbs often express relationships between concepts. Moreover, verbs are the morphologically most complex word category in Zulu.
The verb in Bantu languages, in general, incorporates a great deal of information, to the extent that it may even stand alone as a sentence. It is for this reason that we focus on the verb in this article.
We propose a computational approach based on ZulMorph. As a comprehensive hand-crafted finite state morphological analyser, ZulMorph not only contains lemmas of most Zulu words, based on various paper dictionaries, other language resources and text books for Zulu (Pretorius and Bosch 2003; Bosch and Pretorius 2006), but it is also arguably the most complete model of the morphological structure of Zulu words. So, when presented with a valid Zulu word, it provides the lemma as part of the full morphological analysis of the word. What ZulMorph does not yet provide is the meaning of the lemma.
Representing the meaning, also often referred to as the sense, of a lemma is well known to be hard (see, for example, Faruqui 2016) and has been studied extensively for a language such as English, generally considered a well-studied and digitally well-resourced language. Jurafsky and Martin (2009) provide an excellent introduction to and overview of computational approaches to the representation of word meaning and word sense in English. Therefore, since computational word meaning representation approaches and resources (Lazaridou et al. 2013) for Zulu are not readily available, we propose a cross-lingual approach with English as pivot language for providing the meaning of a Zulu lemma. More specifically, we enhance ZulMorph to output a lemma, as well as its English translation equivalent as the meaning of the lemma. Endowed with this added capability, we then propose that ZulMorph, as basic Zulu LKB, would enable the user to rely on the rich computational infrastructure of English word meaning representation in further processing and applications.
http://lexikos.journals.ac.za
The structure of the article is as follows: Section 2 outlines the approach followed to address the stated problem. Section 3 provides a brief overview of Zulu verb morphology with specific reference to verb extensions, their complexity, their predictability of meaning and related lexicalisation issues. We specifically emphasise morphological (lemma) and semantic (meaning) challenges. In Section 4 ZulMorph is presented as an approach to lemmatisation. As before, the focus is on verbs, their roots and their extensions. In Section 5 the hand-crafting of a basic Zulu LKB from existing paper dictionaries and grammar texts as a "snapshot" of Zulu lexical semantic information is presented. In Section 6 the focus is on a corpus-based approach to semi-automatically extracting new verb roots, new extensions and new lexicalised meanings for possible addition to the ZulMorph-based LKB. Section 7 concludes the article and provides suggestions for future work.

Zulu verb morphology
The morphological composition of the verb is considerably more complex than that of any other word category in Zulu. A number of slots, preceding and also following the verb root, may contain numerous morphemes with functions such as derivation, inflection for tense-aspect and marking of nominal arguments.
Examples are cross-reference of the subject and object by means of class- (or person-/number-) specific markers, locative affixes, morphemes distinguishing verb forms in clause-final and non-final position, negation morphemes and so forth. In this article we concentrate on the so-called verb extension morphemes (Poulos and Msimang 1998: 183-207). As is the case with most Bantu languages, the complex verb morphology of Zulu is characterised by the use of so-called verb extensions to extend or adapt the meaning of a particular verb. By means of a verb extension or a combination of extensions "definite variations of meaning are derived, variations which in English can only be made by the use of auxiliary verbs, adverbs or prepositions" (Doke 1973: 135).
In the inflectional morphology of Zulu the basic meaning of a verb root may therefore be modified by suffixing one or more extension morphemes to the verb root. It is significant that the verb root -phind- may use 22 different combinations of verb extensions of which 6 feature as headwords in the Zulu-English Dictionary (ZED) (1964: 662-663). In the outer matter (ZED 1964: ix), it is indicated that separate entries have been made for "verbal derivatives" (extended verb stems) that "convey some meaning or idiomatic usage not deducible from the inherent significance of the derivative form", e.g.
(2a) -hamb-a 'travel, move along'
(2b) -hamb-el-a 'visit, be on good terms with'
In other cases, where the "inherent significance of the derivative form" is easily deducible from the basic verb stem, the derivative forms are listed in brackets after the entry of the basic form, e.g.
(3) -pikiz-a 'wriggle about, waggle' (pass. -pikizwa; ap. -pikizela; caus. -pikizisa)
According to Wilkes (1971: 261) there is theoretically no limit to the number of verb extensions that may be suffixed to a verb root. However, the database of over 6000 examples collected for his study (Wilkes op. cit.) contained very few examples with more than three verb extensions being used simultaneously.
In summary, verb extensions are a key feature of Zulu verbs and their meanings and have to be accounted for in an LKB for Zulu, both in terms of the easily deducible meanings and also the lexicalised and idiomatic usage.

Morphological challenges
Within a rule-based approach to morphology, the following are examples of morphological challenges (morphotactics and morphophonological alternation rules) that are encountered with regard to verb extensions:
(a) Some basic verb roots resemble extended verb roots, e.g. the verb root -hlangan- 'come together; unite; connect' in which the morpheme -an- resembles the reciprocal extension. In this case it is not an extension but part of the verb root.
(b) Rule-based palatalisation occurs in the formation of passives when the final syllable of a verb root begins with a bilabial consonant, also when such a verb root is separated from the passive extension -w- by another extension. Occasionally, however, idiosyncrasies occur when bilabials appearing elsewhere in the verb root are palatalised.

(6b) -akh-el-w-an-a
-verb.root-appl.ext-pass.ext-recip.ext-terminative
'be built for each other' (cf. Van Eeden 1956: 657)
It should be noted that we do not deal separately with verb roots that end in -k- and -l- and which are subject to varying modifications in the formation of the causative (e.g. -vuk-is-a > -vu-s-a; -vel-is-a > -ve-z-a). The reason is that such extended roots are lemmatised as such in most dictionaries, e.g. Dent and Nyembezi (1969: 506-507) contains the entries -vuka (v) 'wake up; rise up' and -vusa (v) 'awaken; rouse up; warn against danger; lift up'.

Semantic challenges
Whereas the basic meaning of verb roots is easily accessible from existing dictionaries, the semantic challenge lies in the extended or lexicalised meanings that come about when the verb root is extended by means of a variety and combination of verb extensions. In most grammatical descriptions of the Bantu languages, verb extensions are considered to be inflectional suffixes since "they do not change the word category to which a word belongs, but add a regular, predictable meaning to the word" (Kosch 2006: 109). The predictable meanings of extended verb roots can be summarised as in Table 1.
The applicative extension is also used to indicate "in a direction" when followed by a noun indicating location, e.g.
(9) -gijim-el-a ezintabeni
-verb.root-appl.ext-terminative
'seek shelter in the mountains'
An interesting case is found with the meanings of the verbs -khohla 'escape from the memory, slip from the memory' and -khola 'satisfy, have confidence in', in the sense that they are unexpectedly used as transitive verbs in the passive, e.g.
(10) -khohl-w-a 'forget, overlook'
-khol-w-a 'be satisfied, believe in'
The predictable versus lexicalised meaning phenomenon has been considered from various perspectives that are important for our computational approach to the lexical semantics of Zulu verbs.
On the one hand, the predictable nature of meaning has been documented and provides justification for us to include such regularity in our computational model of Zulu verbal lexical semantics through the "standard" (rule-based) semantic annotation of verb extensions in ZulMorph. According to Wilkes (1971: 50-51) the adding of a verb extension in Zulu does not imply a radical modification of the lexical-semantic aspect of a verb since this remains basically the same. The modification that takes place is that of the manner in which a process progresses or is executed, while the nature of the process remains unchanged. In cases of combinations of verb extensions being suffixed to a verb, it is only the first suffix after the basic verb root that modifies it. Each of the following extensions in turn modifies the foregoing modification (extended root). This modification process is demonstrated in Figure 1. We return to this sequencing of extensions and their "composite" meanings in Section 5.1.
On the other hand, Chabata (1998: 146) points out that verb extensions in the Bantu language Shona are considered to be derivational morphemes and not inflectional morphemes, one of the reasons being that "they usually change the meanings of the verb roots in question in highly significant ways".This suggests that there is good reason to also make provision in our Zulu LKB for verb extensions to have "unpredictable" lexicalised meanings.These are not systematic and cannot be captured by means of rules.They have to be found individually mainly through corpus-based approaches and added to the LKB as part of its maintenance and continued enhancement.

Computational Zulu verb morphology and lemmatisation
Before providing the essential details of ZulMorph as the basis for a basic Zulu LKB, we develop the core notion of Zulu word sense pair, in this case for verbs.

What is the lemma and word sense pair of a Zulu verb?
We start by illustrating, by means of an example, what a word sense pair, i.e. a (lemma, meaning) pair, in English is. We then use this to explicate the notion of Zulu word sense pair: a word sense pair in which the lemma is in Zulu and its meaning is the English translation equivalent.
Example 1: 'travels' is a word in the English sentence 'He travels to Johannesburg.' The appropriate meaning of 'travels', according to the Princeton WordNet, is "undertake a journey or trip". The lemma of 'travels' is 'travel' and therefore the English word sense pair is (travel, undertake a journey or trip).
But what constitutes the lemma and the Zulu word sense pair of a Zulu verb? Four important aspects have to be addressed:
(i) Lemmatisation via morphological analysis: a standard approach to lemmatisation is through computational morphological analysis (Jurafsky and Martin 2009: 645). For Zulu, the complex agglutinative morphological structure of a Zulu verb includes, amongst others, the verb root and its verb extensions. For the purposes of Zulu verbal lexical semantics, the verb root together with its extensions, i.e. the extended verb root, constitutes the lemma of the Zulu verb. This decision is based on the insight that the lexical semantics of the Zulu verb is determined by the verb root AND its extensions since, as we have seen in Section 3, the extensions are meaning changing suffixes to the root. This aspect is addressed in Section 4.3;
(ii) Assigning a meaning in the form of its English translation equivalent to the verb root. This aspect is addressed in Section 4.4;
(iii) Assigning English meaning(s) to the verb extensions so that they can be combined (composed) with the meaning of the verb root. This aspect is addressed in Section 4.4;
(iv) Combining the information in (ii) and (iii) to yield a lemma and a word sense pair for any given Zulu verb in which the meaning is provided as the English translation equivalent of the Zulu lemma, English being our pivot language.
[Figure 1: successive modification of a verb root by its extensions; surviving glosses: 'let see for', 'show for'; 'let see for each other', 'show for each other']
But how is this composite meaning of the Zulu lemma, as defined in (i), obtained?
In Table 1 of Section 3.2 the predictable meanings of the respective verb extensions are given and the question now arises as to how a sequence of meanings is combined into one meaning for the extended verb root as Zulu lemma. To answer this question we rely on the left-associative compositional nature of the meaning of the verb root and its sequence of extensions, as already documented by Wilkes (1971) (see Section 3.2 and Figure 1). We illustrate this by means of Example 3. Although we primarily base our modelling of the "composite" meaning on the predictable meaning of extensions, we also attend to lexicalised meaning where relevant.
Example 2: Consider the word uyahamba 'he travels'. Through morphological analysis (see example (2a)) we obtain the lemma hamb. The appropriate word sense pair is (hamb, travel). In further applications the lemma hamb may then be linked to the original Princeton WordNet sense "undertake a journey or trip" via the English translation equivalent 'travel'. For the words uyahambisa and uyahambela the word sense pairs are (hambis, cause to travel) and (hambel, travel on behalf of/travel towards) or (hambel, visit) if the lexicalised meaning of (2b) is used. Similarly, linking to the Princeton WordNet could further yield (hambis, cause to undertake a journey or trip) and (hambel, undertake a journey or trip on behalf of) or (hambel, go to certain places as for sightseeing).
Example 3: For the word uyahambelisa the lemma is hambelis. Its meaning is obtained by composing the respective meanings from the left, as shown by means of the bracketed representation of the lemma (((hamb)el)is): 'cause to' (meaning of hambel) => 'cause to travel on behalf of/travel towards', or 'cause to visit' if the lexicalised meaning is used. This yields the Zulu word sense pair (hambelis, cause to travel on behalf of/travel towards) or (hambelis, cause to visit). As before, this may then be further expanded via the Princeton WordNet to (hambelis, cause to undertake a journey or trip on behalf of) or (hambelis, cause to go to certain places as for sightseeing).
For any verb in Zulu, we are now able to conceptually provide its Zulu word sense pair.In subsequent sections we show how this lexical semantic information is computationally obtained and encoded in ZulMorph as basic Zulu LKB.
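The left-associative composition of Examples 2 and 3 can be sketched in a few lines of Python. This is an illustration only, not the ZulMorph implementation: the segmentation into root and extensions is supplied by hand, the glosses 'for', 'cause to' and 'each other' are those used in the examples of this article, and the template format and function names are assumptions made for the sketch.

```python
# Templates for the predictable meanings of three extensions. The glosses
# follow the examples in the text; the template mechanism itself is an
# assumption of this sketch, not part of ZulMorph.
EXTENSION_TEMPLATES = {
    "el": "{} for",         # applicative
    "is": "cause to {}",    # causative
    "an": "{} each other",  # reciprocal
}


def compose_meaning(root_meaning, extensions):
    """Apply each extension's template to the meaning built so far."""
    meaning = root_meaning
    for ext in extensions:  # left to right: (((root)ext1)ext2)...
        meaning = EXTENSION_TEMPLATES[ext].format(meaning)
    return meaning


def word_sense_pair(root, root_meaning, extensions):
    """Return (lemma, meaning); the lemma is the extended verb root."""
    lemma = root + "".join(extensions)
    return lemma, compose_meaning(root_meaning, extensions)
```

Under these assumptions, word_sense_pair("hamb", "travel", ["el", "is"]) builds the lemma hambelis by first glossing hambel and then wrapping that gloss with the causative, mirroring the bracketing (((hamb)el)is).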

ZulMorph
It is well known that the coverage of a finite state morphological analyser such as ZulMorph is determined by (i) the accurate and complete modelling of the morphological structure of the language, and (ii) the comprehensiveness of the noun stem and verb root lexicons. Only valid Zulu words, of which the noun stems or verb roots are present in the respective lexicons, can be analysed correctly. For such a morphological analyser to be maximally useful, these stem and root lexicons need to be maintained and extended as new words enter the language. This remains ongoing work.
In principle, the cascading continuation classes of morpheme lexicons model the filling of slots in the morphological structure of the verb. However, the slots that we are interested in here are those for the verb root and its extensions, since together these constitute the lemma. While the order of the verbal prefixes is fixed (cf. Poulos and Msimang 1998: 305), this is not the case for the extensions. There is no fixed order or number since these are semantically determined. Indeed, as discussed in Section 2, the various verb extensions are not compatible with all verb roots, and there are no hard and fast rules that determine the possible combinations, i.e. roots with extensions, as well as extensions with one another. Comprehensive information on these combinations is not available; not even paper dictionaries provide complete information on combinations and sequences for all verb roots. The inclusion of such "idiosyncratic" information about verb roots and their (semantically) valid extensions in ZulMorph further emphasises its role as one of the most comprehensive computational models yet of Zulu morphology.

Modelling the Zulu verb lemma
Before explaining the computational modelling of the Zulu verb lemma, we return to the morphological challenges of Section 3.1 and how we address them. Challenge (a) concerns the common ambiguity of human language for which no real solution exists except to deal with it through semantic context-based disambiguation at a later stage of processing; at the morphological level such limited over-generation will thus occur. Challenge (b) is non-rule-based and is met by hand-crafting the analyser to accurately model all the individual known cases.
Challenge (c) is closely related to aspect (i) in Section 4.1 and is the core of this section. In modelling verbs and their lemmas in ZulMorph, we make provision for different possibilities: a known basic root with no extensions, a known basic root with its own attested sequence(s) of extensions and a known basic root with an as yet unattested (i.e. new) sequence of extensions. Verbs based on basic roots that are not included in ZulMorph will not be analysed. As we shall see in this section, the distinction between morphology and the root lexicon becomes somewhat fuzzy in the case of the Zulu verb and its extensions in that the attested extension sequences of any specific basic verb root should be marked on the relevant basic root and thereby become part of the "lexicon".
In order to describe the modelling of the Zulu verb lemma and its meaning, we briefly explain the notion of Lexicon in lexc, as well as the technical use of so-called flag diacritics in both the Xerox toolkit and Foma.We show how they are used to record information about the verb lemma in lexc.

The verb root lexicon, extension sequences and flag diacritics
In order to keep explanations short, an example is used instead of trying to explain the technical details in a more general setting. The example lexc script for the root -hamb- is given in Appendix A. As a code fragment for explanatory purposes, it does not, for example, show how verbal prefixes are modelled. It consists of broadly four sections: the preamble in which certain so-called multicharacter symbols are declared, the verb root lexicon that typically contains thousands of roots, but for the example contains only the entry -hamb-, the modelling of the verb extensions and finally the morpheme lexicon containing the verb terminative -a. Each section is briefly discussed.
In the preamble two tags, [ATT] and [NEW], are declared for distinguishing between attested extension sequences for -hamb- and possibly newly discovered ones in the output produced by ZulMorph, as well as a number of so-called flag diacritics that are used to mark the attested extension sequences of any particular verb root in the verb root lexicon (LEXICON VRoot) in the second section. This lexicon contains various entries for the verb root -hamb-, each annotated with a P flag diacritic that encodes the specific attested extension sequence. It also shows the next continuation class (LEXICON VExt) containing the morpheme lexicon from which the next morpheme in the input verb should be matched. The third section shows the morpheme lexicons of the next morpheme(s) (extensions) that may follow the basic verb root in accordance with the structure of the verb. As mentioned before, we distinguish between the basic root with no extensions, attested and new extensions. This is modelled by LEXICON VExt and its continuation classes VerbTerm, VExtAttested and VExtNew. In lexicon VExtAttested the R flag diacritic is used to match precisely the attested extension sequence that was marked by the corresponding P flag diacritic in the verb root lexicon entry. The lexicon VExtNew and the cyclic lexicon VExtNew2 model any new extension sequence of arbitrary length. The fourth section shows the last continuation class, LEXICON VerbTerm, which models the final verb terminative morpheme, here -a, followed by # to indicate that no further (input) morphemes may follow.
While the attested extension sequences are precise and correct, the cyclic modelling of the recognition of new as yet unattested sequences may cause over-generation in that any arbitrary (finite) sequence of extensions, even sequences that are semantically not plausible, will be recognised.This implementation is specifically useful for the purposes of mining new sequences of extensions from a corpus with the understanding that any new sequence will be subjected to human elicitation before inclusion in ZulMorph as an attested sequence.
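The distinction between basic, attested and new extension sequences can be mimicked outside lexc with a simple lookup. The Python sketch below is not how ZulMorph works internally (there the matching is done by P and R flag diacritics inside the finite state network); it only illustrates the decision logic. The attested set for -hamb- is limited to the two entries visible in the code fragments of this article, and the extension inventory is restricted to the four extensions named in the examples; both are assumptions of the sketch.

```python
# Extensions named in the examples of this article (a subset; the article
# counts 7 extensions in total).
KNOWN_EXTENSIONS = {"el", "is", "an", "w"}

# Attested extension sequences per basic root. Only the basic form and the
# applicative sequence of -hamb- are listed here; a real lexicon would hold
# many more roots and sequences.
ATTESTED = {
    "hamb": {(), ("el",)},
}


def tag_sequence(root, extensions):
    """Classify an extension sequence for a root.

    Returns "BASIC" for a known root without extensions, "[ATT]" for an
    attested sequence, "[NEW]" for an unattested sequence of known
    extensions, and None when the root or an extension is unknown
    (i.e. no analysis).
    """
    seq = tuple(extensions)
    if root not in ATTESTED:
        return None
    if any(ext not in KNOWN_EXTENSIONS for ext in seq):
        return None
    if not seq:
        return "BASIC"
    return "[ATT]" if seq in ATTESTED[root] else "[NEW]"
```

As in the cyclic lexicon, any sequence of known extensions is accepted as [NEW], including implausible ones, which is why such candidates need human checking before they are promoted to attested status.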
ZulMorph contains 8 031 basic roots and 28 477 (extended) verb roots with attested extension sequences, bringing the number of entries in the verb root lexicon of ZulMorph to approximately 36 000. From the extensive data harvested from available paper dictionaries, grammar textbooks and other paper resources, 113 different extension sequences were identified, with the first 20 most frequent sequences (see Appendix B) representing more than 97% of all attested extensions. Statistics about the number of extensions per basic verb root are provided in Appendix C. We note that 22 of them allow between 20 and 30 combinations of one or more verb extensions. The number of lexicalised headwords, as recorded by Doke and Vilakazi (1964), is given in brackets. For example, the basic verb root in ZulMorph with the largest number of extensions, viz. 30, is -fan- ('resemble'). The basic root -bon- ('see') has 28 extension sequences. Moreover, ZulMorph contains 6 153 basic verb roots that have at least one attested extension and 1 878 that have no attested extensions. In Appendix D we list the basic verb roots that have the longest attested extension sequences, as recorded in ZulMorph, for example:
(11) -ling-an-is-el-a 'equalise for/make equal for'
The extensive coverage of both Zulu morphology and its verb roots, basic and extended, in ZulMorph provides the basis for the LKB of the next section.
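The counts quoted above can be cross-checked with a little arithmetic. The short Python sketch below uses only the figures given in the text; the derived average is our own reading of those figures (each extended root counted as one attested sequence of its basic root), not a statistic reported by the authors.

```python
basic_roots = 8031        # basic verb roots in ZulMorph
extended_roots = 28477    # extended roots with attested extension sequences
roots_with_ext = 6153     # basic roots with at least one attested extension
roots_without_ext = 1878  # basic roots with no attested extension

# Total entries in the verb root lexicon ("approximately 36 000" in the text).
total_entries = basic_roots + extended_roots

# The two groups of basic roots should partition the basic root inventory.
partition_ok = roots_with_ext + roots_without_ext == basic_roots

# Assumed interpretation: mean number of attested sequences per basic root
# that has at least one.
mean_sequences = extended_roots / roots_with_ext
```

The total comes to 36 508 entries, the two groups of basic roots add up to 8 031, and, on the stated interpretation, a basic root with extensions has between four and five attested sequences on average.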

Hand-crafting a basic LKB for Zulu
Hand-crafting a basic LKB for Zulu consists of a systematic and comprehensive usage of the expert knowledge that has been published and made available for Zulu. Three kinds of information need to be encoded: firstly the morphology, secondly the Zulu lemmas and thirdly their meanings. Since ZulMorph is an accurate model of Zulu morphology and its comprehensive coverage of Zulu verb lemmas was addressed in the previous section, we now turn our attention to the acquisition and inclusion of their meanings using a so-called expert knowledge-based approach, as already alluded to in Section 4.1 (ii)-(iv). More specifically, a meaning in the form of an English translation equivalent is assigned to each verb root and its extensions. While our main focus is on predictable meaning as a first step, lexicalised meaning is also considered.

Representing the meaning of the lemma
The first step in adding meaning is to include the English translation equivalent of each basic verb root in the VRoot lexicon and the predictable meaning of each extension in the VExtAttested, VExtNew and VExtNew2 lexicons. For example, the code fragments
hamb(travel)@P.Basic.ON@:hamb@P.Basic.ON@ VExt;
hamb(travel)@P.ExtEL.ON@:hamb@P.ExtEL.ON@ VExt;
and
an(each_other)[RecipExt]:an VExtNew2;
el(for)[ApplExt]:el VExtNew2;
is(cause_to)[CausExt]:is VExtNew2;
el(for)[ApplExt]@R.ExtEL.ON@:el@R.ExtEL.ON@ VerbTerm;
yield analyses for which the respective word sense pairs are (hambel, travel for/travel towards) and (hambelisan, cause to travel for/towards each other). Note the composite meaning in the latter pair. By adding basic meanings to the 8 031 basic verb roots and by including the predictable meanings of the various extensions (7 in total) we are able to provide not only a first approximation of the meaning of each of the ~36 000 entries in the verb root lexicon, but also produce word sense pairs for all the Zulu verbs that are based on these basic roots. Keeping in mind that the extensive Princeton WordNet for English has 11 529 verbs, the ZulMorph coverage of the Zulu extended verb root semantics is significant and can already be used in applications, as alluded to in Section 1.
Adding lexicalised meaning is the most resource intensive part of endowing ZulMorph verb analyses with accurate lexical semantics since it has to be added manually for each verb root individually. For each basic verb root and a particular extension sequence for which a lexicalised meaning is available, the meaning of the basic root is replaced by the lexical meaning of the extended root while the meaning of the extension that caused the lexicalisation is no longer explicit. The tag [LEX] shows that lexicalisation has occurred. As before, the predictable meanings of any subsequent extensions, if present, are still shown. By way of example we consider the extended root -hambel-, which also has the lexicalised meaning of 'visit'. Therefore, the verb root lexicon entry is as follows:
hamb(visit)[VRoot]el[ApplExt][LEX]@P.Lex.ON@@P.Basic.ON@:hambel@P.Lex.ON@@P.Basic.ON@ VExt;
and yields analyses for which the resulting word sense pair is (hambel, visit).
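The interplay between lexicalised and predictable meaning can be sketched as follows in Python. The sketch is illustrative and makes several assumptions that are not taken from ZulMorph: lexicalised meanings are stored in a dictionary keyed by root and extension sequence, the longest lexicalised prefix of an extension sequence is preferred, and the remaining extensions contribute their predictable glosses through simple templates.

```python
# Predictable glosses, as in the lexc fragments above; the template format
# is an assumption of this sketch.
EXTENSION_TEMPLATES = {
    "el": "{} for",         # applicative
    "is": "cause to {}",    # causative
    "an": "{} each other",  # reciprocal
}

# Lexicalised meanings keyed by (basic root, extension sequence). Only the
# single case discussed in the text is listed.
LEXICALISED = {
    ("hamb", ("el",)): "visit",
}


def meaning_with_lexicalisation(root, root_meaning, extensions):
    """Return (lemma, meaning), preferring lexicalised meanings."""
    seq = tuple(extensions)
    meaning, remaining = root_meaning, seq
    # Assumed policy: use the longest lexicalised prefix of the sequence.
    for cut in range(len(seq), 0, -1):
        lexical = LEXICALISED.get((root, seq[:cut]))
        if lexical is not None:
            meaning, remaining = lexical, seq[cut:]
            break
    # Subsequent extensions still contribute their predictable meanings.
    for ext in remaining:
        meaning = EXTENSION_TEMPLATES[ext].format(meaning)
    return root + "".join(seq), meaning
```

With this policy the extended root hambel receives the meaning 'visit', while a following causative extension is still glossed predictably, which gives 'cause to visit' for hambelis as in Example 3.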
In summary, by annotating each entry in the verb root lexicon with its meaning (either predictable or lexicalised) and by providing the meanings of the 113 extension sequences, the morphological analysis of any Zulu verb will contain sufficient semantic information to support a basic notion of semantic linking or interoperability -a possibility that did not exist before.

Enhancing the Zulu LKB through a corpus-based approach
Improving and updating an electronic LKB to keep it current and maximally http://lexikos.journals.ac.za useful, specifically for an under-resourced language such as Zulu, is essential for its digital (web) presence, as discussed in Section 1. Having exploited available paper resources such as dictionaries, grammar textbooks, wordlists and terminologies etc., the obvious next step is to "mine" electronically available language corpora for new lexical information to add to ZulMorph.Such lexical information includes new verb roots, new extension sequences, and new (as yet unrecorded) lexicalised meanings of extended roots as they occur in authentic language use.For this purpose we propose in this section a semi-automated corpus-based approach to the extraction of new lexical information about verbs.
By way of example, the SAC (parallel English and Zulu versions) that has been sentence-aligned is used. It was chosen mainly for four reasons: firstly, it is publicly available in all the official South African languages; secondly, it is assumed to have been professionally quality assured; thirdly, it is by its very nature well-structured and lends itself to accurate sentence alignment; and fourthly, it uses contemporary formal language. The idea is that this process should be continued as new parallel corpora become available in due course.
The extraction of bilingual lexical information from bitexts 14 has a long tradition. Tiedemann (2011) provides an overview of techniques that may be applied for this purpose. Although he focuses on statistical approaches to word alignment, he also briefly discusses a number of non-statistical techniques for lexicon extraction from bitexts (Tiedemann 2011: 100-102). While automatic word alignment "is just too noisy to be useful for qualitative investigations", these non-statistical techniques "focus on the extraction of reliable translation equivalents", usually emphasising high precision links between words and multi-word units.
The approach that we follow in this article may also be seen as such a non-statistical technique aimed at high precision.
New basic verb roots lead to morphological analysis failures. Through human elicitation and by individually considering these failures, new basic roots are identified and added to ZulMorph, together with their English translation equivalents. Alternatively, we could apply the guesser variant of ZulMorph to the failures and in this way obtain new verb root candidates. These also need to be subjected to human linguistic scrutiny before they are added to ZulMorph. The occurrence of a new extension sequence is tagged in the morphological analyses of a verb as [NEW]. Such a sequence is then manually checked and added to ZulMorph, as shown in Section 5. For additional attested sequences for specific basic verb roots, essentially the same procedure is followed.
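The triage just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the analyser output is assumed to be available as a mapping from each surface word to a list of analysis strings (an empty list standing for an analysis failure), the function name triage is hypothetical, and all tag names in the example other than [VRoot] and [NEW] are invented for the purpose of the illustration.

```python
def triage(analyses_by_word):
    """Split analyser output into two review queues for a human linguist."""
    failed_words = []    # no analysis at all: may contain a new basic root
    new_sequences = []   # analyses carrying the [NEW] tag
    for word, analyses in analyses_by_word.items():
        if not analyses:
            failed_words.append(word)
            continue
        for analysis in analyses:
            if "[NEW]" in analysis:
                new_sequences.append((word, analysis))
    return failed_words, new_sequences


# Hypothetical analyser output for three placeholder words.
output = {
    "w1": [],                                        # analysis failure
    "w2": ["xox[VRoot]is[CausExt]an[RecipExt][NEW]"],  # unattested sequence
    "w3": ["hamb[VRoot]el[ApplExt]"],                  # already covered
}
failed, new = triage(output)
print(failed)
print(new)
```

Both queues still require human linguistic scrutiny before anything is added to the lexicon; the sketch only organises the material for that step.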
For the extraction of new (lexicalised) meanings and (extended) roots as they occur in authentic language use we employ bitexts; it is here that the sentence-aligned parallel corpus plays a central role. For each sentence pair we may proceed as follows:
1. Perform part of speech (POS) tagging of the English sentence. For this purpose we used TreeTagger 15.
2. Perform a morphological analysis of the Zulu sentence, using ZulMorph.
3. Isolate the verbs in the English sentence using the POS tags, and the verb roots and their extensions in the Zulu sentence using the morphological analysis tags for the verb root and its verb extension, and align these (the POS tags and morphological tags). This directly links the English lemma 16, i.e. the new (lexicalised) meaning, which is our translation equivalent for the new Zulu word sense pair, and the Zulu (extended) verb root, the Zulu lemma in our new word sense pair.
4. Add the information to ZulMorph so that it includes the new Zulu word sense pair.
In this semi-automated process steps 1 and 2 are automated while steps 3 and 4 as yet require manual intervention. Specific examples that have been extracted in this way are shown in Tables 2-6 17.
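Although step 3 is performed manually, the generation of candidate pairs for human review could be supported along the lines of the following Python sketch. It is an illustration only and not part of the reported procedure. It assumes that the tagger output is available as (token, POS tag, lemma) triples in which verb tags start with "V", and that an analysis string has the shape root(gloss)[VRoot] followed by extension morphemes each tagged [...Ext]; the function names, the [CausExt] tag, the example tokens and the glosses are hypothetical.

```python
import re

# Assumed analysis shape: root(gloss)[VRoot] followed by ext[...Ext] pairs.
ROOT_RE = re.compile(
    r"([a-z]+)(?:\([^)]*\))?\s*\[VRoot\]((?:\s*[a-z]+\s*\[\w+Ext\])*)"
)
EXT_RE = re.compile(r"([a-z]+)\s*\[\w+Ext\]")


def extended_root(analysis):
    """Return the Zulu lemma (root plus extensions), or None if no verb root."""
    match = ROOT_RE.search(analysis)
    if match is None:
        return None
    return match.group(1) + "".join(EXT_RE.findall(match.group(2)))


def english_verb_lemmas(tagged_sentence):
    """Keep the lemmas of tokens whose POS tag marks a verb (assumed 'V...')."""
    return [lemma for _, pos, lemma in tagged_sentence if pos.startswith("V")]


def candidate_pairs(tagged_sentence, zulu_analyses, known_pairs):
    """Propose (Zulu lemma, English lemma) pairs not yet recorded, for review."""
    lemmas = english_verb_lemmas(tagged_sentence)
    roots = [r for r in map(extended_root, zulu_analyses) if r is not None]
    return [(r, l) for r in roots for l in lemmas if (r, l) not in known_pairs]


# Hypothetical aligned sentence fragment.
english = [("to", "TO", "to"), ("impart", "VV", "impart"), ("skills", "NNS", "skill")]
zulu = ["dlul(pass)[VRoot]is[CausExt]"]
known = {("dlulis", "send past")}
print(candidate_pairs(english, zulu, known))
```

For sentences with several verbs the cross product over-generates, which is why the decision about which pair is a genuine translation equivalent remains with the human expert.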
In Table 2 we demonstrate how a new lexicalised meaning 'impart' has been detected for -dlulis- in the verb alignment process. In sentence <s103> 18 the English verb 'impart' links up with the extended Zulu verb root -dlulis-, forming a new lexicalised addition to those already listed for -dlulisa in the ZED (1964: 162), namely 'cause to pass; carry past, send past …'. Verb alignment between the new lexicalised meaning 'impart' and the Zulu lemma dlulis therefore results in the new word sense pair (dlulis, impart).
Table 2: New lexicalisation of Zulu lemma dlulis

Table 3 demonstrates verb alignment between the new lexicalised meaning 'limit' and the Zulu lemma nciphis to form a new word sense pair (nciphis, limit). The English verb 'limit' in sentence <s286> links up with the extended Zulu verb root -nciphis- and produces a new lexicalised supplement to those already listed for -nciphisa in the ZED (1964: 532): 'diminish; make small, less; minimize'.
Table 4: New lexicalisation of Zulu lemma bhalis

In Table 5 it becomes clear how a new lexicalised meaning 'affirm' has been identified for the Zulu lemma qinisekis in the verb alignment process. In sentence <s51> the English verb 'affirm' links up with the extended Zulu verb root -qinisekis-, forming a new lexicalised addition to those already listed for -qinisekisa in the OZSD (2010: 198): 'make sure; make certain'. Verb alignment between the new lexicalised meaning 'affirm' and the Zulu lemma qinisekis therefore results in the new word sense pair (qinisekis, affirm), which could also qualify for inclusion in a dictionary such as ZED, where -qinisekisa has not yet been listed as headword. The same procedure applies to the occurrences of the extended Zulu verb root -qinisekis- in <s45> and <s157> respectively, resulting in two further new word sense pairs (qinisekis, ensure) and (qinisekis, secure).

ku-hlinz-ek-el-w-a 'be provided for' <s174>
zi-hlinz-ek-el-w-e 'be provided for' <s1434>
ku-nga-hlinz-ek-el-w-a 'may be provided for' <s1629>
Comments: -hlinz-ek-el-w- is not listed as headword in the ZED (1964), nor is the extension string -ek-el-w- listed under the entry -hlinzeka "get skinned, murdered, operated upon … prepare food for expected visitor" (ZED 1964: 329). The extension sequence -ek-el-w- in combination with the verb stem -hlinza does not occur in the monolingual ISZ (2006: 486), and it is also not an attested combination in ZulMorph.

Conclusion and future work
We have shown how ZulMorph, a comprehensive hand-crafted finite state morphological analyser for Zulu, and a small electronically available parallel English-Zulu corpus, namely the South African Constitution (SAC), which is an official document of the highest order, translated into all official languages, can enrich Zulu lexical semantics with English as pivot language.
While our approach to enhancing ZulMorph to produce Zulu word sense pairs applies to all word categories, our focus was on the verb as the morphologically most complex word category in Zulu. This complexity arises mainly from (sequences of) verb extensions that are suffixed to the basic verb root to produce modified or new verb meanings. We noted that although a morphological analyser may provide accurate morphological analyses of Zulu verb constructions, these analyses do not offer much information in terms of the meaning of the verb. This constitutes a major impediment to a computational understanding of what a Zulu verb means, and therefore also to applications such as, for example, information extraction from Zulu text, question answering in and from Zulu, machine translation between Zulu and any other language, and Zulu natural language generation. In this article we presented a Zulu LKB that uses the well-resourced English language as pivot language towards addressing this challenge.
It is important to note that for a language such as Zulu (morphologically complex and under-resourced) statistical and machine learning approaches have not yet yielded sufficiently accurate results for the applications mentioned above. Recent experience has shown that building the necessary high-quality, sufficiently large electronic corpora for Zulu is more difficult and expensive than handcrafting ZulMorph. This is clear from the fact that ZulMorph actually exists, while no corpus-driven statistical approach to Zulu computational (verb) morphology has, as yet, yielded results that are comparable to those of ZulMorph. It is our view that the Zulu LKB that we have reported on in this article has the potential to serve as an important and novel component in future hybrid systems (robust combinations of handcrafted, rule-based, statistical and data-driven machine learning approaches) for Zulu lexical semantics.
Our core contribution is twofold:
- the enhancement of ZulMorph to constitute a large basic LKB for Zulu that, for any input verb, produces a word sense pair consisting of the Zulu lemma of the verb (here the extended root) and its meaning (here its English translation equivalent). The meaning is computationally composed from the meaning of the root and the predictable meaning of its verb extensions;
- a proposed semi-automated corpus-based approach in which existing NLP tools, viz. TreeTagger and ZulMorph, together with the electronically available sentence-aligned English-Zulu parallel corpus, are used to expose new verb roots, new extension sequences and new lexicalisations of existing verbs and their extensions for addition to the Zulu LKB.
Future work may include increasing the automation of the process while also extending the process to other word categories to offer a more comprehensive Zulu LKB. We also envisage using further parallel English-Zulu corpora across a variety of domains as they become available to extend ZulMorph and the Zulu LKB, and eventually experimenting with the use of the Zulu LKB in some of the mentioned applications. In the longer term we may consider developing LKBs for other languages for which finite state morphological analysers are available.
The canonical or so-called citation form of a surface word form. For example, write is the lemma of the surface forms writes, wrote and written (cf. Section 3).
2. Lexicalisation is also discussed in detail in subsequent sections.
4. For the sake of convenience a verb root followed by one or more extensions is called an extended root in this article.
5.
6. A word is taken to be a surface word form as found in a sentence or an utterance; a lemma is a specific grammatical form of a word, often also referred to as citation form or canonical base form; lemmatisation is the process of mapping a word to a lemma; meaning is the denotation, referent, or idea associated with a word; and a translation equivalent is a corresponding word or expression in another language (see, for example, Jurafsky and Martin 2009: 645; Gurevych et al. 2016: 1).
In English the canonical base form of the verb (travel, travels, travelling, travelled) is 'travel'.
9. While we consistently use the hyphen (-) to indicate morpheme boundaries, we view the lemma as an entity that can stand on its own in the context of a word sense pair; the notion of morpheme boundary is therefore not important there and is not indicated.
10. A discussion of such applications falls outside the scope of this article.
11. The detailed explanation of the lexc and xfst languages falls outside the scope of the article. The interested reader is referred to Beesley and Karttunen (2003).
12. Flag diacritics provide a light-weight approach to feature-setting and feature-unification operations for enhancing modelling accuracy and runtime efficiency. Specific uses are to enforce separated dependencies and mark idiosyncratic morphotactic behaviour (see Beesley and Karttunen 2002 for a comprehensive exposition). In lexc and xfst flag diacritics are so-called multicharacter symbols with a distinctive spelling: @operator.feature.value@ and @operator.feature@, where the operators are P (positive (re)setting), N (negative (re)setting), R (require test), D (disallow test), C (clear feature) and U (unification test). The features and values are specified by the user. In ZulMorph flag diacritics are used extensively to, amongst others, model the Zulu noun class system (Bosch and Pretorius 2002; Pretorius and Bosch 2003), long distance dependencies (Pretorius and Bosch 2008), part of speech information and a wide variety of other morphotactic constraints that apply in Zulu. In this article the focus is on their use for annotating each basic verb root with its valid and attested extension sequences.
13. The morphological tags, enclosed in [ and ], are listed in Appendix E.

Figure 1: Left-associativity of the compositional meaning of the extended verb root -boniselan-, with 'let see' lexicalised as 'show'.

Table 3: New lexicalisation of Zulu lemma nciphis

Verb alignment between the novel lexicalised meaning 'register' and the Zulu extended verb root -bhalis- is shown in Table 4. A new word sense pair (bhalis, register) is created for possible inclusion in dictionaries (e.g. ZED and isiZulu.net) where -bhalisa has not yet been listed as headword. It should be noted, however, that the SZD (1969: 309) lists -bhalisa as headword with the meaning 'put name on waiting list', while the OZSD (2010: 18) does in fact list -bhalisa with the meaning 'register'.

Table 7: New root -chibiyel- identified from parallel bilingual SAC corpus

Table 8a: New extension sequence -is-an- for -xox- identified from parallel bilingual SAC corpus