A General Lexicographic Model for a Typological Variety of Dictionaries in African Languages

94-115


Introduction
This article is concerned with the design of a lexicographic model, that is, a model of a data structure capable of storing lexicographic data, which will subsequently be used to compile several types of prototypical dictionaries for a selection of African languages¹. We keep in mind that there are no hard and fast rules for any typological model, but rather that different types of dictionaries may have certain features in common (Gouws and Prinsloo 2005: 45). In the last few years, several such lexicographic data collection models have been published; the most general of all is the ISO standard for lexicography (ISO 24613:2008). This "Lexical Markup Framework²" (LMF) forms the background for several existing lexicographic data collections. A data collection model is not a database as such, but is defined as a standoff-XML-formatted framework of a number of files plus several external sources, each describing a different aspect of the dictionary that is compiled from them. For example, general data such as language or language coding is included, but also microstructural data related to lemma signs, such as information about their part of speech or their orthography. Concerning the possibilities to connect with other sources of information, we agree with Spohr (2012: 23), who states that although LMF describes itself as interoperable, "it remains rather vague on its application in the various contexts, and in particular of its application in human usage situations". Spohr's general graph-based formalism (Spohr 2012) can indeed be seen as an implementation of the LMF data model. His lexical resource, implemented in a graph-based OWL model, is based on a typed formalism, similar to the adaptations the WWW is undergoing to become the new Semantic Web (Spohr 2012: 38). Spohr places the lexeme at the focal point of the database, linking it for instance to its forms and senses (ibid.: 68). He nevertheless states that "ideally, we would like senses to be the primary lexical entities, as all kinds of lexical relations seem to be defined between senses" (ibid.: 67). Spohr, however, then argues against this concept, saying that when beginning with the item giving the sense (i.e. the item giving the paraphrase of meaning), it would not be possible to fill all other dependent fields, especially when acquiring lexicographic data from corpora (ibid.: 68). This issue will receive further attention in section 6.3.
We want to mention two further publications here, which describe a lexicographic database or data collection model for generating online dictionaries in particular. A database that supplies several dictionaries for specific purposes with data is described by Bergenholtz and Bergenholtz (2013). In their article titled "One database, four monofunctional dictionaries", the kind of model that was utilized is unfortunately not mentioned. However, they do point out some items defined for the resulting database, as well as the fact that the compilation of several online dictionaries from one database calls for a number of issues concerning its access features to be taken into account (see also our section 5 below). Bosch, Pretorius and Jones (2007) propose a model for machine-readable lexicons, not only for the South African Bantu languages, but for the Bantu language family as a whole. The data model in the form of an XML DTD is intended to include all linguistic information of the languages in question and "provides flexibility and handles the various representations specifically applicable to Bantu languages, thereby making it applicable to diverse uses of machine-readable lexicons" as language resources for use in large-scale HLT/NLP applications. Only a fragment of the DTD is presented in the publication.
The majority of articles concerned with online dictionaries, however, refer to their visual representation (e.g. Prinsloo 2010, which is related to their implemented access strategies); others are concerned with the acquisition of data to populate lexicographic databases (e.g. L'Homme 2012 and Scholze-Stubenrecht 2013).
The research for this paper resides within a project entitled "Scientific e-Lexicography for Africa (SeLA)³" (described, inter alia, by Heid 2012), and it is carried out by the University of Hildesheim (Germany), the University of South Africa, the University of Pretoria and Stellenbosch University (South Africa), and the University of Namibia in Windhoek (Namibia). The project intends to combine all of the above-mentioned issues: (1) designing a prototypical multifunctional database with the aim of compiling several monofunctional electronic dictionaries for the African languages; (2) solving the problem of data acquisition for resource-scarce languages; (3) defining "exactly which types of lexicographic data from the fact collection need to be selected in order to satisfy a given user need, as well as in deciding in which way such data have to be ordered and formatted (presented) for users with a given background and a given type of need" (Heid 2012: 438).
In the SeLA project, we are concerned with a multilingual African language data collection to be used for lexicographic purposes, which we will store in a MySQL database. For the time being, the aim is not to compile comprehensive dictionaries from the database. Seeing the final implementation as a prototype, we plan to use this database for several other purposes, for instance, as part of intelligent Computer Assisted Language Learning (iCALL) software.
We consider it necessary to strictly differentiate between the database, which should be flexible, in other words, open to internal and external resources (so far unknown) to be added in the future, and the presentation of the (internal and external) data to the users, which depends on their requirements (see section 5).We also foresee access to a prototypical Natural Language Processing (NLP) machine performing morpho-syntactic analyses.
The database model is to be implemented with a MySQL database. Such a database may consist of (1) content tables containing the data itself, (2) relational tables linking data items with one another, and (3) tables generated from the data and their relations, which are used for faster access. One might wonder why we do not use XML/OWL, like the most up-to-date data collection models described above. Besides the fact that the SeLA team lacks the capacity to develop a full-scale Dictionary Writing System (DWS) or to make use of one to compile a full-scale dictionary, we consider a populated MySQL database implementation as equal to a standoff XML system. In both systems, all necessary data items can be described and a number of types of relations between those data items can be modelled. SQL, however, additionally allows for a fast and easy implementation without the need for DTDs, XML editors or (commercial) Dictionary Writing Systems. Moreover, together with phpMyAdmin⁴, an online dictionary and the necessary maintainer facilities are speedily and simply implemented with a few PHP scripts. Another point of consideration is that most of the data will be imported from existing resources, which will populate the fields of the database only partially. The task of filling the gaps and generating full-scale dictionaries must be postponed to a later stage. Using MySQL for a start does not imply that XML/OWL will not be used in the future. If the means are found to fill the database with sufficient data to compile comprehensive dictionaries, porting one system to the other will indeed be possible.
In summary, we describe a lexicographic model in this article which should fulfil various requirements: (1) it should be open to a number of lexicographical functions, as several different monofunctional online dictionaries will be compiled from it; (2) it should cover the specific linguistic phenomena of the languages belonging to the Bantu language family; and (3) concerning data acquisition, as we will need to populate the database with any relevant data that can be collected semi-automatically, the database should be tolerant of missing data items, even if they are considered essential for producing a dictionary. Furthermore, we will describe our current approach towards data acquisition and data accessibility.

Aims
Our aim as part of the SeLA project is to design and develop a lexicographic database that will contain multilingual data of three of the official African languages of South Africa (i.e. Zulu, Northern Sotho and Xhosa). For some of these data sets, translation equivalents of South African English will be stored too. The data of other African languages, as well as Afrikaans, are foreseen to be added at a later stage. We begin by developing a database model, with the aim of fulfilling all the requirements to describe the language items thoroughly, while taking into account the languages in question and the external resources that are currently available. We take Spohr's (2012) data collection model into account too; however, as Spohr has suggested, we focus our attention on the polysemous senses of a word. The above-mentioned disadvantages (see the Introduction) play only a minor role for us because, for the languages concerned, only few resources are available that would allow for an automated filling of the database; most data will have to be added manually. The database will be utilised to compile a typologically diverse collection of prototypical monofunctional dictionaries (however, with few data sets), of which the majority are planned to be bilingual. Hence, we look at the requirements of a good outer and inner access structure (see section 5), resulting in the design of different dynamic graphical user interfaces (GUIs) to be developed. We will then examine ways and methods to import available external resources (the respective plans are described in section 6). Lastly, we plan to bind the resulting database into a language portal, a framework of lexicographic and other resources. We foresee linking it with other dictionaries, corpora, or other databases containing linguistic data, such as the ontology database of the part-of-speech items of Zulu and Northern Sotho described by Faaß, Bosch and Taljard (2012) or the e-learning tool "eZulu dictionary of possessives", which assists learners of the language in acquiring knowledge about producing possessive structures in Zulu, described by Bosch and Faaß (2014).
Setting the aims as described above, we need to examine aspects regarding macrostructures and microstructures of the foreseen dictionaries.On this basis, the data model can then be designed. http://lexikos.journals.ac.za

3. Aspects regarding the macrostructure and microstructure

Macrostructural elements for Bantu language dictionaries: a challenge of lemmatisation
The agglutinating nature of the Bantu languages, which goes hand in hand with a complicated nominal and verbal derivation system, indeed poses challenges for lemmatisation (Gouws and Prinsloo 2005: 67). Different approaches to lemmatisation, the main one being word versus stem lemmatisation in the case of nouns and verbs, play an important role in dictionary compilation.
Because of the conjunctive writing system of Zulu, whereby parts of speech are written together, even full sentences may appear as one orthographic word. The sentence bazokubona "they will see it", for example, consists of several morphemes: ba- (subject concord of noun class 2), -zo- (future tense marker), -ku- (object concord of noun class 15), -bon- (verb root = "see"), -a (verbal ending). We do not plan to enable our system to analyse such input data; however, linguistic verbs consisting of several morphemes should, in principle, be analysed so that users can receive the data on the items related to their query. Users interested in stems, on the other hand, should also be able to query those and get the data on all full forms containing a particular stem.
Concerning the disjunctively written Sotho languages, there are other challenges: the copulative of Northern Sotho, for example, consists of one or several disjunctively written morphemes. These morphemes are highly ambiguous and the copulatives generated from them are homographous, too. The many forms cannot all be described in a printed dictionary due to space constraints. However, even in an electronic dictionary, the task of describing all forms might turn out to be too complex. An attempt has been made to extract these forms from corpora by using regular expressions (Faaß and Taljard 2013); however, due to the many homographs, no system to distinguish them could be found. Such rather morpho-syntactic challenges can be related to the issue of accessibility. We therefore do not see the electronic dictionary itself as the best solution, but rather develop connected systems that could, for instance, assist learners in producing the correct form, such as a decision tree-like device (described in Prinsloo, Bothma, Heid and Faaß 2012).
In an electronic dictionary, these analyses of input data, however, belong to the access structure (see section 5), not to the data storage itself. One could, therefore, argue that in a lexicographic electronic data collection there is no macrostructure at all.
We place the sense element at the centre of our database, and since we link this sense with one (or more) orthographic forms and with a stem, we enable our system to allow for immediate access to stems of verbs and nouns, for instance, the Northern Sotho verb stem bona "[to] see", but also to full forms such as the Zulu address sobonana "see you (again)".Therefore, in terms of orthographic forms, we foresee simplex and complex words which are both related to sense elements.
The change of focus is exemplified in the following two figures. Figure 1 illustrates a possible entry describing the English verb "[to] see" and its Zulu counterpart "[uku]bona" in a traditional lexicographic database where the lemma is the central element, and is linked to two senses, each extended with an example. The two translation equivalents are linked with each other.
In Figure 2, the same data is viewed from the perspective of our proposed model, where English and Zulu data are entered independently, similar to Figure 1. The relational table "is_translation_of" states that sense 1 and sense 3 are translation equivalents. Note that in Figure 1, the metaphorical sense of "[to] see"/"[uku]bona" was described in each language in the element "sense 2". In the new model, such a sense description does not appear as such. Instead, a literal sense description of "[to] understand"/"[uku]qonda" is included together with an example ("I understand what you mean."). The literal senses 1/2 ("[to] see"/"[to] understand") and senses 3/4 ("[uku]bona"/"[uku]qonda") are then linked with each other by items in the table "is_synonym_to" (see section 3.3). In this table, we learn that the synonymy is metaphorical and we also see the respective example sentences ("I see what you mean"|"Ngiyabona ukuthi uthini").
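The sense-centred arrangement described for Figure 2 can be sketched roughly as follows. This is only an illustration under several assumptions: SQLite (via Python's standard library) stands in for MySQL, the content table "senses" and its columns are simplified and hypothetical, and the language codes "EN" and "ZU" are invented for the example. Only the sense numbers, the two relational table names and the example sentences are taken from the text.

```python
import sqlite3

# SQLite in memory stands in for the MySQL database named in the article.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE senses (            -- hypothetical, simplified content table
    sense_id INTEGER PRIMARY KEY,
    language TEXT NOT NULL,      -- 'EN' / 'ZU' are assumed codes
    form     TEXT NOT NULL,
    example  TEXT
);
CREATE TABLE is_translation_of (sense_a INTEGER, sense_b INTEGER);
CREATE TABLE is_synonym_to (
    sense_a INTEGER,
    sense_b INTEGER,
    kind    TEXT,                -- type of synonymy, e.g. 'metaphorical'
    example TEXT                 -- sentence showing the metaphorical use
);
""")

# Senses 1-4 as numbered in the description of Figure 2.
conn.executemany("INSERT INTO senses VALUES (?, ?, ?, ?)", [
    (1, "EN", "[to] see", None),
    (2, "EN", "[to] understand", "I understand what you mean."),
    (3, "ZU", "[uku]bona", None),
    (4, "ZU", "[uku]qonda", None),
])
conn.execute("INSERT INTO is_translation_of VALUES (1, 3)")
conn.executemany("INSERT INTO is_synonym_to VALUES (?, ?, ?, ?)", [
    (1, 2, "metaphorical", "I see what you mean"),
    (3, 4, "metaphorical", "Ngiyabona ukuthi uthini"),
])


def translations(form):
    """Forms linked to `form` through is_translation_of (one direction only)."""
    rows = conn.execute(
        """SELECT t.form FROM senses s
           JOIN is_translation_of r ON r.sense_a = s.sense_id
           JOIN senses t ON t.sense_id = r.sense_b
           WHERE s.form = ?""", (form,)).fetchall()
    return [row[0] for row in rows]


def metaphorical_synonyms(form):
    """(synonym form, example sentence) pairs marked as metaphorical."""
    return conn.execute(
        """SELECT t.form, r.example FROM senses s
           JOIN is_synonym_to r ON r.sense_a = s.sense_id
           JOIN senses t ON t.sense_id = r.sense_b
           WHERE s.form = ? AND r.kind = 'metaphorical'""", (form,)).fetchall()
```

In this sketch the metaphorical reading of "[to] see" is not a sense of its own; it is recovered by following the "is_synonym_to" row to "[to] understand", which is the point of the change of focus.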

Microstructural items
We began with a general list of items which are usually part of the microstructural items in any dictionary, such as the lemma sign, its paraphrases of meaning, etcetera. For each of these items, we decided whether we require them for our database. Afterwards, we added all items that usually appear in the respective African language dictionaries that we are concerned with. We subsequently categorised the items which we currently foresee: we generally differentiate between the categories "descriptions", "morpho-syntax", "phonetics", "etymology", "valency", "examples" and "idioms". Each of the tables representing these categories contains its microstructural items. As described above, we need to differentiate between data items to be filled for stems (the ones that are not identical with full forms) and data items to be filled for full forms. Table 1 shows the items foreseen, irrespective of the language they belong to. For the African languages, we add information on whether the item is described for full forms, for stems or for function words.

Relational tables
During and after the storing of the data items in the database with their respective descriptive items, additional tables describing the relations between them will be defined. In addition to the usual morpho-syntactic relations (e.g. "is-plural-of"), semantic relations are described too (e.g. "is-near-synonym-of"). So far, we do not foresee adding WordNet data. However, this is possible from a technical perspective, since the development of a prototype African Wordnet (AWN), which currently includes four languages, is an on-going project (Griesel and Bosch 2014). The resource has been developed by translating Common Base Concepts (CBC) from English and currently holds roughly 42 000 synsets.
To assign translation equivalents, we use the relation "is-translation-of". A rather general relation will be added as well: "is-linked-with" will contain relations between items not described in the others (i.e. miscellaneous kinds of relations that do not appear frequently enough to justify a relational table of their own). This last table, however, will contain a data field where the type of relation is explained.
We relate senses of lemmas with the following tables:
- is-diminutive-of (for nominal items only)
- is-plural-of (for nominal items only)
- is-locative-of (for nominal items only)
- is-stem-of (see lemmatisation strategy above)
- is-translation-of (relates items of different languages to each other)
- is-contained-in-example-sentence
- is-contained-in-fixed-expression
- is-contained-in-idiom
- has-morpho-syntax (relates a specific id of a type of morpho-syntactic item to one sense)
- has-phonetics
- has-valency (relates a specific id of a type of valency to one sense of an item taking arguments)
- is-linked-with
For space reasons, we describe only two of the tables in the following sections.

"has-morpho-syntax"
In any typical dictionary, the microstructure contains information on the morphology and syntax of a lemma. Such information is repetitive not only for parts of speech appearing several times, but also for their morphological properties. Plural morphemes of English, for example the "-s" appearing in nouns like "type - types" and "house - houses", need to be described only once in our model. We plan to fill a table called "morpho-syntax" with all the categories that appear (e.g. noun, -s). Each of the categories receives a unique id. In the relational table "has-morpho-syntax", we link the sense descriptions with one or several id(s) of morpho-syntactic categories that apply to them.
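A rough sketch of this "describe once, link many times" idea is given below. It assumes SQLite in place of MySQL, writes table names with underscores instead of hyphens, and uses a hypothetical, much reduced "senses" table and a hypothetical "plural_morpheme" column; the article does not specify the columns of "morpho-syntax" at this level of detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.executescript("""
CREATE TABLE morpho_syntax (        -- one row per category, stored only once
    morph_id        INTEGER PRIMARY KEY,
    part_of_speech  TEXT NOT NULL,
    plural_morpheme TEXT            -- hypothetical column for the example
);
CREATE TABLE senses (               -- hypothetical, reduced content table
    sense_id INTEGER PRIMARY KEY,
    form     TEXT NOT NULL
);
CREATE TABLE has_morpho_syntax (    -- relational table: sense <-> category
    sense_id INTEGER NOT NULL,
    morph_id INTEGER NOT NULL,
    PRIMARY KEY (sense_id, morph_id)
);
""")

# The category (noun, -s) is entered a single time ...
conn.execute("INSERT INTO morpho_syntax VALUES (1, 'noun', '-s')")
# ... and both nouns from the text are linked to its id.
conn.executemany("INSERT INTO senses VALUES (?, ?)", [(1, "type"), (2, "house")])
conn.executemany("INSERT INTO has_morpho_syntax VALUES (?, ?)", [(1, 1), (2, 1)])


def forms_with_category(morph_id):
    """All forms whose sense is linked to the given morpho-syntactic id."""
    rows = conn.execute(
        """SELECT s.form FROM has_morpho_syntax h
           JOIN senses s ON s.sense_id = h.sense_id
           WHERE h.morph_id = ? ORDER BY s.form""", (morph_id,)).fetchall()
    return [row[0] for row in rows]


category_count = conn.execute("SELECT COUNT(*) FROM morpho_syntax").fetchone()[0]
```

The same pattern would apply to the "has-valency" table, with valency types taking the place of morpho-syntactic categories.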

"has-valency"
Concerning the valency (or "valence", as described by Spohr 2012: 86f) of a lexicographic item, a similar situation occurs: one type of valency, for example "verb, taking no object", can be linked with several words ([to] sit⁵, [to] walk, etc.). We handle the situation in the same way as for the "has-morpho-syntax" table described above. A unique id is assigned to each valency type and sense descriptions are then related to the ids that apply to them. Some relations between items will be added manually. For this purpose, and for the purpose of checking and correcting the data that will be inserted automatically (see section 6.4), the database will offer a maintainer interface.

Design and implementation method
In this section, we compile the items described above and define a basic lexicographic model where each category represents one table of the database (DB); see Figure 3, which, due to space constraints, does not show all of the items. In our model, we tentatively define relations between items, while keeping them open for future changes by storing them in separate tables. In MySQL, each item is identified via an "id" data element (e.g. "sense-id" identifying one specific paraphrase of meaning). Such identifiers are marked as "primary key", which means that each may only appear once in the respective table. In the model shown in Figure 3, each of the items contained is to be pre-defined with respect to its type: "int" stands for integer, "varchar" for any kind of character string. Lastly, "link" means that a URL will be entered. In an SQL database, relations between items of tables are to be described, which reflect dependencies between items (we can also define items as "hierarchies", as is done in XML or in object-oriented database systems). A paraphrase of meaning, for example, should directly be related to one or several example sentences, similar to an integrated microstructure. The relation between those items is, therefore, 1:n, where "n" stands for any integer number greater than zero. For example, the relation between the items "sense-id" of the table "descriptions" and "sent-id" of the table "example sentences" could be defined as "1:n". However, it could very well be the case that we could use one example sentence several times, by assigning several lemmas (or rather senses of those words) to it. Therefore, we do not enforce the 1:n relation by directly linking items (e.g. by foreign keys), but rather implement the word sense/example sentence relation by assigning a unique key to each of those items in the respective tables and by adding a separate table linking those ids to each other, see Figure 4.

Figure 4: Adding relations between word-sense and example-sentence
The positive aspect of such an implementation is its openness towards a redefinition of relations between items; a negative aspect might be that such tables lead to slow query processing in the database. Therefore, in our second phase of implementation (i.e. after the available data have been stored in the database), we will automatically generate additional tables, each containing all relevant data for one of the dictionaries. Users will have access to each one of these tables with one mouse click and one or several query words.
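The two steps described here, a separate linking table between word senses and example sentences followed by a generated table per dictionary, might look roughly like the sketch below. SQLite again stands in for MySQL; the generated table name "dict_monolingual", the reduced column sets and the sense "[to] mean" are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.executescript("""
CREATE TABLE descriptions (
    sense_id          INTEGER PRIMARY KEY,
    short_description TEXT NOT NULL
);
CREATE TABLE example_sentences (
    sent_id  INTEGER PRIMARY KEY,
    sentence TEXT NOT NULL
);
-- Separate linking table: one sentence may serve several senses and
-- one sense may have several sentences, so nothing is fixed to 1:n.
CREATE TABLE is_contained_in_example_sentence (
    sense_id INTEGER NOT NULL,
    sent_id  INTEGER NOT NULL,
    PRIMARY KEY (sense_id, sent_id)
);
""")

conn.executemany("INSERT INTO descriptions VALUES (?, ?)", [
    (1, "[to] see"),
    (2, "[to] understand"),
    (3, "[to] mean"),              # illustrative sense, not from the article
])
conn.executemany("INSERT INTO example_sentences VALUES (?, ?)", [
    (1, "I understand what you mean."),
    (2, "I see what you mean."),
])
conn.executemany("INSERT INTO is_contained_in_example_sentence VALUES (?, ?)", [
    (1, 2), (2, 1), (3, 1), (3, 2),
])

# Second phase: generate one flat table per dictionary so that a user query
# needs no joins. 'dict_monolingual' is a hypothetical name.
conn.execute("""
CREATE TABLE dict_monolingual AS
SELECT d.sense_id, d.short_description, e.sentence
FROM descriptions d
JOIN is_contained_in_example_sentence l ON l.sense_id = d.sense_id
JOIN example_sentences e ON e.sent_id = l.sent_id
""")


def examples_for(query):
    """Look up example sentences in the generated table only."""
    rows = conn.execute(
        "SELECT sentence FROM dict_monolingual "
        "WHERE short_description = ? ORDER BY sentence", (query,)).fetchall()
    return [row[0] for row in rows]


generated_rows = conn.execute("SELECT COUNT(*) FROM dict_monolingual").fetchone()[0]
```

A generated table of this kind has to be rebuilt whenever the content or relational tables change, which is the price paid for the faster lookup.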

5. Data presentation: access structure

Bergenholtz and Gouws (2010: 103) maintain that "of critical importance in a user-driven lexicographic approach is the need to ensure that the target users of a specific dictionary gain unimpeded access to the data they need in order to achieve an optimal retrieval of information". Such accessibility is typically ensured by the access structure of any given dictionary. We adhere to the definition of Wiegand and Beer (2013: 111), who define accessibility as follows: "The term 'data accessibility' refers to the access willingness and thereby to the possibility to look up textual and illustrative lexicographical data; it is given because the data are in the access domain of an access structure. A distinction is made between the external and the internal data accessibility".
In printed dictionaries, the first step is determined by the knowledge a user has of the specific dictionary. A user could embark on either the full or a shortened outer access process, reaching the desired lemma via a rapid access structure, for instance thumb index markers or alphabet letters, or by merely guessing where the relevant item will be and then following the running heads until the desired page has been reached. Going down the lemmata, the desired guiding item can then be found, that is, the item where the inner access route commences. In e-dictionaries, a single word or multi-word string is typically typed into the search box and this will immediately guide the user to the required lemma sign without bringing any other outer access items into play. Other systems offer a rapid access structure in the form of a list of clickable lemma signs from which the user can select the required one. It is also possible to offer both, as described by Bothma and Gouws (2013).
The selection and the order of appearance of the data items both depend on several factors: (1) The type of dictionary: a bilingual dictionary will require a translation equivalent to appear, while a monolingual one will not. (2) The part of speech of the lemma: some parts of speech need to be displayed with valency information; for others, valency plays no role. (3) The access route: the first resulting screen of a query will display only a few items; from there, the user may click the respective boxes on the screen to get more data (e.g. etymological information, idioms or example sentences). For each of the microstructural items above, we need to define when it will appear on the screen (given that an orthographic form was entered as a query and this form was found in the database). Table 4 shows these decisions for several of the microstructural items above when a general monolingual dictionary is compiled; due to space constraints, not all assignments can be shown.

Solely on demand short paraphrase of meaning
Table 4: Examples of microstructural items being assigned to specific use situations

In section 4 above, we mentioned that for each of the foreseen dictionaries we will generate one table in the database containing all the necessary data. Table 4 above shows the elements of such a table for the planned monolingual dictionary of Northern Sotho.

External links
From a technical perspective, the database is planned to be connected, inter alia, with a morphological analyser. This is essential especially for the African languages that are written conjunctively; a user may enter, for instance, the orthographic word abazukukhombisa "they will not show it" without knowing that this expression consists of a number of morphemes: a- (negative morpheme), -ba- (subject concord class 2), -zu- (future tense negative morpheme), -ku- (object concord class 15), -khomb- (verb root), -is- (causative extension), -a (verbal ending). Whenever such a query word is not found as a lemma by the database, this morphological analysis will be executed in order to deliver the linguistic units and their parts of speech, which will be queried automatically by the system. The user will then see the results for each of the parts presented by the system and can select the items he or she is interested in to get further information displayed.
On the other hand, a user might enter a stem of a word; in this case, we will use the morphological analyser as a generator and will generate full-form words which could be queried in the database. It is foreseen to then suggest this list to the user, so that the user can subsequently choose the ones he or she wants to know more about. Concerning productive purposes, we also foresee (user-activated) connections with the decision-tree system developed in the framework of the SeLA project (e.g. described in Prinsloo, Bothma, Heid and Faaß 2012). Another option will be to access corpus data; however, only maintainers will be allowed to see the whole of the data, as one cannot assume that all corpus data would be usable for exemplifying the meaning of a word (see section 6). The maintainers will then be able to choose example phrases or sentences to be added to the database.
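The fallback lookup for conjunctively written words described above can be sketched as follows. Everything here is a toy assumption: the "analyser" is merely a lookup table holding the single segmentation given in the text, and the two dictionaries with their record ids are hypothetical stand-ins for the database. A real morphological analyser would be an external component.

```python
# Hypothetical stand-in for the morphological analyser: it only knows the
# one segmentation quoted in the text.
TOY_ANALYSES = {
    "abazukukhombisa": [
        ("a", "negative morpheme"),
        ("ba", "subject concord class 2"),
        ("zu", "future tense negative morpheme"),
        ("ku", "object concord class 15"),
        ("khomb", "verb root"),
        ("is", "causative extension"),
        ("a", "verbal ending"),
    ],
}

# Hypothetical database content: full forms and linguistic units mapped to
# invented record ids.
FULL_FORMS = {"sobonana": 10}
UNITS = {
    ("khomb", "verb root"): 1,
    ("ba", "subject concord class 2"): 2,
}


def analyse(word):
    """Return (unit, part of speech) pairs, or an empty list if unknown."""
    return TOY_ANALYSES.get(word, [])


def lookup(query):
    """Try the query as a full form first; otherwise query each analysed unit.

    Returns (unit, part of speech, record id or None) triples so that the
    user interface can list every part and let the user pick one.
    """
    if query in FULL_FORMS:
        return [(query, "full form", FULL_FORMS[query])]
    return [(unit, pos, UNITS.get((unit, pos))) for unit, pos in analyse(query)]
```

Units without a database record are returned with `None` rather than dropped, so that a maintainer could see which parts of an analysed word are still missing from the data collection.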

6. Resources to be added to the database

It would be virtually impossible to fill such a database from scratch: corpora are scarce and the ones that do exist lack a description of their contents and are, therefore, not feasible for an automated retrieval of dictionary contents. However, there are some resources that we can indeed utilise for a start, as described below.

Available resources for the project
Language data for Northern Sotho is currently available in the form of a printed dictionary (Ziervogel and Mokgokong 1985), which was scanned⁶ into electronic format by means of Optical Character Recognition (OCR) and transformed at least partially into a structured data collection (Kebbe 2013). We also use a MySQL database containing about 600 full Zulu forms and their English translation equivalents, generated in the SeLA sub-project on a Zulu dictionary of possessive constructions (Bosch and Faaß 2014). Lastly, we also have access to a file containing several thousand Xhosa nominal stems, information on the noun classes they appear in and their translations into English.

Other possible resources
In South Africa, the co-ordination of language resources is still in its infancy; however, the function of the newly established Language Resource Management Agency (RMA) is to develop and host reusable text and speech resources, and to manage and distribute these from one central point. Currently, the relevant resources available are Annotated Text Corpora for all official languages of South Africa, annotated with lemma, part of speech and morphological analyses. Initial versions of core technologies, namely lemmatisers, part-of-speech taggers and morphological decomposers, are available as open source modules and could, therefore, be used for the annotation of text corpora of the various Bantu languages, although Eiselen and Puttkammer (2014: 3702) point out that "there is still a lot of room for improvement, especially for lemmatisation and morphological decomposition".

Adding resources to the database
Besides several corpora for the African languages, which we are permitted to use for purposes such as checking corpus frequencies of occurrences⁷ (to be added manually to the database at a later stage), we were also able to get access to a scanned dictionary (Ziervogel and Mokgokong 1985). Unfortunately, the files we received were in Word format, and all items were in the same font, so it was impossible to automatically identify item types by their format. Judith Kebbe, a student of information science at the University of Hildesheim, worked out an automated method to identify item types by their position in the dictionary article (Kebbe 2013) and wrote Perl scripts extracting those items, based on the descriptions of Faaß, Ramagoshi and Sebolela (2009). Her work resulted in structured, machine-readable data covering about half of the entries of the dictionary. As it turned out, however, the microstructure of this dictionary is not structured consistently; the automated method often failed, especially when trying to extract translation equivalents. Another problem is described by Kosch (2013: 204), who points out the mixed lemmatisation approach of this dictionary, whereby a word approach is applied to nouns with irregular or non-overt class prefixes, although the overriding approach in the dictionary is stem-based. The user is then given a cross-reference to the relevant stem. Examples are nouns such as mmuši "ruler" and pono "vision", which are lemmatised as words and not as stems. According to the stem-based approach, the two nouns would have been lemmatised as buši and bono, derived from the verb stems -buša "rule" and -bona "see" respectively. Kebbe extracted several thousand links between dictionary entries, but only a few dictionary entries describing translation equivalents. Hence, these data will be loaded into our database to cater for monolingual Northern Sotho only, mainly to test relational tables such as "is-linked-with".
Bosch and Faaß (2014) populated a MySQL database with about 600 Zulu nouns and about 900 English translation equivalents; information on their classes and numbers is also stored in this database. We will transfer these data to the SeLA database as well.
Bilingual Xhosa-English data was made available to us in .xls format. This file does not contain surface forms, but several thousand noun stems, the classes they appear in, class prefixes, and English translation equivalents. By way of shell scripts, we will generate full forms and fill the database with the respective data.
With the available resources, we cannot fill the sense descriptions in most cases; therefore, we will have to add them manually.During the import of the data, we plan to use English translations to have these mandatory fields filled, but these will have to be replaced manually with monolingual sense descriptions.As our team will not have the manpower to fill all of the foreseen database items, we plan to send out calls to the public, trying to find volunteers, as soon as the graphical user interfaces have been completed.For our aim to compile prototypical dictionaries, we consider the available data to be sufficient.

An example: monolingual Northern Sotho data
This article describes a lexicographic model which is still awaiting implementation. While implementing it, we might find errors or inconsistencies that will force us to change the model. Therefore, at this stage, we can only describe data that was examined during the development of the model. We chose the dictionary of Ziervogel and Mokgokong (1985), which contains several thousand noun stems with additional information. One of the dictionary entries contains data on the noun stem mente:
With these data, we cannot provide a Northern Sotho sense description to fill the mandatory item "short description" in the "descriptions" table of our database. As a first step, we therefore plan to write scripts that make use of the English translation. The scripts, however, add the note "TO-BE-TRANSLATED-INTO-NSO" to indicate the manual reworking that is foreseen at a later stage. The "language" field can be filled automatically because we know that this is NSO data. Optional elements (as shown in Figure 3) are not filled:
1. Table "descriptions": sense-id: 1, language: NSO, short-description: TO-BE-TRANSLATED-INTO-NSO: mint (where money is coined), paraphrase-of-meaning: empty field, subject-area: empty field, frequency: empty field, visualization: empty field.
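The filling of the "descriptions" table could look roughly like the following Python sketch. The function name `make_description` and the use of a dictionary with `None` for empty fields are assumptions for illustration; only the field names, the note text and the example values come from the description above.

```python
PLACEHOLDER = "TO-BE-TRANSLATED-INTO-NSO"


def make_description(sense_id: int, english: str, language: str = "NSO") -> dict:
    """Return one row for the "descriptions" table.

    The mandatory short description is filled with the English translation,
    prefixed with a note marking it for later manual reworking.
    Optional items are left empty (None).
    """
    return {
        "sense-id": sense_id,
        "language": language,
        "short-description": f"{PLACEHOLDER}: {english}",
        "paraphrase-of-meaning": None,
        "subject-area": None,
        "frequency": None,
        "visualization": None,
    }


row = make_description(1, "mint (where money is coined)")
print(row["short-description"])
```

Keeping the note as a fixed prefix means that all rows still awaiting a monolingual description can later be found with a simple prefix search.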
Next, we process the information on morpho-syntax: cppl stands for "class prefix plural", which indicates that this new database entry describes a noun. As this prefix is di and no singular prefix is given, the scripts can automatically assume that the noun is of class 9, which means that the orthographic form of the singular is identical to the stem. The singular orthographic form is therefore mente and the plural form dimente. Since this dictionary uses diacritics to indicate tone, we also learn about the high tone on the second vowel. The scripts can hence fill several tables:
2. Table "morpho-syntax": morph-id: 1, part-of-speech: noun, person-number-class: 03-sg-09; morph-id: 2, part-of-speech: noun, person-number-class: 03-pl-10.
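The inference just described can be sketched in Python as follows. The function name `noun_forms` and the extra key "form" are assumptions for illustration (the orthographic forms may well be stored in a separate table). Only the single pattern discussed above is covered, and tone marking is ignored.

```python
from typing import Optional


def noun_forms(stem: str, plural_prefix: str,
               singular_prefix: Optional[str] = None) -> list:
    """Derive morpho-syntactic rows for a noun entry.

    Only the case described in the text is handled: plural prefix "di" and
    no singular prefix, which is taken to indicate class 9 (singular) and
    class 10 (plural). Anything else is left for manual treatment.
    """
    if plural_prefix == "di" and singular_prefix is None:
        return [
            {"morph-id": 1, "part-of-speech": "noun",
             "person-number-class": "03-sg-09", "form": stem},
            {"morph-id": 2, "part-of-speech": "noun",
             "person-number-class": "03-pl-10", "form": plural_prefix + stem},
        ]
    raise ValueError("prefix pattern not covered by this sketch")


for entry in noun_forms("mente", "di"):
    print(entry)
```

Raising an error for uncovered patterns, rather than guessing a class, keeps doubtful entries visible for manual checking.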
Lastly, the scripts will fill the necessary relational tables creating links between the items.
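How such links might be stored is sketched below. An in-memory SQLite database stands in for the planned MySQL database. The link table `sense_has_morph` and all column names are hypothetical and are not taken from the model in Figure 3; the inserted values are those of the mente example.

```python
import sqlite3

# In-memory SQLite stands in for the planned MySQL database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE descriptions (
    sense_id INTEGER PRIMARY KEY,
    language TEXT NOT NULL,
    short_description TEXT NOT NULL
);
CREATE TABLE morpho_syntax (
    morph_id INTEGER PRIMARY KEY,
    part_of_speech TEXT NOT NULL,
    person_number_class TEXT NOT NULL
);
-- Hypothetical link table: which morpho-syntactic entries realise a sense.
CREATE TABLE sense_has_morph (
    sense_id INTEGER NOT NULL REFERENCES descriptions(sense_id),
    morph_id INTEGER NOT NULL REFERENCES morpho_syntax(morph_id),
    PRIMARY KEY (sense_id, morph_id)
);
""")

con.execute("INSERT INTO descriptions VALUES (?, ?, ?)",
            (1, "NSO", "TO-BE-TRANSLATED-INTO-NSO: mint (where money is coined)"))
con.executemany("INSERT INTO morpho_syntax VALUES (?, ?, ?)",
                [(1, "noun", "03-sg-09"), (2, "noun", "03-pl-10")])
con.executemany("INSERT INTO sense_has_morph VALUES (?, ?)", [(1, 1), (1, 2)])

# Retrieve all morpho-syntactic entries linked to sense 1.
linked = con.execute("""
    SELECT m.person_number_class
    FROM sense_has_morph AS l
    JOIN morpho_syntax AS m ON m.morph_id = l.morph_id
    WHERE l.sense_id = ?
    ORDER BY m.morph_id
""", (1,)).fetchall()
print(linked)
```

A separate link table, instead of a foreign key inside "morpho-syntax", allows one form to be linked to several senses, which fits a design in which the sense rather than the lemma is the central item.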

Summary and future work
This article describes the design of a lexicographic data model which will be implemented with MySQL, resulting in a database capable of storing lexicographic data of several of the official languages of South Africa. We aim at compiling several prototypical dictionaries from it: a monolingual Northern Sotho dictionary, a bilingual Xhosa-English general language dictionary and a bilingual English-Zulu learners' dictionary. We have compiled lists of necessary microstructural elements and have decided to put the sense description at the centre, the "lemma" being just a realisation of the sense, in other words its surface form.
We have collected a number of resources, which will be loaded into the database semi-automatically. At this stage, it is foreseen that all missing data items will have to be added manually due to the lack of available resources. It is well known that the development of resources for African languages is often of a fragmented nature: the resources tend to be small and only usable for restricted purposes, which prevents connecting them with other resources. We therefore intend to investigate collaborative approaches and technologies for the accumulation and creation of data to ensure the continued filling of this lexicographic database (cf. Benjamin 2014).

Figure 1: Illustration of the traditional data model: focus on the lemma
Figure 2:

Figure 3: The basic database model showing tentative relations between data items

Table 1: Microstructural items for all languages contained in the database
The items we need for the African languages only are listed in Table 2.

Table 2: Microstructural items contained for the African languages only
Lastly, Table 3 contains the items only used for Afrikaans or English, respectively. We do not claim the tables to be comprehensive; other items might be added at a later stage.

Table 3: Microstructural items necessary for non-African languages only