Corpus-driven Bantu Lexicography Part 1: Organic Corpus Building for Lusoga

This article is the first in a trilogy that deals with corpus-driven Bantu lexicography, which is illustrated for Lusoga. The focus here is on the building of a so-called 'organic corpus' from scratch, while the next two instalments will deal with the use of that corpus on the macrostructural and microstructural levels, respectively. Not many detailed descriptions of corpus-building efforts exist for Bantu languages, so each and every step is discussed in detail, paying particular attention to the parameters that have to be taken into account, while not losing sight of the need to log the metadata either.


1. Goal of the present study

In this article we wish to show how an electronic corpus for a Bantu language, especially an under-resourced Bantu language, may be assembled from scratch.
We have lexicographic applications in mind, but such corpora may also be used (and have successfully been used) for Bantu corpus linguistics studies more generally. While Bantu corpora have been built for about two decades now, explicit descriptions of their composition are rare in the literature. For instance, in his MA dissertation de Schryver (1999: 103-117) devotes about 14 pages to the design, structure, contents and text collection of a 300 000-word Cilubà corpus, but to this date that study remains unpublished. When it comes to the descriptions of the corpora that have been assembled for the South African Bantu languages, these are typically less than a page long (de Schryver and Prinsloo 2000). On the other hand, corpus stability tests have been carried out for the South African Bantu languages (Prinsloo and de Schryver 2001, Prinsloo 2015), as well as attempts at multilingual corpus building and multilingual data extraction (de Schryver 2002, Prinsloo and de Schryver 2005). Scientific articles on the Zimbabwean corpora built under the umbrella of ALLEX/ALRI tend to focus on specific topics, such as tagging issues for a Shona corpus (Chabata 2000) or the sociolinguistic, political and economic considerations that influence the contents of a corpus of Zimbabwean Ndebele (Hadebe 2002). Even the latest version of the widely-used Helsinki Corpus of Swahili is not accompanied by a proper description (Hurskainen 2016).
The only exceptions to this pattern seem to be the corpora built to carry out corpus linguistics studies at BantUGent (i.e., the UGent Centre for Bantu Studies) where, for instance, the PhDs of Mberamihigo (2014), Nshemezimana (2016) and Misago (2018) describe the various Kirundi corpora built, or where the PhD of Kawalya (2017) describes the Luganda corpus that he used for his study. A description of the building of a Lingála corpus may be found in the PhD of Sene-Mongaba (2013), reworked and expanded as Sene-Mongaba (2015). Our effort (Nabirye 2016), on which the Lusoga case study presented below is based, is also the result of PhD research undertaken at BantUGent.
With regard to corpus-building efforts for Lusoga, only one exploratory study has appeared so far (Nabirye and de Schryver 2011). In that study, the main focus was on the writing problems that the corpus builder encounters during the transcription of oral material and the implications for the corpus lexicographer when data is extracted from such a corpus. In contrast, of particular interest in the present study will be the parameters/axes that can be used to characterise the composition of a Bantu-language corpus, these being, in addition to oral vs. written, also the distribution of the sources, the periods, the genres and the topics. Orthographic issues will only briefly be recapped here. Furthermore, the value of detailed corpus documentation will be exemplified; this will be done by means of the inclusion of and reference to a comprehensive addendum. Corpus-query software will be mentioned in passing.

2. The Lusoga language and publications in Lusoga

Lusoga is a largely undocumented Great Lakes Bantu language classified as JE16 (Guthrie 1948, Maho 2009). According to the Uganda Bureau of Statistics, 2 062 920 people identified themselves as Basoga in 2002 (UBOS 2006: 12), a figure that grew by nearly half to a respectable 2 960 890 by 2014 (UBOS 2016: 71). While immediately acknowledging that not all people who claim to be Basoga also necessarily speak 'Lusoga', however defined,¹ one should still realise that several million people currently speak Lusoga, of whom about two million are monolingual. While it might surprise that a language with up to three million speakers may be largely undocumented, it is fitting to recall that there are even endangered languages with millions of speakers (Adelaar 2014).
Lusoga was first reduced to writing near the end of the 19th century, as pointed out by Condon a century ago:

The Basoga Batamba had no written characters. Nor do any writings on rocks or pictorial characters exist. According to native report - and I mean natives of a ripe old age - there never was, as far as they remember, any means whatever of placing down their verbal utterances. All messages from one chief to another were committed to a trustworthy man, who learned the communication by heart, and so delivered the message by word of mouth. It is only within the last 15 years that the language of this people has been put in book form. (Condon 1911: 368)

The very first language data for Lusoga may be found in the 'vocabularies' included in Johnston (1902: 980-991) as well as in Condon (1911). However, we have found no evidence to suggest that Lusoga was documented in earnest prior to the 1960s. The earliest reference uncovered so far with an exclusive focus on Lusoga is the orthography of Byandala (1963). That booklet was followed by the documentation of Lusoga proverbs and riddles in Lyavala-Lwanga (1967, 1969). There is no record of Lusoga materials produced during the 1970s or the 1980s. Writing on and in Lusoga was again picked up in the 1990s. The first Lusoga publication in this period was the second version of the Lusoga orthography: Kajolya (1990). It was followed by two attempts at publishing a newspaper, which faltered shortly after: Kodh'eyo (1997-98) and Ndimugezi (1998-99). From the late 1990s and early 2000s onwards, the main output in Lusoga has come from the Cultural Research Centre (CRC), a religious body based in Jinja (e.g., CRC 1998a, 1999a, b, c, d, e, f, g, h, 2000a, b, 2002, 2005a, Kaluuba et al. 2010, CRC 2011).²

Also, one very prolific writer is Gulere who, amongst others, self-published ten children's story books, which he placed online in various locations at various times and in various formats (Gulere 2011a, b, c, d, e, f, g, h, i, j). Gulere moreover self-published two translations, one of Antigone, a tragedy by the ancient Greek playwright Sophocles from 441 BC (Gulere 2007a), another of The Bride, a play in English by the Ugandan Austin L. Bukenya from 1987 (Gulere 2007b).³ Lastly, a first novel has now been published in Lusoga, written by Kuunya (2011a).

3. Building a corpus for Lusoga

3.1 Towards an organic (but structured), general-language, synchronic Lusoga corpus

The basics of corpus building for the Bantu languages have been described by de Schryver and Prinsloo (2000). The two important concepts that also applied to the building of our Lusoga corpus are that of an 'organic corpus' and that of a 'structured corpus'. An 'organic corpus' has been defined by Atkins, Clear and Ostler as follows:

[...] a corpus may be thought of as organic, and must be allowed to grow and live if it is to reflect a growing, living language. [...] In order to approach a 'balanced' corpus, it is practical to adopt a method of successive approximations. First, the corpus builder attempts to create a representative corpus. Then this corpus is used and analysed and its strengths and weaknesses identified and reported. In the light of this experience and feedback the corpus is enhanced by the addition or deletion of material and the cycle is repeated continually. [...] In our ten years' experience of analysing corpus material for lexicographical purposes, we have found any corpus - however 'unbalanced' - to be a source of information and indeed inspiration. Knowing that your corpus is unbalanced is what counts. (Atkins et al. 1992: 1, 4, 6)

De Schryver and Prinsloo link this to what they call a 'structured corpus' as follows:

Formulated differently, it is any corpus compiler's task to attempt to assemble a representative corpus for his/her specific need(s). Subsequent additions and deletions of sections should be seen as a balancing activity to rectify initial weaknesses, but more importantly, also to take account of and track a growing, living language. As such, there is no such thing as 'the' corpus of a certain language (variety). Rather, at any point in time one selects a certain number of texts from the range of available electronic texts (which might or might not be grouped together into sub-corpora), and uses 'a' corpus for the specific research one wishes to pursue. The minimum requirement for any organic corpus is thus that the corpus compiler(s) will have attempted to put some structure in assembling the range of electronic texts. Within this framework, any first attempt at compiling an organic corpus will at least result in a structured corpus. (de Schryver and Prinsloo 2000: 92)

Our Lusoga corpus is both structured and organic. On the whole, the organicity means that the overall size has increased and decreased over the years.
Corpus building for the Bantu languages is always slightly opportunistic, in that one adds the little existing written material one can get hold of, except when a serious imbalance results. In other words, to get going, one often makes do with an 'imperfect corpus', which is then modified later on, when 'better' data becomes available. Over and above this balancing act, the corpus used should always attempt to be representative of the population that is the subject of the planned description or research. For a general-language corpus, the goal is consequently to acquire as many different genres as possible, that deal with as wide a topic range as possible. Existing written material for all but a few Bantu languages is unfortunately biased in this respect. Most are the result of (modern) missionary activities, so the genre Biblical documents tends to be overrepresented in many Bantu corpora. Conversely, for Bantu languages with a varied, vibrant and ongoing online media presence, the genre Journalism may be overrepresented, and within that, topics such as Sports and Politics. Of course, when the aim is to describe features of biblical works or journalistic texts, then such types of corpora may indeed be 'representative', and when multiple sources have been equally sampled, these corpora may also be 'balanced'. But if the goal is to describe the general language, then an effort needs to be made to achieve both representativeness and balance in another way. It is here that the material found in the oral component of a corpus may bring a solution, as it did for our Lusoga corpus (cf. infra, §3.5.1).
Another important point concerns the time period covered by a Bantu corpus. In all but a few cases, this will be 'the present', with that present optionally stretching back to a number of decades, maximum half a century. Although attempts are being made to build Bantu corpora with time-depths of at least half a century down to a century - such as for Zulu (de Schryver and Gauton 2002), Kirundi (Mberamihigo et al. 2016) and Luganda (Kawalya et al. 2018) - the only Bantu corpus containing substantial amounts of diachronic data that has been built (and used)⁴ is the set of corpora for the Kikongo Language Cluster, where some parts are up to four centuries old, while others go back to around 250 years ago (Bostoen and de Schryver 2015). For Lusoga, the aim has always been to build a synchronic corpus covering the general language. Material older than a few decades is in any case extremely rare for Lusoga (cf. supra, §2). When available, it was nonetheless included in an attempt to widen the genre/topic range.

The 0.5m Lusoga corpus
A first Lusoga corpus, of about half a million words, was built as part of the research leading to an MA dissertation. Its composition is as shown in Table 1 (adapted from Nabirye (2008: 70)).

[...] (Nabirye 2009), being a monolingual Lusoga dictionary compiled without the use of a corpus. The reasoning at the time was that because the example sentences from that dictionary were the result of original fieldwork, they could as well form part of a Lusoga corpus. A number of reports written in Lusoga (from the Busoga clan leaders, the private sector, academia, etc.) were also added, as was the initial impetus for a true oral part of the Lusoga corpus (i.e., the first few transcriptions of conversations, interviews and songs).
The make-up of this Lusoga 'noun corpus' is as shown in Table 2 (taken from de Schryver and Nabirye (2010: 100)). This version of the Lusoga corpus contained about 870 000 running words (tokens), and about 150 000 orthographically different words (types). Not only the transcriptions of conversations, interviews and songs but also the dictionary examples (together close to a third of the total) could be considered reductions of spoken data to text; the other genres being written texts from the start.
From Table 3 (also taken from de Schryver and Nabirye (2010: 100)) one may further deduce that most sources are recent to very recent, with over 98% produced during the past two decades.
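The distinction between tokens (running words) and types (orthographically different words) can be made concrete with a minimal sketch. The tokenisation rule and the lowercasing below are assumptions made for illustration only; corpus-query packages have their own configurable settings, and the counts reported in this article were not produced with this code.

```python
import re
from collections import Counter

# Assumption: a token is a run of letters, optionally joined by internal
# apostrophes or hyphens (so that a form like "Kodh'eyo" stays one token).
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def count_tokens_and_types(text: str) -> tuple[int, int]:
    """Return (number of tokens, number of types) for a plain-text string."""
    # Assumption: case is ignored when deciding whether two forms are the same type.
    words = [w.lower() for w in TOKEN_RE.findall(text)]
    return len(words), len(Counter(words))


# Placeholder English sentence, not corpus data.
sample = "the corpus grows and the corpus shrinks"
n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)
```

Because unnormalised spellings are counted as distinct types, such a count is sensitive to orthographic variation in a way that the token count is not.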

The 1.1m Lusoga corpus
Following the Lusoga noun study, and with the acquisition of more data to compensate for it, the dictionary data was again dropped from the Lusoga corpus. Although based on natural language production, the dictionary examples lacked the original context, and had in any case been 'selected' for their pedagogical value. As such, they did not have their place in a proper text corpus, that is, one that consists of large sections of free-flowing, running text. Instead, the symbolic oral section of about 6 000 tokens in the Lusoga 'noun corpus' was enlarged to well over 400 000 tokens. Furthermore, various texts translated from English, as well as digital-born Lusoga material, were also added, to obtain the corpus that was used for the study of the writing problems in a Lusoga corpus (Nabirye and de Schryver 2011). The composition of that new corpus is as shown in Table 4 (adapted from Nabirye and de Schryver (2011: 123)). This 1.1m Lusoga 'writing-problems corpus' - just as the earlier 0.9m Lusoga 'noun corpus' and the even earlier 0.5m Lusoga 'MA corpus' - was not annotated for any linguistic features. As such, these corpora were not tagged for parts of speech, nor lemmatised. They are known as 'raw corpora'.

The 1.7m Lusoga corpus
The latest iteration of the Lusoga corpus stands at over 1 700 000 tokens and about 200 000 types. The various text files of the 1.1m Lusoga 'writing-problems corpus' were cleaned up, re-assembled and renamed. New material was added for each genre except Journalism. For the latter, however, all the newspaper clippings were reprocessed with better software (cf. infra, §3.5.2). It is this version of the Lusoga corpus that we will now study in more detail.

Oral vs. written distribution
In contrast to the 0.5m Lusoga corpus, which had no transcribed text, and the 0.9m one with just 5 716 such tokens, a major effort in building the 1.7m Lusoga corpus went to expanding the oral component even further compared to the 1.1m Lusoga corpus. While the model of all modern corpora, the 100m British National Corpus (BNC 1994-2018), has set the standard for general-language corpora to contain 10% spoken material vs. 90% written material (Rundell and Stock 1992: 46), we managed to triple this conventional allocation of the spoken part in the total. In all, 216 audio files were transcribed, amounting to well over half a million tokens, as may be seen from Table 5, which corresponds to 31% of the total corpus, illustrated graphically in Figure 1. There is nothing magic about attaining over half a million words of spoken data,⁵ nor about reaching a division of a third for oral vs. two-thirds for written data, but for a language which to this date is chiefly an oral language, it simply looked like a necessity in order to ensure that any explanations drawn from this corpus would also reflect real language usage. The oral component is sizeable enough so as to feature in every screenful of concordance lines, where oral and written material is instantly juxtaposed and may be cross-compared to make sure there are no differences between oral vs. written language use that would need to be reported. What is true is that there is an addictive aspect to corpus building, so a goal was set to reach about '100 hours of audio'. Indeed, the 541 129 tokens of transcribed material correspond to exactly 98 hours, 42 minutes, and 38 seconds of audio files. Transcribing half an hour of audio took on average two hours, which means that roughly 400 hours were required for all the transcriptions (not counting the fieldwork and hours spent recording in the first place, nor the many hours to collect and log all the metadata and consent forms). The types of audio recorded and transcribed are varied, and include modern and traditional songs, radio talk shows, traditional ceremonies (as currently being performed), business meetings, interviews and dialogues.
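The transcription effort can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above (the audio length, the token count, and the average of two hours of work per half hour of audio); the variable names are our own, and the tokens-per-hour figure is a derived illustration rather than a number reported for the corpus.

```python
from datetime import timedelta

# Figures taken from the text above.
AUDIO_LENGTH = timedelta(hours=98, minutes=42, seconds=38)
ORAL_TOKENS = 541_129
HOURS_OF_WORK_PER_HALF_HOUR_OF_AUDIO = 2

audio_hours = AUDIO_LENGTH.total_seconds() / 3600

# Each half hour of audio costs about two hours of transcription work.
transcription_hours = (audio_hours / 0.5) * HOURS_OF_WORK_PER_HALF_HOUR_OF_AUDIO

# Derived density of the transcripts: tokens per hour of recorded audio.
tokens_per_audio_hour = ORAL_TOKENS / audio_hours

print(f"audio: {audio_hours:.1f} h")
print(f"transcription work: {transcription_hours:.0f} h")
print(f"tokens per hour of audio: {tokens_per_audio_hour:.0f}")
```

The estimate comes out just under 395 hours, which is what the rounded figure of 400 hours refers to.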

Source distribution
The bulk of the written part of the 1.7m Lusoga corpus was assembled through the digitisation of more or less every work, down to every snippet, ever written and published in Lusoga, whether commercially or produced as grey literature. A total of 85 sources were scanned in high resolution, after which the optical character recognition (OCR) tool of OmniPage (1995-2018) was utilised to turn the images into machine-readable texts.⁶ These 85 sources were good for about 670 000 tokens. OCR was also used to re-digitise large parts of the two short-lived Lusoga newspapers: Kodh'eyo: Busoga etebenkere (Kodh'eyo 1997-98) and Ndimugezi n'omukobere: The factfinder (Ndimugezi 1998-99). Due to the poor quality of the printing of these newspapers, the OCR output required substantial clean-up. The result was about 200 000 tokens of newspaper articles. A further 62 files were obtained electronically. These included self-published works found on the Internet, unpublished material from friends, private e-mail and mailing list communications, translations into Lusoga taken from government, NGO and commercial websites, as well as some religious material found online. All these texts together came to about 260 000 tokens. The translations we ourselves had made over the years, 15 of them, were also added, which contributed a further 25 000 tokens, as well as some of our own writings, six texts with just 2 500 tokens. The remainder consisted of low-resolution images of texts found online, as well as a single hand-written document, which were all retyped, adding another 25 000 tokens.
An overview of these various sources may be seen in Table 6.For a mostly undocumented and oral language like Lusoga, we must admit that we never expected to be able to reach nearly 1.2m tokens of material that had been written in one way or another.Extending the corpus building effort beyond the more obvious transcriptions and OCR, as seen in the last five bullets of Table 6, clearly helped in this regard (and in effect resulted in about a quarter of the written data).
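The rounded token counts given in the preceding paragraph can be tallied to verify the two summary claims (nearly 1.2m written tokens, about a quarter of them obtained by means other than OCR). The grouping into six categories below follows the running text and is an assumption; it need not coincide with the rows of Table 6.

```python
# Approximate token counts per written source category, as quoted in the text.
written_sources = {
    "scanned sources, OCRed (85 sources)": 670_000,
    "newspapers, OCRed": 200_000,
    "files obtained electronically (62 files)": 260_000,
    "own translations (15 texts)": 25_000,
    "own writings (6 texts)": 2_500,
    "retyped images and hand-written document": 25_000,
}

written_total = sum(written_sources.values())

# Everything that did not come through scanning plus OCR.
ocr_tokens = (written_sources["scanned sources, OCRed (85 sources)"]
              + written_sources["newspapers, OCRed"])
beyond_ocr_share = (written_total - ocr_tokens) / written_total

print(f"written total: {written_total:,} tokens")
print(f"share beyond OCR: {beyond_ocr_share:.1%}")
```

With these rounded inputs the total is 1 182 500 tokens and the non-OCR share is about 26%, consistent with both claims.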

Period distribution
As may be seen from the data presented in Table 7 and the bar chart shown in Figure 2, the 1.7m Lusoga corpus is essentially a synchronic corpus with a time-depth of just over 20 years. Only four files represent the 1940s, 1960s and 1980s.⁷ The 1990s and 2000s are equally represented, with about 400 000 tokens each, while the 2010s (and only up to August 2013 at that) are represented by as many as 850 000 tokens. While each of the past two periods and the present one cover both oral and written material, up to 70% of the transcriptions concern spoken data from the 2010s, which is the main reason why the 2010s contain more material than any other period. Another is the flurry of primers that were produced in the 2010s, in the wake of the recognition of Lusoga as a medium of instruction in 2005 (NCDC 2006: 5).

Even though a strict division between genre and topic is not always possible, and even though some files actually deal with various topics, the data shown in Table 9 may be considered to be a good approximation of the actual topics covered in the corpus. While a quarter of the Lusoga corpus deals with Religion, the inverse also means that three-quarters does not, which is fine given the usual bias in Bantu-language corpora. The topic Networking actually covers such varied items as newspaper texts, mailing-list messages, songs about networking, and even advertisements. The other topic labels are self-explanatory. The data is shown graphically in Figure 4. While the percentages for each of the broadly-defined topics as seen in Figure 4 may or may not reflect the actual allocation to each of these topics in the way Lusoga is used by millions of speakers on a daily basis in Busoga, what is relatively certain is that the coverage of the range and variation is rather wide in the 1.7m Lusoga corpus.

The orthography in the corpus
Important to observe at this point is that the various orthographies as seen in the original written sources were left intact. Bar a few exceptions, there are no tone markings in the corpus. This implies that the stated number of types (i.e. the orthographically unique words) is always slightly inflated compared to a corpus in which the spelling would have been homogenised. Working with a corpus that contains various spellings for some of the same words is not an insurmountable hurdle; it only means that one is dealing with some (evenly spread) noise as far as the type counts are concerned; the token counts, however, are (mostly) correct.
Although a number of Lusoga orthography guides exist, one must conclude that they did not have much impact on helping the different authors streamline their writing in Lusoga. But then, the majority of the texts which are now in the corpus were not necessarily meant for formal usage, so their authors did not adhere to a strict application of any orthographic rules. For example, biblical prayer books are in-house documents that are only employed for the purposes of religious teaching. The different short stories and the novel in Lusoga have all been produced informally and are written in a style that the authors feel is most appropriate at the time of writing. E-mails and website texts in Lusoga display a severely unregulated use of written Lusoga. Also, the type of written Lusoga found in this category of sources is often mixed with English. In addition, Lusoga is borrowing sounds from neighbouring languages, such as the palatal nasal [ɲ] which is not an indigenous Lusoga sound. One also notices a switch between the voiced labio-velar approximant [w] and the velar fricative [ɣ]; and the fact that the Lusoga dental sounds are being relegated to neighbouring alveolar sounds (which are easier to pronounce for non-native speakers). Most prominent is an ongoing discussion on whether Lusoga really has a trill [r], only a flap [ɾ], or neither of the two - which results in inconsistent uses of /r/ and /l/ in the orthography.¹⁰ Instances [...]

In the examples in (1) the author decided to write the dental nasal as /nhy/, the voiced labio-velar approximant as /hw/, and the voiceless palatal plosive as /c/, as well as making distinctions in writing the trill after /i, e/ and the lateral flap after /u/ and /a/. The orthographic problems seen in examples of this nature seem to arise out of a need to use a phonetic-inspired orthography. Such orthographic interpretations may simply be idiosyncratic improvisations made in the absence of a proper (and popular) phonetic description of the sounds of Lusoga.
On the other hand, the examples in (2) reflect a user who is continuously code switching, and missing out on a few basic grammatical forms in the writing system.This is probably due to ignorance or the lack of a proper grounding in writing Lusoga.
The type of issues seen in the two examples can be generalised as occurring rather often in the informal written texts included in the corpus.While the spelling of the original texts was left intact, recognition errors might have been introduced during the OCR process, with some of the letters being machine unreadable and interpreted differently, even though we did our utmost to read through the OCRed material.
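One recurrent OCR confusion, the letter pair 'rn' read in place of the single letter 'm', lends itself to a semi-automatic check. The sketch below is not the procedure used for the Lusoga corpus, where the OCRed material was read through manually; it merely flags tokens for which swapping one 'rn' for 'm' yields a form already attested in a trusted word list. The function names are hypothetical, and the tokens are English placeholders rather than corpus data.

```python
def rn_to_m_candidates(token: str) -> list[str]:
    """All forms obtained by replacing exactly one 'rn' in the token with 'm'."""
    lowered = token.lower()
    return [
        lowered[:i] + "m" + lowered[i + 2:]
        for i in range(len(lowered) - 1)
        if lowered[i:i + 2] == "rn"
    ]


def flag_rn_errors(tokens: list[str], known_words: set[str]) -> list[tuple[str, str]]:
    """Return (suspect token, suggested form) pairs for manual review."""
    flagged = []
    for token in tokens:
        if token.lower() in known_words:
            continue  # attested as it stands, e.g. a genuine 'rn' sequence
        for candidate in rn_to_m_candidates(token):
            if candidate in known_words:
                flagged.append((token, candidate))
                break
    return flagged


# Placeholder data: 'corn' is a genuine word, the other two are OCR misreadings.
known = {"modern", "corn", "time"}
print(flag_rn_errors(["rnodern", "corn", "tirne"], known))
```

Suggestions of this kind would still need to be confirmed by a reader, since a word list built from unnormalised texts may itself contain misreadings.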
It is also probable that some 'errors' were introduced during the transcription process: while we tried to steer away from it, there was a tendency to 'over-correct' misspoken sections and hesitations, as the goal of our corpus-building efforts is not to use the material for, say, sociolinguistic studies of detailed turn-taking, but to use the material to uncover language as it was meant to be (Hanks 2012: 416).¹¹ We do trust that these 'inconsistencies' and 'errors' have not obscured the proper usages of Lusoga.

Querying the corpus
The 391 files of the 1.7m Lusoga corpus are stored as plain text files, and as such this 1.7m Lusoga corpus is also a 'raw corpus'. Raw corpora may successfully be searched using off-the-shelf corpus-query software like WordSmith Tools (WST; Scott 1996-2018). WST was indeed used in this way to present the various corpus counts above, and will also be used for the macrostructural and microstructural illustrations in the next two parts of this set of three articles.
However, and as we will explain in Part 2, the 1.7m Lusoga corpus was also part-of-speech tagged and lemmatised for lexicographic purposes. Either or even both of these levels (i.e., the part-of-speech labels and/or the lemmas of each orthographic word) may also be added as tags to all (or part of the) 1.7m tokens of the Lusoga corpus. Software such as WST is able to handle such marked-up text files as well.
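What searching a raw corpus amounts to can be illustrated with a minimal keyword-in-context (KWIC) routine over plain text. This is a generic sketch and not a description of how WordSmith Tools works internally; the tokenisation rule, the function name and the sample sentence are assumptions made for illustration.

```python
import re

# Assumption: same simple tokenisation as one might use for counting words.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def kwic(text: str, keyword: str, width: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, hit, right context) for every match of keyword."""
    tokens = TOKEN_RE.findall(text)
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Placeholder English text, not corpus data.
sample_text = "the corpus grows and the corpus shrinks"
for left, hit, right in kwic(sample_text, "corpus", width=2):
    print(f"{left:>12} [{hit}] {right}")
```

In practice each concordance line would also carry the identifier of the file it came from, so that oral and written hits can be told apart at a glance.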

Corpus file IDs, corpus filename bibliography and corpus metadata database
As could be seen in examples (1) and (2), for material excerpted from the corpus, it is good practice to mention the source from which it was taken. In (1) this information was presented following all the examples, and in (2) this was done on the line following the interlinear glossing and translation. In all cases, the corpus details are presented between square brackets. In actual fact, for all material that is quoted from a corpus, whether for lexicographic purposes or more generally in corpus linguistics, three distinct levels of supplementary information may be provided for each source. At the quoted material itself a File ID may be provided, together with 'minimal information', here on whether the treated example is either taken from the written or the oral section of the corpus, and further information on the genre and topic, as well as the year or period, in the following format: [...]

The Filename also serves as the entry point to Addendum 1, where further details on each source may be found. The author (or for audio, performer) as well as the title of the work (either as published or as given by us), the number of types and tokens for the work, the source of the work, the place of publication and publisher, as well as the number of pages of the work (or for audio, length of the recording) are all provided in that addendum. The format used for the twelve slots of information in Addendum 1 is always as follows: [...]

This type of information includes what one would find in a traditional bibliography (before the first bullet, after the penultimate and last bullets), but adds corpus-specific information to that (all the rest in-between).

in the corpus metadata database
Addendum 1 is an extract from a larger database, which, for the written sources and when relevant, also includes the translator and date of translation, as well as the edition number and year of original publication. For the oral material, that database additionally includes the date of the recording, and the names of the recorders and transcribers. Lastly, for each source the standardised type-token ratio (with a base of 1 000) and the standard deviation thereof are also given.¹² A notes field is used for any additional information that needs to be mentioned. This corpus metadata database, which brings together all the metadata of the corpus in a structured format, is available electronically and may be consulted at BantUGent together with the corpus itself.
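The standardised type-token ratio mentioned above can be sketched as follows, following the description quoted from the WordSmith documentation in the endnotes: the ratio is computed afresh for each consecutive chunk of n running words and then averaged, and texts shorter than n words receive 0. Two points are assumptions on our part: the result is expressed as a percentage, and a trailing chunk shorter than n is ignored. The standard deviation, which the database also records, is left out of this sketch.

```python
def sttr(tokens: list[str], n: int = 1000) -> float:
    """Standardised type/token ratio (in percent) over consecutive n-word chunks."""
    if len(tokens) < n:
        return 0.0  # too short for a single full chunk
    ratios = []
    # Step through the text in non-overlapping chunks of exactly n tokens;
    # an incomplete final chunk is skipped (assumption).
    for start in range(0, len(tokens) - n + 1, n):
        chunk = tokens[start:start + n]
        ratios.append(len(set(chunk)) / n * 100)
    return sum(ratios) / len(ratios)


# Toy illustration with a base of 4 instead of 1 000.
toy = ["a", "b", "a", "b", "c", "d", "e", "f", "g"]
print(sttr(toy, n=4))
```

In the toy case the first chunk has 2 types in 4 tokens and the second 4 in 4, so the average is 75%; unlike a plain type-token ratio, the measure does not drop simply because a text is longer.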

Discussion
In this article we have given a detailed description of the building of a general-language corpus for Lusoga, an under-resourced Bantu language. We showed that it is indeed possible to reach a substantial size, in this case 1.7 million tokens, a third of which consists of oral data, even though the building of this corpus has basically been a one- to two-person effort. This stands in sharp contrast to, for instance, the ALLEX/ALRI corpora, for which scores of students were sent into the field and as many were enlisted to transcribe the recordings.
Our corpus is an 'organic corpus', as material has not only been added over the years, but some of it has also been taken away, while still other parts were replaced after being reworked. Merely having more data does not necessarily mean one has better data, as one should keep an eye on balance as well. In the overview presented in the present article, the 1.7m Lusoga corpus is a 'raw corpus', in that it has not been annotated; but it was pointed out that with the results from Part 2, part-of-speech tags and/or lemma tags could enrich this corpus linguistically.
We also illustrated the importance of knowing one's corpus, not only in terms of the oral vs. written distribution, but similarly with regard to the distribution of the sources, periods, genres, and topics. Variations on our presentation are of course possible, and indeed in the PhDs of Mberamihigo (2014), Nshemezimana (2016) and Misago (2018) for Kirundi, as well as the PhD of Kawalya (2017) for Luganda, three-dimensional graphs are shown in addition, the third dimension representing the diachronic aspects of their corpora. The point, however, is that a detailed description of a corpus is needed if one is to make intelligent use of it.
As the details in the addendum indicate, we further place particular importance on the metadata of a corpus. Metadata may evidently be put to good use when actually using a corpus: for lexicographic ends, but also far beyond in the wider discipline of linguistics. There are no doubt differences between the spoken and the written forms of a language, and certain phenomena may be realised slightly differently depending on the genre or topic, just as word use differs with register. The same holds for differences in word use depending on the author or performer, or even the publishing house of a certain work (each with their own style guide and own approach to copy-editing), and so on. Sub-corpora may indeed be assembled along such lines.
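Assembling sub-corpora along such lines is straightforward once the metadata is held in a structured form. The sketch below shows one way of doing so; the record layout is hypothetical and far simpler than the actual corpus metadata database, and the file identifiers, labels and token counts are invented placeholders, not entries from Addendum 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusFile:
    file_id: str  # placeholder identifier, not the File ID format of the corpus
    mode: str     # "oral" or "written"
    genre: str
    topic: str
    year: int
    tokens: int


def select(files: list[CorpusFile], **criteria) -> list[CorpusFile]:
    """Return the files whose metadata match every given field=value pair."""
    return [
        f for f in files
        if all(getattr(f, field) == value for field, value in criteria.items())
    ]


# Invented records, for illustration only.
files = [
    CorpusFile("F001", "oral", "Interview", "Networking", 2012, 1_200),
    CorpusFile("F002", "written", "Journalism", "Networking", 1998, 3_400),
    CorpusFile("F003", "written", "Biblical documents", "Religion", 1998, 5_000),
]

oral_subcorpus = select(files, mode="oral")
written_1998 = select(files, mode="written", year=1998)
print(len(oral_subcorpus), sum(f.tokens for f in written_1998))
```

The same selection mechanism extends to any other logged field, such as author, performer or publisher, provided it was recorded when the source entered the corpus.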
Reformulated, depending on how one intends to use a corpus, all the categorisations given so far may play an important role. But they do not inform each study in the same way. Within the field of lexicography, the first two and main uses of a corpus have to do with the creation of the macrostructure of a dictionary on the one hand, and the compilation of the articles in the microstructure on the other. These two topics will now be looked into, and illustrated for Lusoga, in two follow-up studies.
In our work Lusoga, as in all subsequent mentions of 'Lusoga corpus', narrowly refers to the Lutenga variety only (Nabirye et al. 2016).

2. At the CRC library in Jinja, a substantial amount of grey literature may also be found, either written by the CRC staff itself, or facilitated by them. These works are mostly for internal use, of a religious nature and typically do not have a stated publisher, but may be 'assigned' to the CRC (e.g., CRC 1998b, Kasozi 2000, CRC 2003a, b, c, 2005b, 2008, Wabugoyera et al. 2008, CRC 2010, 2012a, b, c, d, e, f, g). Other religious works often do not have publication years, such as Mwesigwa (s.d.), except for those published by The Bible Society of Uganda, for which, see Endnote 8. Lately, the CRC has begun rejacketing earlier works, including CRC (2009) and Kaluuba and Korse (2010). The CRC also played a pioneering role in producing the first grammars for Lusoga (Korse 1999, CRC 2004, Wambi et al. 2005, Kuunya 2011b), the first bilingual Lusoga-English dictionaries (Korse 2000, Gonza 2007), new orthographies (LULANDA and CRC 2001, 2004), as well as readers (e.g., Gulere and Wambi 2011).
3. At BantUGent a diachronic corpus for Swahili with a time-depth of up to two centuries is under construction. Research articles have not yet been published, however, although preliminary results have been presented at conferences (Devos and de Schryver 2013, 2016).
5. While not magic, Rundell and Stock (1992: 46) refer to this part of a corpus as the 'Holy Grail': 'Truly spontaneous speech, however - the everyday conversation of ordinary members of the public - has so far been available only in very small quantities and for lexicographers this remains the "Holy Grail".'
6. In earlier descriptions of corpus building for the Bantu languages, some attention was paid to the type of OCR errors one needs to attend to (de Schryver 1999: 116). Today's OCR software is however so performant that all one needs to remember is that the letter combination read as 'rn' should often be corrected to the single letter 'm'.
7. Observe that material for the 1980s was found after all, in an academic publication (Cohen 1986), following a memorable search (Nabirye 2016: 25-27). Although eventually published in 1986, this edited material is based on recordings made two decades earlier, in 1966-1967.

8. A late entrant - in the sense that it came too late to be added to the 1.7m Lusoga corpus (apart from the fact that it may not have been desirable for reasons of representativeness and balance) - is the full Bible in Lusoga, which became available in 2014 (BSU 2014). As is normally the case with biblical works, the full Bible (BSU 2014) incorporates the New Testament (BSU 1998), published earlier and included in the 1.7m Lusoga corpus. The New Testament itself incorporated the even earlier Gospel of Mark (BSU 1996), which in turn incorporated the still earlier Chapters 4 and 5 of the same gospel (BSU 1994). After the New Testament was released, at least one other edition appeared, with the addition of the Psalms from the Old Testament (BSU 2011).
9. The topic Language mainly includes material about teaching the language of Lusoga and instructional material for Lusoga (written in Lusoga), as well as website texts and journal abstracts on Lusoga (written in Lusoga).
10. See Nabirye et al. (2016) for more on these phonetic issues.
11. Or, as Kennedy (1998: 82) writes: 'A transcription is an imperfect written approximation of a speech event which exists initially as a dance of air molecules. The level of delicacy or amount of detail in a transcription is [...] related to the use to which the transcription will be put'.
12. As defined by Scott (1996-2018) 'the standardised type/token ratio (STTR) is computed every n words as Wordlist goes through each text file. By default, n = 1,000. In other words the ratio is calculated for the first 1,000 running words, then calculated afresh for the next 1,000, and so on to the end of your text or corpus. A running average is computed, which means that you get an average type/token ratio based on consecutive 1,000-word chunks of text. (Texts with less than 1,000 words (or whatever n is set to) will get a standardised type/token ratio of 0.)'.
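The STTR definition quoted from Scott can be illustrated with a minimal Python sketch. It is not WordSmith's own implementation: the function name `sttr` is hypothetical, the tokens are assumed to be already split and normalised, the result is returned as a fraction rather than a percentage, and the handling of a trailing chunk shorter than n (ignored here) is an assumption based on the wording 'consecutive 1,000-word chunks'.

```python
from typing import Sequence


def sttr(tokens: Sequence[str], n: int = 1000) -> float:
    """Standardised type/token ratio over consecutive n-token chunks.

    Hypothetical helper, sketched from the quoted definition only.
    Assumption: a trailing chunk shorter than n tokens is ignored.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")

    full_chunks = len(tokens) // n
    # Per the quoted definition, texts shorter than n tokens get 0.
    if full_chunks == 0:
        return 0.0

    ratios = []
    for i in range(full_chunks):
        chunk = tokens[i * n:(i + 1) * n]
        # Types are the distinct word forms, tokens the running words.
        ratios.append(len(set(chunk)) / n)

    # Average of the per-chunk type/token ratios.
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Toy illustration with n = 4; a real corpus would use n = 1000.
    toy = "a b c d a a a a x".split()
    print(sttr(toy, n=4))
```

In the toy example the first chunk has four types in four tokens and the second has one type in four tokens, so the average is 0.625; the final token falls outside any full chunk and is left out under the stated assumption.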

Addendum 1: Corpus filename bibliography for the 391 sources in the 1.7m Lusoga corpus
[Filename | W(ritten) or O(ral) • Genre • Topic • Year or Period]

Figure 1: Pie chart showing the oral vs. written distribution in the 1.7m Lusoga corpus

Figure 2: Bar chart showing the period distribution in the 1.7m Lusoga corpus

Figure 3: Pie chart showing the genre distribution in the 1.7m Lusoga corpus

Figure 4: Pie chart showing the topic distribution in the 1.7m Lusoga corpus

Table 1: Genre distribution in the 0.5m Lusoga corpus

Table 2: Genre distribution in the 0.9m Lusoga corpus

Table 3: Period distribution in the 0.9m Lusoga corpus

Table 5: Statistics for the oral vs. written distribution in the 1.7m Lusoga corpus

Table 6: Statistics for the source distribution in the 1.7m Lusoga corpus

Table 7: Statistics for the period distribution in the 1.7m Lusoga corpus

Table 8: Statistics for the genre distribution in the 1.7m Lusoga corpus

Table 9: Statistics for the topic distribution in the 1.7m Lusoga corpus