Corpus-based Lexicography for Lesser-resourced Languages — Maximizing the Limited Corpus

This article focuses on lesser-resourced languages for which only very limited corpora are available and how such relatively small and often unbalanced, raw corpora could be maximally utilized for lexicographic purposes to obtain similar results as for bigger corpora. Sepedi and Afrikaans will be studied in this regard. The aim is to determine to what extent enlarging a corpus from e.g. one to 10 million, and from 10 million to 100 million words enhances its potential for (a) macrostructure compilation, (b) sourcing information on the most important microstructural aspects and (c) the creation of lexicographic tools. It will be argued that valuable and even sufficient data for the compilation of a specific dictionary can be extracted from a relatively small corpus of approximately one million words but that "bigger" in some instances indeed means "better".


Introduction
The days when a corpus size of one million words, as in the groundbreaking first computer-readable general text corpus, the Brown Corpus of Standard American English, was regarded as an acceptable norm are long gone. Currently corpora for major languages typically run into hundreds of millions and even billions of words, for example Google Books with 155 billion for American English, 45 billion for Spanish and 34 billion for British English, and are typically referred to as "big corpora".
In many cases sincere attempts have been made at corpus design and at the compilation of balanced and representative corpora reflecting stratified speaker groups, e.g. in the compilation of the Brown Corpus. Different levels of corpus annotation and sophisticated corpus manipulation tools, e.g. Sketch Engine, Dante, Interactive Language Toolbox, WordSmith Tools and AntConc, have become the international norm and represent the typical scenario for major languages of the world.
This article, however, focuses on lesser-resourced languages for which only very limited corpora are available and on how such relatively small and often unbalanced, raw corpora could be maximally utilized for lexicographic purposes to obtain similar results in the absence of large corpora. It presents empirical research for Sepedi. English and Afrikaans corpora are used as measurement instruments to determine the power of limited corpora for lexicographic purposes.
"Big corpus" is a relative term. For lesser-resourced languages with a limited amount of printed material, such as many of the African languages, a corpus of 10 million words can be regarded as a "big corpus". The aim is to determine to what extent enlarging a corpus from e.g. one to 10 million, and from 10 million to 100 million words enhances its potential for (a) macrostructure compilation, (b) sourcing information on the most important microstructural aspects and (c) the creation of lexicographic tools. It will be argued that valuable and even sufficient data for the compilation of a specific dictionary can be extracted from a relatively small corpus of approximately one million words. The question is how much energy should be invested, for lexicographic purposes, in the maximum utilization of a limited corpus for macrostructural and microstructural compilation versus increasing the corpus size. Macrostructural compilation mainly concerns the compilation of the lemmalist, while microstructural aspects include sense distinction, collocations, idioms and examples of usage.

English, Afrikaans and Sepedi corpora
For the purpose of this study corpora for English, Afrikaans and Sepedi were used. For English the Pretoria English Internet Corpus (PEIC), consisting of 12 million words, and a subsection of approximately one million words were used. These corpora will be referred to as the 10m PEIC and 1m PEIC respectively. For Afrikaans a small section of the Media 24 archive for the newspaper Beeld, consisting of 119 million words, as well as two subsections consisting of approximately 10 million and one million words respectively, were used and will be referred to as 100m MED 24, 10m MED 24 and 1m MED 24 respectively.
For Sepedi a 10 million-word corpus and a one million-word subsection thereof were used and will be referred to as 10m PSC and 1m PSC respectively. The corpora and subsections of the corpora are schematically indicated and their exact sizes are given in figure 1.

Macrostructure
In Africa publishers normally restrict dictionaries to a very limited number of pages. A limit of 5,000 articles is often the norm, which by necessity puts the focus on commonly used words for inclusion in the dictionary. This study thus assumes that the basic/common words of a language are the ones most likely to be looked up in such a small dictionary, especially by learners of the language. These are the frequently used words typically marked by means of e.g. a star-rating system, filled diamonds and/or a different colour in dictionaries such as the Macmillan English Dictionary (MED) and the Collins COBUILD English Dictionary (COBUILD), e.g. car … *** (MED) and car … cars ♦♦♦♦♦ (COBUILD). MED states that a word marked with three stars is one of the most basic words in English. COBUILD, as indicated in table 1, states that the 1,900 most frequently used words in the language, marked with four or five filled diamonds, represent 75% of all written and spoken words in English and that the top 14,700 words account for 95% of English words.

[Table 1 column headings: Lemmas per category, Totals, % of all written and spoken English]
On the macrostructural level an evaluation was made of frequency lists compiled from the 1m PEIC and 10m PEIC for English, the 1m MED 24, the 10m MED 24 and the 100m MED 24 for Afrikaans, and the 1m PSC and 10m PSC for Sepedi. The most basic words in English, indicated with three stars (***) in MED, were used as a benchmark against the 1m PEIC and 10m PEIC English corpora. There are 2,275 three-starred words in MED. Of these words 2,203 occur in the 31,982-word frequency list culled from the 1m PEIC; thus an overlap of 96.8%. Since it is hardly feasible for a lexicographer to work through a frequency list of this size when compiling a lemmalist, a more realistic number of words was considered, i.e. the 11,559 words which occurred five times or more in the corpus. Of the three-starred words in MED, 2,061 remained, i.e. an overlap of 90.6%. This means that a lexicographer who only had a one million-word English corpus at his/her disposal, and who was willing to read through a list of 11,000 words, would be in a position to capture 90.6% of the most basic English words. A figure above 90% can surely be regarded as quite a significant achievement for such a small corpus.
This experiment was repeated for the entire 10m PEIC. Of the 2,275 three-starred words in MED, 2,272 appear in the 10m PEIC (only three do not: e-mail, long-term and no-one). With the exception of metre, which has a frequency of 1, all of these three-starred words have a frequency count higher than 10 and occur in the 118,202-word frequency list of the 10m PEIC; thus an overlap of 99.9%. Once again, a more realistic number of words was considered, i.e. the 11,161 words which occurred 65 times or more in the corpus. Of the three-starred words in MED, 2,191 remained. This means that a lexicographer who only had a 10 million-word English corpus at his/her disposal, and who was willing to read through a list of 11,000 words, would be in a position to capture 96.3% of the most basic English words. Once again, a relatively small corpus of 10 million words enabled the lexicographer to capture the most basic words. It is also significant that a tenfold increase in the corpus size from one million to 10 million only resulted in an increase of 5.7 percentage points in the three-starred words retained.
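The benchmark comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used in the study: the function names, the whitespace tokenization and the toy sentence are assumptions made for the example, and only the counts in the final loop are taken from the text.

```python
from collections import Counter

def frequency_list(tokens):
    """Count how often each word form occurs in a tokenized corpus."""
    return Counter(tokens)

def benchmark_coverage(freqs, benchmark, min_freq=1):
    """Return (hits, share): benchmark words with corpus frequency >= min_freq."""
    hits = sum(1 for word in benchmark if freqs.get(word, 0) >= min_freq)
    return hits, hits / len(benchmark)

# Toy corpus and toy benchmark (hypothetical, for illustration only).
tokens = "the car is in the car park and the dog is in the park".split()
benchmark = {"car", "dog", "house", "the"}
freqs = frequency_list(tokens)

print(benchmark_coverage(freqs, benchmark))              # (3, 0.75)
print(benchmark_coverage(freqs, benchmark, min_freq=2))  # (2, 0.5)

# Percentages for the counts reported in the text (2,275 three-starred words).
for hits in (2203, 2061, 2272, 2191):
    print(hits, round(100 * hits / 2275, 1))
```

Raising `min_freq` plays the role of the cut-off that shortens the frequency list to a length a lexicographer can realistically read through.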
Consider table 2.
In the absence of a benchmark for basic words, such as the three-starred words for English, an alternative approach and criterion for comparison had to be found. This was done by comparing the top frequencies in the 1m MED 24 with those in the 10m MED 24 and the 100m MED 24 in order to determine internal stability in terms of top frequencies or, formulated differently, to what extent the top frequencies differ when a corpus is enlarged from one to 10 to 100 million words. The ideal situation would be for the top frequencies to be identical, as schematically illustrated by the single centre dot in figure 2a. Table 3 illustrates the stability of the top 100 frequencies in the one million corpus versus the 100 million corpus. Only four items in the top 100 ranks of the 100 million corpus, i.e. 92. de, 94. geen, 95. Pretoria and 98. vanjaar, do not appear in the top 100 ranks of the one million corpus. Furthermore, the actual difference in the rank numbers is very small. For example, the rank numbers of 3 van, 4 het, 5 in, 8 is, 9 nie and 10 wat are identical in both corpora. For the top 100 ranks the average variation in rank positions is only 3.1%.
With the compilation of a dictionary of approximately 5,000 lemmas in mind, an arbitrary cut-off point was made at approximately the top 7,700 ranks in all three corpora. The aim is to determine which words likely to be looked for by the target user would be missed if only a one million corpus were available instead of a 10 million corpus, and if only a one million corpus were available instead of a 100 million corpus. In the one million Afrikaans corpus 7,737 words occur with a frequency of 11 or more. The closest matches in terms of frequency are 7,734 words with a frequency of 100 or more in the 10 million corpus and 7,733 words with a frequency of 1,081 or more in the 100 million corpus. The overlap between these selected sections of the frequency lists of the 1m MED 24 and the 10m MED 24 is 6,449 words, i.e. 83.4%. The overlap between these selected sections of the 1m MED 24 and the 100m MED 24 is 5,991 words, i.e. 77.5%. The question is how significant this presumed 22.5% "loss" is for the compilation of the lemmalist. Among the words occurring with a high frequency in the 100m MED 24 but missing from the selected section of the 1m MED 24 are Kersfees 'Christmas', koningin 'queen', toesig 'supervision', eksamen 'exam', koor 'choir', volk 'nation', aardbewing 'earthquake', skandaal 'scandal', digter 'poet', opskrif 'heading', strook 'strip', tjek 'cheque' and gogga 'bug'. The Afrikaans lexicographer would probably regard these words as likely to be looked for and as deserving a place in the dictionary.
For Sepedi the same procedure was followed in order to determine to what extent increasing a one-million word Sepedi corpus to a 10-million word corpus would enhance the quality of the lemmalist, i.e. to see which words likely to be looked for by the target user would be missed if only the 1m PSC were available instead of the 10m PSC. Consequently, the top 7,646 ranks occurring 8 times or more in the 1m PSC were compared to the top 7,622 ranks occurring 62 times or more in the 10m PSC. The overlap was 5,553 words, i.e. 72.8%. This means that 2,069 high-frequency words in the 10m PSC were missed by the 1m PSC. As for Afrikaans, words occurring with high frequency in the 10m PSC but not in the top 7,646 of the 1m PSC were considered. These words include bjalobjalo 'et cetera', diteng 'contents', seyalemoya 'radio', metara 'metre', semolao 'legal', kamano 'relationship', Bathobaso 'Black people' and komiti 'committee'. Once again it is likely that the Sepedi lexicographer would regard them as common words likely to be looked for which should be included in the dictionary.
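The comparison of top sections across corpus sizes can be illustrated with a short Python sketch. It assumes that each frequency list is held as a dictionary mapping a word form to its frequency; the function names and the toy frequencies are invented for the example, and only the counts in the last three lines come from the Afrikaans figures reported above.

```python
def top_section(freqs, min_freq):
    """Word forms whose frequency is at least min_freq."""
    return {word for word, count in freqs.items() if count >= min_freq}

def section_overlap(small, large):
    """Shared words, words of the large section missed by the small one, overlap share."""
    shared = small & large
    missed = large - small
    return shared, missed, len(shared) / len(large)

# Toy frequency lists with invented counts (not the MED 24 data).
small_freqs = {"van": 50, "het": 40, "koor": 12, "tjek": 3, "volk": 2}
large_freqs = {"van": 500, "het": 400, "koor": 90, "tjek": 120, "volk": 150}

small = top_section(small_freqs, min_freq=11)
large = top_section(large_freqs, min_freq=100)
shared, missed, share = section_overlap(small, large)
print(sorted(shared), sorted(missed), share)  # ['het', 'van'] ['tjek', 'volk'] 0.5

# Arithmetic behind the reported Afrikaans figures.
print(round(100 * 6449 / 7734, 1))  # 1m versus 10m overlap: 83.4
print(round(100 * 5991 / 7733, 1))  # 1m versus 100m overlap: 77.5
print(7733 - 5991)                  # words missed by the 1m section: 1742
```

The `missed` set is the part a lexicographer would inspect by hand, as was done above for words such as Kersfees and koningin.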

Microstructure
On the microstructural level the evaluation focused on the value of information drawn from limited corpora in terms of meaning, sense distinction, examples of usage, collocations and proverbs/idioms. Consider as a first example the randomly selected adjective great in Sketch Engine in figure 3. The top 20 combinations of great + a noun in column 1 were then compared to the collocations for great given in MED, the 1m PEIC and the 10m PEIC, as given in table 6. There were in total 1,709 occurrences of great in the 1m PEIC and 15,887 in the 10m PEIC. The 80% obtained for the 1m PEIC is significant for such a small corpus, but a corpus should provide more evidence to the English lexicographer for common combinations such as great fun, great care, great help and great significance, which are under-represented or missing in the 1m PEIC.
As a second example the senses of the verb count were studied in the 1m PEIC and the 10m PEIC. The senses distinguished in MED, given in table 7, were used as a benchmark. As in the case of the frequency lists, it is not feasible for the lexicographer to read through thousands of concordance lines generated for a specific keyword in context; 100 to 300 lines could be regarded as a reasonable number to consider for detecting senses and for finding typical collocations and authentic examples of use. The first deficiency encountered in the 1m PEIC was an insufficient number of concordance lines. For count only 66 concordance lines were found in the 1m PEIC, in contrast to 813 in the 10m PEIC. In the 10m PEIC a sufficient number of concordance lines was found for at least four out of the five senses listed in table 7, whereas the 1m PEIC offered no or insufficient information for all senses, with the possible exception of the first sense, 'to calculate'.
As for finding authentic examples of use, a one-million corpus proved to be quite valuable for commonly used words of the language and as such could go a long way in supplementing the lexicographer's intuition when compiling a relatively small dictionary. Consider, for example, the potential for good examples, even for the limited number of collocations great success, great care and great interest in table 6, that can be found in the concordance lines from the 1m PEIC given in table 9.
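The two operations relied on in this section, pulling a manageable number of concordance lines for a keyword and counting what follows it, can be sketched as below. This is a simplified stand-in and not a reproduction of Sketch Engine: it works on a plain token list, the function names and the toy sentence are hypothetical, and without part-of-speech tagging it counts any following word rather than only nouns.

```python
from collections import Counter

def concordance(tokens, keyword, width=4, limit=300):
    """Return up to `limit` keyword-in-context lines as (left, keyword, right)."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
            if len(lines) >= limit:
                break
    return lines

def right_collocates(tokens, keyword):
    """Count the word immediately following each occurrence of the keyword."""
    return Counter(
        tokens[i + 1].lower()
        for i in range(len(tokens) - 1)
        if tokens[i].lower() == keyword
    )

# Toy text (hypothetical, not taken from the PEIC).
tokens = "It was a great success and she took great care with great success".split()

for left, key, right in concordance(tokens, "great"):
    print(f"{left:>20} [{key}] {right}")

print(right_collocates(tokens, "great").most_common())  # [('success', 2), ('care', 1)]
```

The `limit` parameter mirrors the 100 to 300 lines regarded above as a reasonable reading load; on a one-million-word corpus the practical problem is more often that too few lines are returned.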

Lexicographic tools
As for the creation of lexicographic tools, the aim was to determine whether a relatively small corpus of one million words can be utilized to create useful tools such as rulers, block systems, indicators of spreading-across-sources, etc. So, for example, the aim was to see whether, in the absence of larger corpora, a one-million word corpus would be sufficient to build a sensible guide for the lexicographer for balancing alphabetical stretches in the dictionary, or whether larger corpora would contribute substantially to the refinement of such tools. Prinsloo and De Schryver (2002) introduced the concept of a measurement instrument for the relative length of alphabetical stretches in dictionaries and referred to it as a lexicographic ruler. Such a ruler guides the compiler of a dictionary to appropriately balanced alphabetical stretches in terms of overall length and the number of lemmas treated, i.e. not to over- or under-treat a specific alphabetic stretch in relation to the other alphabetic stretches. They indicate how, for example, a compiler could enthusiastically treat the first few alphabetic categories exhaustively but 'get tired' towards the end of the alphabet. Formulated differently, a lexicographic ruler tells the compiler when alphabetic stretch 'A' has been sufficiently treated, i.e. when it is time to move on to 'B'.
So, for example, Prinsloo and De Schryver (2003: 110) give a schematic illustration of a ruler for Afrikaans in figure 4. This ruler indicates at a glance that e.g. B, K, O, S and V are relatively big stretches in Afrikaans whilst C, F, J, X, Y and Z are small. Figure 4 also gives a basic indication, in terms of percentage, of progress through the alphabetic stretches moving from A to Z. For example, M roughly represents the middle of the dictionary, and concluding S means reaching the 80% stage of compilation. They performed a formal breakdown of the ruler into percentages to guide dictionary compilation, referred to as a block system. Consider, for example, the block system for Setswana in figure 5.
Rulers are calculated by determining the percentage of words in each alphabetic category from an alphabetic list of words culled from a corpus. This simply means counting how many words start with a, b, c, … z. The same data is used for calculating a block system, but instead of the 26 letters of the alphabet, the list is broken down into 100 sections, each representing 1%.
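The calculation of a ruler and a block system can be sketched as follows. This is a minimal illustration under stated assumptions: the word list is reduced to distinct lower-cased word forms, Python's default string ordering stands in for the alphabetical order of the language (which would need adjusting for letters such as š), each block is labelled with the first two letters of the word reached, and the function names and toy word list are hypothetical.

```python
from collections import Counter

def word_types(words):
    """Distinct, lower-cased word forms in alphabetical (code point) order."""
    return sorted({word.lower() for word in words if word})

def ruler(words):
    """Percentage of word types starting with each initial letter."""
    types = word_types(words)
    counts = Counter(word[0] for word in types)
    return {letter: 100 * n / len(types) for letter, n in sorted(counts.items())}

def block_system(words, blocks=100, prefix_len=2):
    """Prefix of the word reached after each 1/blocks share of the sorted list."""
    types = word_types(words)
    if not types:
        return {}
    marks = {}
    for b in range(1, blocks + 1):
        index = (b * len(types) + blocks - 1) // blocks - 1  # ceil(b * n / blocks) - 1
        marks[b] = types[index][:prefix_len].upper()
    return marks

# Toy word list (hypothetical); five blocks instead of 100 to keep the output short.
words = ["apple", "ant", "bird", "cat", "cow", "dog", "egg", "emu", "fox", "goat"]
print(ruler(words))
print(block_system(words, blocks=5))  # {1: 'AP', 2: 'CA', 3: 'DO', 4: 'EM', 5: 'GO'}
```

With `blocks=100` each entry of the result says which sub-stretch the compiler should have reached after that percentage of the project.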
The question here is whether a ruler compiled from a one-million word corpus could provide a reliable ruler when compared to a 10 million corpus.In table 10 the breakdown of words into alphabetical stretches of both the 1m PSC and the 10m PSC is given.Columns 3 and 5 reflect the percentage breakdown per alphabetical stretch in the 1m PSC versus the 10m PSC and the difference between these percentages is given in column 6.The final column indicates that the difference between the rulers is very small with the difference in all stretches less than 1%.The similarity is visually illustrated in figure 6 where the two lines of the graph are very close to each other.
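The comparison of two rulers amounts to taking the absolute difference per alphabetical stretch and checking the largest one. The sketch below assumes that each ruler is a mapping from a letter to a percentage; the function name and the percentages are invented for illustration and are not the values of table 10.

```python
def ruler_differences(ruler_a, ruler_b):
    """Absolute difference in percentage per alphabetical stretch."""
    letters = sorted(set(ruler_a) | set(ruler_b))
    return {
        letter: abs(ruler_a.get(letter, 0.0) - ruler_b.get(letter, 0.0))
        for letter in letters
    }

# Hypothetical rulers for a small and a large corpus (not the PSC data).
ruler_small = {"a": 5.2, "b": 12.9, "m": 14.1}
ruler_large = {"a": 5.6, "b": 12.4, "m": 14.3}

differences = ruler_differences(ruler_small, ruler_large)
for letter, diff in differences.items():
    print(letter, round(diff, 1))

# The rulers agree closely if every stretch differs by less than 1%.
print(max(differences.values()) < 1.0)
```

A stretch present in only one ruler is treated as 0% in the other, so a missing letter shows up as a large difference rather than being silently ignored.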

Conclusion
In this article it has been argued that raw corpora built only from written data, although not reflecting an ideal situation, can substantially assist the lexicographer in the compilation of especially small bilingual and monolingual dictionaries.
On the macrostructural level a corpus of one million words is useful to pinpoint the most commonly used words in the language and would be a useful tool for the lexicographer tasked with the compilation of a relatively small dictionary of approximately 5,000 lemmas. Additional common words will, however, have to be found. Consider in this regard the high-ranking words in the 100m MED 24 mentioned above which were not found in the 1m MED 24. The lexicographer will have to find such words through other means, e.g. introspection, field work, and reading and marking. If a one million corpus is extended to 10 million words, the offering of commonly used words in the top frequency ranks becomes more reliable and represents a gradual enhancement. If the corpus is further extended to 100 million words, the frequently used words provide a reliable account of the commonly used words in the language and little additional collection is required from the lexicographer for a small dictionary.
As far as microstructural elements are concerned, it is clear that a one million corpus is useful in determining the basic senses of a word as well as typical examples of usage of these basic senses. Such a corpus would typically include a limited number of idioms. Increasing the corpus to 10 million words gradually improves the situation in the sense that more senses are detected, more idioms can be found and more evidence on the use and meaning of such words and idioms is available.
As for lexicographic tools, the results clearly indicate that reliable lexicographic rulers and block systems could be compiled from a corpus as small as one million words. In this case enlarging the corpus to 10 million did not substantially enhance the quality/accuracy of the tool.
In conclusion it could be recommended that the lexicographer should carefully analyse the situation for each specific language. If no written sources are available, (s)he should attempt to compile, say, a one-million token corpus of the spoken language. If a limited number of written sources are available, (s)he should try to compile a 10 million corpus, and if sources are available in abundance, especially in electronic format, a 100 million corpus will be extremely valuable.

Figure 1: Corpora and sub-corpora used for English, Afrikaans and Sepedi

For the Afrikaans experiment the aim was to see to what extent increasing a one-million word corpus to 10 million and again to a 100-million word corpus would enhance the quality of the lemmalist in terms of the most basic words of Afrikaans.

Figure 2: Possible scenarios of overlap in top frequencies

Figure 4: A lexicographic ruler for Afrikaans

Figure 5: A block system for Setswana

Figure 6: A ruler graph for 1m PSC versus 10m PSC
The same similarity is observed in the breakdown in the block systems calculated from the 1m PSC versus the 10m PSC in table 11.

Table 1: Summary of frequency band values in COBUILD (p. xiii)

Table 3: Top 100 ranks in 100m MED 24 versus 1m MED 24
This means that 1,742 words, i.e. 22.5% of the selected top section of the 100 million corpus, would not have been available for consideration if the lexicographer only had the one million corpus available. Likewise, 1,285 words, i.e. 16.6% of the selected top section of the 10 million corpus, would not have been available.

Table 4: Comparison of top frequencies in the 1m MED 24, 10m MED 24 and 100m MED 24

Table 5: Comparison of the top frequencies in 1m PSC and 10m PSC
http://lexikos.journals.ac.za

Table 7: Verbal senses of count in MED compared to their occurrence in 1m PEIC and 10m PEIC
As a third example, consider three randomly selected Sepedi idioms in table 8: monna ke nku (o llela) teng 'a man is a sheep (he cries inside)', bana ba tau (ga re jane) 'children of a lion (we do not eat each other)' and go sepela ke go bona 'to travel is to see (become experienced)'.

Table 8: Occurrence of idioms in 1m PSC versus 10m PSC
From table 8 it is clear that these idioms do occur in a one million corpus, although in limited numbers, but that the lexicographer is more likely to detect them in a bigger corpus such as the 10m PSC.

Table 9: Concordance lines for great success, great care and great interest in 1m PEIC

Table 11: Sepedi block systems: 1m PSC versus 10m PSC
So, for example, both block systems indicate that the compiler should be at the sub-stretch ID after 30% of the available time and resources for the project, at MA after 50%, SE after 80%, etc. All of the other comparative blocks are alphabetically very close to each other.