A Perspective on the Lexicographic Value of Mega Newspaper Corpora — The Case of Afrikaans in South Africa

The aim of this article is to assess the potential use of a mega newspaper corpus, the Media24 archive, in the absence of large balanced and representative corpora, for the compilation of major general dictionaries for Afrikaans. Firstly, an evaluation of Media24 against the lemmalists of both a major single-volume and a multi-volume monolingual dictionary for Afrikaans is undertaken to determine to what extent Media24 correlates with the lemmalists of major dictionaries. Secondly, the strength/suitability of Media24 for lemma selection in categories other than newspapers is evaluated. Finally, it is determined what the contribution could be of Media24 to lexical sense distinction, selection of examples of usage, and typical collocations.


Introduction
The aim of this article is to assess the contribution that a mega newspaper corpus, in the absence of large balanced and representative corpora, can make to the compilation of major general dictionaries for Afrikaans.
Afrikaans lexicography finds itself in a situation where (a) a number of excellent major dictionaries are available (not traditionally based on corpus material), (b) no large balanced and representative corpora exist, but (c) a mega newspaper archive estimated at 1 000 000 000 tokens (a thousand million) can be consulted.
In this article, the Afrikaans Media24 archive is subjected to three tests in order to determine its effectiveness for the compilation or review of major Afrikaans general dictionaries.
The first test is an evaluation of the newspaper corpus on a macrostructural level against the lemmalists of a modern, major single-volume monolingual dictionary for Afrikaans, i.e. the 5th edition of the Handwoordeboek van die Afrikaanse Taal (HAT), and of the 4th volume of the multi-volume Woordeboek van die Afrikaanse Taal (WAT). The intention is to establish to what extent this media archive can be used as a source for inclusion versus omission of lemmas in the revision (or compilation) of major Afrikaans dictionaries. Firstly, lemmas in HAT and WAT are compared to Media24 in order to determine to what extent a word list culled from Media24 matches the existing lemmalists of a major single-volume dictionary such as HAT and a multi-volume comprehensive dictionary such as WAT. Secondly, an attempt is made to determine to what extent a word list culled from Media24 is suitable as an aid to inclusion or omission in future versions of these dictionaries. The question is, therefore, whether such a word list indicates which lemmas could be added to current lemmalists and whether non-occurrence could suggest the need for the omission of certain lemmas from existing dictionaries. Finally, it is suggested that frequency counts over three decades can assist the lexicographer in deciding on inclusion or omission.
The second test evaluates the suitability of Media24 for lemmas most likely to be looked for by the target users of a general dictionary in categories not intensively covered by newspapers, for instance religion, skills, hobbies, government, house organs and fiction, which are covered in the Brown/LOB corpora as separate categories (cf. Table 1 below). Newspapers report on these fields, but the question is whether the coverage of such items is sufficient for lexicographic purposes. The purpose is therefore to determine to what extent terms from these fields, which can be presumed not to be generally associated with newspaper reporting, are covered by the Media24 newspaper archive. The randomly selected categories are gardening, quilting and embroidery. The last two contain precise subject-specific terminology and therefore pose an implicit challenge in terms of coverage by a general newspaper corpus.
The third test, on the level of the microstructure of dictionaries, aims to determine the value of Media24 as an aid in sense distinction, selection of examples of usage, and typical collocations. The question is whether a presumed bias towards typical 'newspaper senses' versus more 'general senses' impedes the value of Media24 in comparison to general corpora.
A brief description of WAT, HAT and Media24 will be given, followed by a calculation of the size of the Media24 archive.

Balance and representativeness as essential but problematic aspects in corpus creation
The debate as to what entails valid/ideal/balanced/representative corpora and whether it will ever be possible to compile such corpora is ongoing (cf. Biber 1993, Summers 1993, Kilgarriff 1997, Kennedy 1998, Kruyt and Dutilh 1997, Otlogetswe 2007 and Atkins and Rundell 2008 for detailed discussions). A few excerpts serve to illustrate these lexicographic concerns.
Questions associated with 'representativeness' and 'balance' are complex and often intractable. (Kennedy 1998: 62.)
A general corpus is typically designed to be balanced, by containing texts from different genres and domains of use including spoken and written, private and public […] For a corpus to be 'representative' there must be a clearly analysed and defined population to take the sample from. (Kennedy 1998: 20, 52.)
What we mean by representative is covering what we judge to be the typical and central aspects of the language, and providing enough occurrences of words and phrases for the lexicographers […] to believe that they have sufficient evidence from the corpus to make accurate statements about lexical behaviour. (Summers 1993: 186, 190.)
[…] to be representative of general language. This is a bold ambition - some say one that is impossible to fulfil. (Summers s.d. [1996-1998]: 6.)
COBUILD have always insisted that it is impossible to create a corpus that is truly representative of the language, and have focused on size of corpus rather than balance. (Kilgarriff 1997: 150.)
Lexicographers traditionally aim at a 'representative' or 'balanced' corpus, that is, the corpus should be appropriate as the basis for generalizations concerning the language as a whole. (Kruyt and Dutilh 1997: 230.)
Scholars even differ in their interpretation of the terms. This debate, however, is beyond the scope of this article; the issue at stake here is simply whether a 1 000 million-word newspaper archive can be regarded as a suitable main source for the compilation of major Afrikaans dictionaries.
Pioneering corpora such as the Brown Corpus of Standard American English and the Lancaster-Oslo/Bergen Corpus (LOB) were carefully designed. The Brown Corpus is a carefully compiled selection of American English, totalling approximately a million words drawn from a wide variety of sources, with each sample containing 2 000 words. The corpus was sampled from 15 text categories given in Table 1. See MacLeod and Grisham (2000) for the case of adding a vast amount of newspaper data to the Brown Corpus. They indicate how an increase in the Brown Corpus of 1 329% (thus more than thirteen times) resulted in a skewed or inadequate corpus, e.g. in the representation of business-related words such as sell, rise, buy, pay, and increase.
Newspaper texts also contain words belonging to a slightly higher register; cf. arts (instead of dokter/geneesheer) 'doctor', baar (kraam/geboorte gee) 'give birth'. On the other hand, these texts also contain words belonging to an informal register; cf. for example herrie (oproer/rusie/ontevredenheid) 'uproar/quarrel/dissatisfaction', grondgryp (grondonteiening) 'land seizure', and blaser (skeidsregter) 'referee'. Both these types are uncommon in everyday written and oral communication. The use of such words on newspaper banners or in headlines contributes to attracting attention, and the words are usually shorter than their equivalents, so that they fit into limited space. They are apparently much less frequently used in non-newspaper corpora. Preliminary tests indicate that herrie is used ten times more often in Media24 than in a 4 million-token test corpus consisting of Afrikaans literary works. Likewise, no occurrence of blaser referring to a referee could be found in the test corpus. The real potential corpus-skewing factor of such words should, however, be determined by more detailed studies.
It could also be argued that a growing newspaper corpus, such as Media24, partially qualifies for what Atkins calls an organic corpus, at least as far as the 'growing part' is concerned.
A corpus builder should first attempt to create a representative corpus. […] the corpus is enhanced by the addition or deletion of material [...] This is the way to approach a balanced corpus. One should not try to make a comprehensive and watertight listing […] rather, a corpus may be thought of as organic, and must be allowed to grow and live if it is to reflect a growing living language. (Atkins 1997, personal communication at Salex'97 (Atkins et al. 1997).)
Building neatly designed corpora, such as the Brown Corpus, was also envisaged for African languages and Afrikaans when corpus creation for these languages commenced in 1990. For the African languages this was not possible, because many of the categories, such as the three press sections, simply do not exist: most of the languages do not even have a single newspaper and some in fact have very limited printed matter. For these languages, a more organic approach (cf. Atkins et al. 1997) was followed. For Afrikaans the situation was more favourable, but no attempt was ever made to build a large corpus, for example along the lines of the Brown/LOB design. An organic corpus of 10 000 000 tokens was compiled at the University of Pretoria, but this corpus is dwarfed by the Media24 newspaper archive, estimated at more than 1 000 million tokens.
The question remains, however, to what extent growing in size also means growing in representativeness or in what Leech terms its diversity.
The value of a corpus as a research tool cannot be measured in terms of brute size. The diversity of the corpus, in terms of the variety of registers or text types it represents, can be an equally important (or even more important) criterion. (Emphasis in the original.) (Garside et al. 1997: 2.)
Regardless of the corpus size, a corpus that is systematically selected from a single register cannot be taken to represent the patterns of variation in an entire language; […] corpora representing the full range of registers are required. […] it is important to design corpora that are representative with respect to both size and diversity. However, given limited resources for a project, representation of diversity is more important for these purposes than representation of size. (Biber 1995: 131.)
What is important, therefore, is to estimate the value of the Media24 archive for Afrikaans lexicography. Is its 'brute size' also representative of the varieties of registers or non-newspaper categories in, for example, the design of the Brown Corpus?

The Media24 archive
The Media24 archive is a searchable database of Afrikaans media reports available at http://152.111.1.251/cgi-bin/s.cgi. Media24 contains, among others, the newspapers Rapport, Beeld, Volksblad and Die Burger, available in electronic format for the past two to three decades. A range of search functions is supported, such as searches for single words and for fixed and semi-fixed phrases, as well as the use of certain Boolean operators. Hits are presented as full reports as they were published in the newspapers, up to a maximum of 50 at a time. This also means that a report can contain more than one occurrence of a word. As for the size of Media24, no authoritative figure is available, and evaluating Media24 frequencies is difficult if the size of the corpus is unknown. An attempt was therefore made to calculate its approximate size in a simplistic way before comparisons with HAT and WAT were made.
A random selection of 18 words was used for the calculation of the size of the Media24 archive given in Table 2. Three statistics were used for this calculation: (a) the total count of each word in a 750 million-word subsection of the corpus (exact size: 749 553 152 tokens), Column 2; (b) the number of newspaper reports in the 750m subcorpus containing the word, Column 3; and (c) the number of newspaper reports in the entire archive containing the word, Column 5. First, the relation between the number of media reports in which a specific word occurs and the total number of occurrences of the word in all reports in the 750m subcorpus was calculated (Column 4). In the case of die, for example, the number of reports is only 6% of the total count for die, i.e. die occurs very frequently in each report (more than 50 million times in fewer than 3 million reports). This relation was then used to calculate the total count of each word in the entire Media24 archive, based on the number of reports in the entire archive (Column 6).
A basic correlation value between the counts in the 750m subcorpus and its total size of 750m was then calculated for each word by dividing 750m by the total count of that word in the subcorpus (Column 7).
This correlation value was finally used to calculate the size of the Media24 archive by multiplying it by the calculated total count in Column 6.
Thus, for each of the 18 keywords, a corpus size slightly exceeding 1 000 000 000 tokens (one thousand million) was independently obtained (Column 8).
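The calculation described above can be sketched in a few lines of Python. This is only an illustration of the steps named for Columns 4 and 6 to 8: the function name and parameter names are hypothetical, and the figures in the usage line are invented round numbers (loosely modelled on the description of die), not values from Table 2.

```python
SUBCORPUS_SIZE = 749_553_152  # exact size of the 750m subcorpus, as given in the text


def estimate_archive_size(count_sub, reports_sub, reports_all, sub_size=SUBCORPUS_SIZE):
    """Estimate the archive size in tokens from the statistics for one keyword.

    count_sub   -- occurrences of the word in the subcorpus (Column 2)
    reports_sub -- reports in the subcorpus containing the word (Column 3)
    reports_all -- reports in the entire archive containing the word (Column 5)
    """
    # Column 4: number of reports relative to number of occurrences
    reports_per_occurrence = reports_sub / count_sub
    # Column 6: projected total count of the word in the entire archive
    projected_count = reports_all / reports_per_occurrence
    # Column 7: subcorpus size divided by the count of the word in the subcorpus
    correlation_value = sub_size / count_sub
    # Column 8: estimated size of the entire archive
    return correlation_value * projected_count


# Hypothetical figures, for illustration only
print(round(estimate_archive_size(50_000_000, 3_000_000, 4_000_000)))
```

Written out, the steps reduce to the subcorpus size multiplied by the ratio of reports in the entire archive to reports in the subcorpus, so the estimate rests on the assumption that a word occurs equally often per report inside and outside the subcorpus.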

HAT and WAT
HAT is the 5th edition of the Verklarende Handwoordeboek van die Afrikaanse Taal, containing more than 50 000 lemmas. WAT, the Woordeboek van die Afrikaanse Taal, is a multi-volume explanatory dictionary currently published up to the letter R (13 volumes).

Comparison of types in the Media24 archive to the lemmalists of HAT and WAT
For the first test, a random sub-stretch of the arbitrarily selected alphabetical stretch 'I' was selected, i.e. ideaal to idioot. In strictly alphabetical order, there are 153 lemmas in this stretch in WAT and HAT taken together. WAT has 147, HAT 48 and they have 42 in common.
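The three lemma counts can be cross-checked against the stated total with a small Python sketch. The function name and dictionary keys are hypothetical; only the three counts come from the text.

```python
def overlap_summary(n_a, n_b, n_common):
    """Split two lemmalists into exclusive parts and compute their combined size."""
    return {
        "only_a": n_a - n_common,        # lemmas found only in the first list
        "only_b": n_b - n_common,        # lemmas found only in the second list
        "union": n_a + n_b - n_common,   # lemmas in the two lists taken together
    }


# WAT has 147 lemmas in the stretch, HAT 48, and 42 are shared
print(overlap_summary(147, 48, 42))
```

With these counts, 105 lemmas occur only in WAT and 6 only in HAT, and the combined total is 153, which agrees with the figure given for WAT and HAT taken together.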
From these comparisons, it is clear that the Media24 archive covers not only all frequently used lemmas in the dictionaries but also a significant number of low-frequency lemmas. Some lemmas in the dictionaries with zero occurrences in the archive could therefore be considered for omission in a forthcoming revision. Likewise, certain words in the archive could be considered for inclusion in the dictionaries (given general considerations for lemma inclusion, such as the self-explanatory nature of some morphologically complex words), such as identiteitloos (67) 'without identity', identiteitloosheid (21) 'state of being without identity', identiteitlose (62) 'being without identity', identiteitsboek (164) 'identity book', identiteitsboeke (125) 'identity books', identiteitsboekie (409) 'small identity book', identiteitsboekies (323) 'small identity books', identiteitsfoto (20) 'identity photo', identiteitsfoto's (48) 'identity photos', identiteitsnommer (690) 'identity number', identiteitsnommers (255) 'identity numbers'. These words could be lemmatised as identiteitloos, identiteitsboek, identiteitsfoto and identiteitsnommer. In order to gain an impression of their frequency trajectories over two decades, the total counts of these words are expressed per 50 million tokens for the five-year periods ending in 1989, 1994, 1999 and 2003 respectively in Table 3, and graphically illustrated in Figure 1. The frequency counts in Table 3 and the trajectories in Figure 1 suggest inclusion of these lemmas in major Afrikaans dictionaries.
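Expressing counts per 50 million tokens makes periods of different size comparable. A minimal Python sketch of such a normalisation follows; the function name is hypothetical, and the period size and raw count in the usage line are invented, since the token totals per five-year period are not given in the text.

```python
def per_50_million(raw_count, period_tokens, base=50_000_000):
    """Scale a raw count to a frequency per 50 million tokens."""
    return raw_count * base / period_tokens


# Hypothetical example: 120 occurrences in a period of 200 million tokens
print(per_50_million(120, 200_000_000))
```

A rising normalised frequency across successive periods would then support inclusion of a lemma, independently of how much the archive itself grew in each period.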

The repetition factor in Media24
Frequency counts of words in a media corpus can be questioned on the basis of potential repetition of the same phrases in, for example, regional releases of the same reports, or stereotypical repetitions of a word/phrase. In order to determine the extent and nature of repetition in Media24, concordance lines were generated for the randomly selected words given in Table 5. The concordance lines generated for each word from reports in Media24 were grouped and counted to determine the number of duplications. Consider the instances of repetition of fixed phrases containing the word blok 'block' in Table 4. For example, up to 53 repetitions of the line "… Die woorde hieronder kom voor in die blok met letters …" '… The words below occur in the block with letters …' occur, as indicated in Table 4.
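The grouping and counting of identical concordance lines can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used for Tables 4 and 5: the function name is hypothetical, lines are treated as identical after lowercasing and collapsing whitespace, and every occurrence beyond the first of an identical line is counted as a duplication.

```python
from collections import Counter


def duplication_factor(lines):
    """Return (number of duplicated lines, percentage of all lines)."""
    # Assumption: lines differing only in case or spacing count as identical
    normalised = [" ".join(line.lower().split()) for line in lines]
    counts = Counter(normalised)
    total = len(normalised)
    # Every occurrence beyond the first of an identical line is a duplication
    duplicates = total - len(counts)
    percentage = 100.0 * duplicates / total if total else 0.0
    return duplicates, percentage


# Hypothetical concordance lines, for illustration only
sample = [
    "Die woorde hieronder kom voor in die blok met letters",
    "die woorde hieronder kom voor in die  blok met letters",
    "'n blok van die straat is afgesper",
    "die blok is uit hout gesny",
]
print(duplication_factor(sample))
```

Exact matching of this kind only finds verbatim repeats; near-duplicates, such as a report that was lightly edited for a regional issue, would need a looser comparison.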

Evaluation of the media corpus on microstructural level
The third test aims to determine whether concordance lines culled from a major media corpus could provide sufficient aid to the compiler of a major dictionary on a microstructural level. The focus is on the contribution towards sense distinction, authentic examples and collocations, all typically regarded as areas where the corpus gives valuable support to the lexicographer in the compilation of the dictionary article (cf. De Schryver and Prinsloo 2000). The polysemous words borduur, steek, patroon, knop and blok were selected from the Quilting and Embroidery list, and their treatment in HAT and WAT as well as their use in context in Media24 were studied.

Table 9: Senses of blok in WAT and HAT and occurrences in Media24

Media24 occurrences were found in support of six of the nine senses given in HAT and eight of the 26 senses given in WAT. Once again the evidence suggests that the Media24 archive could be a sufficient tool for the compilation of a major dictionary but insufficient as sole corpus for the compilation of a dictionary of the magnitude of WAT.

Conclusion
It can be concluded that in the current situation where no large designed corpus for Afrikaans exists, the Media24 archive is an excellent substitute. In fact, its value goes far beyond a limited component of a corpus design pattern, i.e. 'press'. The Media24 archive is so vast and versatile that it can be regarded as a world of information in its own right, and its success in terms of broad coverage can probably be attributed to the fact that virtually all aspects of modern life in South Africa are covered in the daily reporting of these newspapers.
The question could even be asked whether the stage has not been reached in corpus-based lexicography where media coverage is so comprehensive in reporting on all spheres of everyday life that mega newspaper corpora have indeed become a world in one medium, i.e. a corpus that is sufficient, or at least goes a long way, as a basis for the compilation of general dictionaries.
[...] utilised by Prinsloo and Gouws (2006) to express the increasing number of tokens.

Dictionaries and corpora
million-word subsection of the corpus (exact size: 749 553 152 tokens); Column 2, (b) number of newspaper reports containing each of these 18 words in the 750m subcorpus; Column 3, and (c) number of newspaper reports in the entire archive containing each of these 18 words; Column 5.

Table 1 :
Design of the Brown and LOB corpora

It could be assumed that newspaper texts represent a specific, almost homogeneous subtype that can easily skew a balanced corpus if newspaper texts are added in large quantities (cf. MacLeod and Grisham 2000).

Table 2 :
Calculation of the size of the Media24 archive http://lexikos.journals.ac.za

Table 3 :
Total counts (lemmas and derivations) expressed per 50 million tokens in Media24

Table 4 :
Most frequent repeated concordance lines for blok in Media24

Concordance lines for six more randomly selected words were generated (Table 5). The final column of Table 5 indicates that the percentage of duplication for these words ranges from 5% to 14%.

Table 5 :
Duplication factor of words in Media24

Reports are repeated in sister newspapers, regional issues of the same newspaper or sequentially over a period of time. So, for example, concordance lines generated for ideologies 'ideological' rendered a number of identical lines for ideologies in the context Die "ou manne" wat op alle samelewingsvlakke fanaties, krampagtig en ideologies apartheid bedink en bevorder het, moet gekonfronteer word (The 'old men' who on all levels of coexistence, fanatically, desperately and ideologically conceptualised and promoted apartheid should be confronted) (Beeld 29 December 2000, Die Burger 30 December 2000, 2 October 2000 and Volksblad 29 September 2000). From a strict statistical point of view, it could be argued that these are repetitions skewing frequency counts. However, from a lexicographic perspective, a case could be made for bona fide use in multiple sources over large and different geographic areas, e.g. Die Burger, mostly southern regions of South Africa, Beeld, mostly northern regions, etc., i.e. not true/basic repetition.

Evaluation of Media24 for categories gardening, quilting and embroidery

[...] from a major media corpus could provide sufficient aid to the compiler of a major dictionary [...]
[...] ww. (geborduur) [...] 1 Met naaldwerk versier: Blomme op 'n kussing borduur. 2 (fig.) Op oordrewe wyse opsier: 'n Verhaal borduur met romantiese verdigsels.
As in the case of lower ranking Gardening Keys, these Quilting and Embroidery Keys lemmatised in WAT and HAT are not supported by Media24 texts.

Atkins, B.T. Sue and Michael Rundell. 2008. The Oxford Guide to Practical Lexicography. Oxford/New York: Oxford University Press.