New Advances in Corpus-based Lexicography

This article presents various approaches used in corpus-based computational lexicography. A claim is made that in order for computational lexicography to be efficient, precise and comprehensive, it should utilize the method where the corpus text is first analysed, and the results of this analysis are then processed further to meet the needs of a dictionary. This method has several advantages, including high precision and recall, as well as the possibility to automate the process much further than with more traditional computational methods. The frequency list obtained by using the lemma (the equivalent of the headword) as a basis helps in selecting the words to be included in the dictionary. The approach is demonstrated through various phases by applying SALAMA (the Swahili Language Manager) to the process. Manual work will be needed in the phase when examples of use are selected from the corpus, and possibly modified. However, the list of examples of use, arranged alphabetically according to the corresponding headword, can also be produced automatically. Thus the alphabetical list of headwords with examples of use is the material on which the lexicographer works manually. The article deals with problems encountered in compiling traditional printed dictionaries, and it excludes electronic dictionaries and thesauri.


Introduction
The use of computers in lexicographical work has gone through various phases, where enthusiasm on the one hand and disappointment on the other have alternated. The calculating power and speed of computers were thought to revolutionise the compilation of dictionaries, and high expectations were held for automating the process. It was thought that text corpora could be transformed into dictionaries with minimal human intervention. 1 In this kind of thinking, two major mistakes were made. It was thought that strings in text would, with minimal modifications, become lexemes and possible dictionary entries. The other mistake was that there was no linguistic insight built into the system. 2 At best this approach resulted in various kinds of concordances where the occurrence of a word or a group of words could be retrieved from text with the required amount of context, and sorted in selected ways. Much of the usefulness of computers in lexicography was seen just in these terms (Jones and Sondrup 1989; Panyr and Zimmermann 1989). Automatic concordancing was, of course, a huge improvement compared with manual compilation, but there was nothing linguistically intelligent in it. These retrieval programs, often called KWIC (Key Word In Context), continue to be standard tools in dictionary work, but they are suitable only for selected tasks.
Because a good dictionary is much more than a list of words, linguistic sophistication is required from computer-based lexicography. In order for the computer-based lexicographical work to be really meaningful, the computer system used for the work has to acquire and make explicit the linguistic information attached to each of the potential lexemes in the dictionary. These requirements include, inter alia:
-the category of each word (part of speech),
-sufficient information for guiding the use of a word, such as inflection, concordance, tone pattern, argument structure, etc., 3
-semantic information, including glosses in bilingual dictionaries, 4
-etymological information, 5 and
-the commonness of a word (frequency category).
Only fairly recently has computational lexicography reached the level where both realism and know-how make it possible to achieve significant advances (Teubert 2001). Much of the current work still concentrates on the problems encountered in the lexicography of English and other Western languages. African languages raise different kinds of problems, including complex morphology, tonology, disjoining writing systems, etc., and these have to be faced and solved. A major problem in the computational analysis of language is ambiguity. The extent of ambiguity varies among languages, but in every language it is a problem and needs to be solved. Ambiguity occurs on the morphological level, as well as on the syntactic and semantic levels. A word in isolation may have more than one morphological interpretation. It may have more than one syntactic function, and more than one semantic role, especially several textual meanings.
The computer system designed for lexicographical work should be able to address each of these problems and solve them. This calls for a full computational description of a language, a description that in great detail makes use of linguistic rules and is lexically comprehensive. In other words, the system should be able to analyse unrestricted text of a particular language.
In order to make the subsequent discussion more comprehensible, a description will be given of SALAMA (the Swahili Language Manager), a computer system designed for Swahili, a major Bantu language. Work on the computer description of this language started in 1985, and by now has reached a phase where almost all the problems have at least been addressed, and most of them solved. 6 The system will be briefly described phase by phase, and then by means of examples it will be shown how the system can be applied for dictionary compilation.

Choice of headwords
Data in language dictionaries are usually arranged under headwords ordered alphabetically. Good dictionaries also have sub-entries for listing such lexical words that are either derivatives of headwords or are in some other way closely related to the headword. Lexicographers consider the choice of headwords fairly difficult. 7 Because the final product of dictionary work has to be limited in size, a choice of headwords has to be carried out. Here we will discuss the choice of entries for a general language dictionary, although methods for semi-automatic compilation of domain-specific dictionaries have also been developed. 8 We may think that a large enough and balanced corpus of general language text is a base for such a dictionary, and by retrieving the lemmas of words in the corpus we will get a reliable list of dictionary entries. The task is not so simple, however. We need large amounts of various types of text for the corpus, and we also have to think about its representativeness. A problem with text-based lexicography is that words used mainly in spoken contexts will not be represented in text, and such words need to be considered separately. One method is to use transcriptions of spoken corpora as source for spoken language, but sufficiently large and representative spoken corpora are rarely available.
A systematic and comprehensive analysis of written language starts from the identification and analysis of individual words. More specifically, what we find in text is actually word-forms and not such words we find as dictionary entries. Such word-forms will be analysed morphologically, and each interpretation will be made explicit. Thus the interpretation of many word-forms becomes ambiguous, i.e. the word-form has more than one legitimate interpretation.
The concept of 'word' itself is also not as clear as it seems. In lexicography, we are more interested in grammatical words than orthographic words. Grammatical words fairly closely correspond to concepts, and it is the concepts and their definitions we need to deal with in lexicography. A concept may be represented in text by more than one string of characters. The treatment of such multi-word concepts may already be problematic in counting word frequencies of English (Kilgarriff 1997), but it can be detrimental in languages with a disjoining writing system (Hurskainen and Halme 2001).
Multi-word concepts can be treated as single concepts in automatic processing, especially if their constituent parts do not inflect and if they are adjacent to each other. This can be done by temporarily joining such word clusters together, and in the final version the words can be returned to their original shape. Grammatical words allowing other words between the constituent parts cannot be treated in this simple way, but there are means for treating them too (Tapanainen and Järvinen 1998).
One requirement for a useful system is that it has to be comprehensive. In other words, it should not leave words in text without interpretation, however rare or strange they are. There are two major reasons for this. There should be a 'master dictionary' that contains all the grammatical information of the language, as well as all lexical information. When compiling a smaller dictionary for a specific purpose, it is easier to filter out unnecessary analysed material than to cope with unrecognised (and unanalysed) words. Another reason for comprehensiveness is that in order for a disambiguating program to fulfil the task reliably there should not be unanalysed words in text.
If the text corpus is large and balanced enough, the core vocabulary of the dictionary can be selected on the basis of the lemma list arranged in frequency order. For example, we may think of choosing the 10 000 most frequent lemmas for a dictionary. Except for special purpose dictionaries, it is a good policy to include words in order of frequency in the dictionary. The point where the frequency list will be cut depends on the intended size of the dictionary. This method ensures that at least all common words will be included. This statement sounds trivial, but it is not trivial at all. In the comprehensive computer evaluation of five Swahili dictionaries (Hurskainen 1994, 2002), it was found that the two most authoritative dictionaries 9 had serious omissions in core vocabulary, although they had a fairly large percentage of words not found in any texts at all. The tests were made with three different corpora, totalling 4 227 362 words. The results show that the monolingual dictionary Kamusi ya Kiswahili Sanifu (KKS) was able to recognize between 89.7 and 91.8% of the words of the three corpora, and Kamusi ya Kiswahili-Kiingereza (KKK) recognized 90.7 to 92.9% of the words. At the same time, both dictionaries listed a number of words not found in the corpus. Only half the nouns (precisely 50%) of classes 1/2, 3/4, 5/6, 7/8, and 9/10 listed in KKS were found in the corpus. The corresponding percentage in KKK was 55, i.e. it had fewer 'excessive' words. With verbs the situation was better: 78% for KKS and 85% for KKK.
If we compare these results with Swahili-Suomi-Swahili-sanakirja (Abdulla et al. 2002), which was also tested, we find interesting differences. This dictionary was produced by using a corpus as base for selecting headwords. Its success rate in recognising the words of the corpus ranges between 91 and 94%. In other words, it covers the vocabulary of the corpora slightly better than KKS and KKK. On the other hand, the percentage of 'excessive' nouns of the classes mentioned above was only 24%, and with verbs it was practically zero. In other words, only such verbs also used in the corpora were listed in the dictionary.
These statistics reveal the possibilities of modern language technology to show in detail weaknesses of existing dictionaries, as well as the improvements technology can bring to dictionary compilation.
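The two measures used in this comparison can be illustrated with a minimal sketch. It assumes that the corpus has already been reduced to a list of lemma tokens and the dictionary to a set of headwords; the function names and the toy data are hypothetical and are not taken from SALAMA or from the cited evaluation.

```python
from collections import Counter

def corpus_coverage(corpus_lemmas, headwords):
    """Share of corpus tokens whose lemma is listed as a headword."""
    counts = Counter(corpus_lemmas)
    covered = sum(n for lemma, n in counts.items() if lemma in headwords)
    return covered / sum(counts.values())

def attested_share(corpus_lemmas, headwords):
    """Share of headwords that occur at least once in the corpus."""
    seen = set(corpus_lemmas)
    return len(headwords & seen) / len(headwords)

# Toy data (hypothetical): eight corpus tokens, five dictionary headwords.
corpus = ["soma", "soma", "andika", "pombe", "ingia", "soma", "posho", "ingia"]
dictionary = {"soma", "andika", "ingia", "kifaru", "twiga"}

coverage = corpus_coverage(corpus, dictionary)  # 6 of 8 tokens are covered
attested = attested_share(corpus, dictionary)   # 3 of 5 headwords are attested
```

A low coverage value points to omissions in the dictionary, whereas a low attested share points to 'excessive' headwords that the corpus does not support.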
This lengthy discussion on the problems of selecting headwords for a dictionary reveals that it is a major issue. The use of a frequency list of corpus lemmas is a safe method of avoiding at least major omissions.
The frequency list is, however, not the final entry list of the dictionary. The corpus is rarely so large and balanced that it alone provides all words needed, even for a fairly modest dictionary. Many words used in everyday life are often missing in the corpus, because such matters are not dealt with in texts. Names of flora and fauna are also insufficiently found in texts.

Format of the corpus
It was pointed out above that for the corpus to be maximally useful in dictionary compilation, the linguistic information of the text must be made explicit. Even the first task, i.e. the production of the lemma list, does not succeed in languages with left-branching (prefixing) inflection without a morphological analysis program capable of returning the correct lemma of each word-form. For automatic inclusion of relevant linguistic information needed in a dictionary, the linguistic analyser is an absolute necessity.
Therefore, it is not a question of whether the corpus should be tagged or not, but how and in what phase the tagging is to be performed. Principally there are two methods of tagging, both of them automatic. In one method, which is more traditional, the raw text is tagged with a computer program, and the tagged version of the corpus is then used by the lexicographer as source text. Queries are made to the tagged version, and tags can be used as search keys.
In another method, which basically performs the same operations as the one described above, the lexicographer works with raw text and uses the whole array of programs and utilities in compiling the dictionary. In this method, the user has the raw material (text) and a comprehensive set of tools (programs, utilities, filters, scripts, etc.), which can be used in a number of ways, depending on the type of task.
The latter method is better than the former for several reasons. The user is free to select or prepare their own texts without resorting to tagged corpora prepared by someone else, often for purposes not ideal for the current task. The user also avoids handling of excessively large files. On average, the analysed Swahili text is 16 times larger than the original text, and even after disambiguation it is still 11 times larger than the original. Any editor has difficulties in handling files of this magnitude.
The size problem can be conveniently solved by carrying out the analysis and disambiguation 'in flight', which means that the user does not even see the results of these phases, because further processing can be carried out in a pipe. In lexicography we do not need to see all occurrences of a word in the corpus. We rather want to know in what senses the word occurs in the corpus, and how many times it occurs in each sense. By condensing the format of the information, we do not lose any lexically important information, but the space required for presenting it is cut to a minimum. The larger the corpus, the bigger the advantage. This method of lexicography requires a working environment where piping of processes is possible, such as Linux and Unix.
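The idea of processing 'in flight' can be sketched with Python generators, which pass one token at a time from stage to stage in the same way as a Unix pipe. This is only an illustration under stated assumptions: the lexicon, the tags CC and PREP, and the trivial disambiguation step (keep the first reading) are hypothetical stand-ins and do not reproduce SALAMA.

```python
from collections import Counter

def analyse(tokens, lexicon):
    """Yield (token, readings) one token at a time; nothing is stored."""
    for tok in tokens:
        yield tok, lexicon.get(tok, [])

def disambiguate(analysed):
    """Yield one reading per token. Stand-in rule: keep the first reading."""
    for _tok, readings in analysed:
        if readings:
            yield readings[0]

def condense(readings):
    """Reduce the stream to counts per (lemma, tag) pair."""
    return Counter(readings)

# Hypothetical lexicon: word-form -> list of (lemma, tag) readings.
LEXICON = {
    "anasoma": [("soma", "V")],
    "walisoma": [("soma", "V")],
    "na": [("na", "CC"), ("na", "PREP")],
}

text = "anasoma na walisoma na".split()
summary = condense(disambiguate(analyse(iter(text), LEXICON)))
```

Only the condensed summary is kept in memory; the bulky intermediate analysis is never written to a file.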

Searching headwords from the corpus
How can the occurrences of a lexical word be found in the corpus? There are currently at least three methods for doing this. Each of these and their suitability for African languages will be briefly discussed below.

Direct string search -traditional approach
In languages with right-branching inflection and derivation, direct string search is not a major problem, because the potential headwords and their inflected and derived forms are adjacent to each other in alphabetical listing. In languages with predominantly left-branching inflection, the problem is more serious, as is demonstrated in (1). Our task is to extract all occurrences of the verb soma (to read). As can be seen, the search string cannot be the whole verb stem, but only the root som, because the verb may also be ending in e or i, and various types of derivative suffixes can be added. Similarly, a large set of (strings of) prefixes has to be taken into account.
(1) Example of string search

With the keyword som we are likely to get all the real cases, but also a lot of wrong words. 10 If we try to modify the search string so that wrong hits will be reduced, we run the risk of excluding real cases.
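The trade-off can be reproduced with a small sketch. The word list below is an illustrative assumption and is not the material of example (1); it mixes forms of soma with words that merely contain the same character string.

```python
def substring_hits(words, key):
    """Return every word that contains the search key anywhere."""
    return [w for w in words if key in w]

# Illustrative word list (hypothetical): six forms of the verb soma,
# followed by unrelated words.
words = ["anasoma", "walisoma", "asome", "hasomi", "kusomesha",
         "msomali", "kisomali", "somalia", "kitabu"]

# The short key finds every verb form, but also three wrong words.
hits_root = substring_hits(words, "som")

# The longer key still lets the wrong words through and in addition
# loses the forms ending in -e and -i and the derived stem.
hits_stem = substring_hits(words, "soma")
```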

String search with regular expressions
The search is much more accurate if we use regular expressions in formulating the search key. If language analysis tools are not available in dictionary compilation, this is a valuable alternative. It is far more efficient than direct string search, but it is not even nearly as accurate and efficient as the compilation by employing language analysis tools. Instead of using som as search key we have to approach the problem by also trying to describe other elements of the verb that are distinctive enough for separating them from other word categories. As the verb final vowels may be a, e, i and u, this is not a promising approach, because many word categories have similar endings.
A more promising approach is the description of verb prefixes, because there is usually a longer string of characters typical to verbs only. The problem is that there are at least tens of thousands of such grammatical character combinations. Regular expressions, however, make the formulation of such queries possible, even practical. In (2), such a query has been used, and as the result shows, all findings now are verbs.
(2) Example of search by using regular expressions

Even this search string is not accurate, because it leaves out the so-called general present tense, subjunctive, present tense negative, infinitive, and several rarer tense/aspect forms. It is difficult, and dangerous, to include such possibilities in the same search key, because the danger of getting unwanted strings will multiply.
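A query of this kind can be sketched with Python's re module. The pattern below is a hypothetical reconstruction and not the query used in (2): it assumes a small subset of subject prefixes, four tense markers and a few object markers, and it uses the same notation of alternatives, optionality and repetition.

```python
import re

# Hypothetical, deliberately incomplete prefix inventories.
SUBJECT = "(ni|u|a|tu|m|wa)"
TENSE = "(na|li|ta|me)"
OBJECT = "(ni|ku|m|tu|wa|ki|vi)?"   # the object marker is optional

# subject prefix + tense marker + optional object marker + root + ending(s)
VERB_SOMA = re.compile(SUBJECT + TENSE + OBJECT + "som[a-z]+")

def is_soma_form(word):
    """True if the whole word matches the prefix pattern around the root som."""
    return VERB_SOMA.fullmatch(word) is not None

words = ["anasoma", "walisoma", "tutakisoma", "amesomea",
         "asome", "hasomi", "kusoma", "msomali", "somalia"]
found = [w for w in words if is_soma_form(w)]
```

In this sketch the wrong words are excluded, but the subjunctive asome, the negative hasomi and the infinitive kusoma are missed as well, which is the weakness described above.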
Let us modify our previous task, so that instead of searching for the verb soma, we look for all occurrences of each verb in the corpus. We cannot use the verb stem as part of the search key now, because there are thousands of verbs, and we do not know in advance what they are. We may try to simulate the verb stem by defining its minimum length. With some verb forms of monosyllabic verbs it is as short as two characters. Unfortunately this is also the length of the stem in many independent relative constructions, and in some it is even three characters. Thus it seems impossible to get an unmixed list of verbs only. Examples of found strings are shown in (3). Verb roots are in bold face. The search found 5 770 verb candidates, and as expected, there were independent relatives and also nouns that fulfilled the search criteria. The precision was, however, very good: more than 98%. The recall was much worse. The analysis with SALAMA showed that there were in addition 2 659 words that were unambiguously verbs. Thus the recall was as low as 68%. This could be improved considerably by using the search strings that were excluded above and that could not be included in the same search.
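The recall figure follows from the numbers given above, as the following short calculation shows. The precision is reported only as 'more than 98%', so the value 0.98 used here is an assumed lower bound; the result stays at about 68% even if the precision is taken to be 100%.

```python
def recall(found, precision, missed):
    """Recall = true positives / (true positives + missed items)."""
    true_positives = found * precision
    return true_positives / (true_positives + missed)

# 5 770 candidates found, 2 659 verbs missed; precision assumed to be 0.98.
r_low = recall(found=5770, precision=0.98, missed=2659)
# Upper bound: every candidate is a real verb.
r_high = recall(found=5770, precision=1.0, missed=2659)
```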

(3) An attempt to retrieve verbs by using regular expressions
The identification of a verb lemma is even more difficult than the identification of a verb. We could think of writing a program that would mark the beginning of a verb lemma for each verb in text. This code could then be used in retrieving the lines. In this way we would get a concordance list where the beginning of each verb lemma is marked. It would then be fairly easy to isolate the correct lemma, although a fairly large amount of manual work would be necessary.

Advanced approach -analyse text first
Although the use of regular expressions facilitates complicated search strings, it is still far from the precision, recall, and ease of use of an approach where the text is first analysed linguistically. In this method, the following features are made explicit:
-The lemma or base form of the word can be defined so that it is identical with the headword of the dictionary. As a consequence, we get a list of words to be included in the dictionary.
-Part-of-speech information is given by the analysis program.
-The program produces a detailed list of morphological features of the word-form found in text.
-Semantic features can be added. For example, information on animacy or humanness may be necessary for defining the correct concordance pattern. Verbs may also be given information on their argument structure (SV, SVO, SVOO, etc.).
-If the dictionary is intended to be bilingual, semantic glosses in another language can be automatically produced for each dictionary entry. 12
-Syntactic features (subject, object, various roles of verbs, dependent constituents in noun phrases, etc.) can be added. In dictionary compilation, such features are usually omitted.
-Information on the etymology of words can be added.
-Variant, or non-standard, orthography can be reported.
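One way to picture the result of such an analysis is as a record with one field per feature type. The sketch below is a hypothetical data structure, not the output format of SALAMA; the field names are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Reading:
    """One interpretation of one word-form (hypothetical layout)."""
    word_form: str                                      # string found in the text
    lemma: str                                          # identical with the headword
    pos: str                                            # part of speech
    morph: List[str] = field(default_factory=list)      # morphological features
    semantic: List[str] = field(default_factory=list)   # e.g. argument structure
    gloss: Optional[str] = None                         # for bilingual dictionaries
    syntax: List[str] = field(default_factory=list)     # usually omitted in dictionaries
    etymology: Optional[str] = None                     # source language, if known
    variant_of: Optional[str] = None                    # standard spelling, if non-standard

    def entry_key(self):
        """Key under which the reading is counted as a dictionary entry."""
        return (self.lemma, self.pos)

# A noun of English origin, with its noun class stored as a morphological feature.
r = Reading(word_form="plastiki", lemma="plastiki", pos="N",
            morph=["9/10"], gloss="plastic", etymology="eng")
```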

The problem of ambiguity
Word-forms often have more than one interpretation. A word-form may belong to more than one word class. English is a good example of this kind of ambiguity. In Bantu languages, ambiguity is often caused by the fact that the same morpheme is a marker of more than one noun class. Although word-forms may be ambiguous on the word level, in context they normally have only one interpretation. A general rule is that the more comprehensive the analyser is, the more ambiguity the result has. There are two major approaches for solving ambiguity. One method relies on probabilities. If a word-form has two interpretations and one of these is common and the other rare, then the common one is chosen. The result is often correct, but one is never certain whether it is correct or not, because the choice was made on the basis of probability. In another method, ambiguity is resolved with context-sensitive 'linguistic' rules. For the vast majority of cases, context-sensitive rules fulfil the task.
Heuristic rules are used only for cases where there is no basis for constructing a linguistic rule. On the basis of morphological features, such rules try to guess the correct interpretation of the word. For example, if a word begins with m- and ends with -aji, the word is very likely a deverbative noun of noun class 1. It is self-evident that ambiguity can be resolved only in context, i.e. as part of real text.
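The heuristic mentioned above can be sketched as a fallback that is consulted only when the lexicon offers no analysis. The function name, the tag strings and the toy lexicon are hypothetical; the sketch shows the principle only.

```python
def guess_reading(word, lexicon):
    """Return the lexicon analysis if there is one, otherwise try a guess."""
    if word in lexicon:
        return lexicon[word]          # a real analysis always takes precedence
    if word.startswith("m") and word.endswith("aji"):
        return "N class 1, deverbative (guessed)"
    return None                       # no basis for a guess

# Toy lexicon (hypothetical tag string). The noun maji also fits the
# m-...-aji shape, which is why the guess must remain a last resort.
lexicon = {"maji": "N (analysed by the lexicon)"}

guessed = guess_reading("msomaji", lexicon)
known = guess_reading("maji", lexicon)
unknown = guess_reading("kitabu", lexicon)
```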

Removing excessive tags
Experience has shown that the more detailed the analysis of words, the better possibilities it offers for linguistically motivated disambiguation. Therefore, all features should be made explicit in morphological and semantic analysis, because they may be needed in writing disambiguation rules. An example of complexity is provided in (6), where a few word-forms of the verb andika (to write) have been analysed. Note that morpheme boundaries (+) have been manually added, and ambiguity has been removed by rules, so that each form has only one interpretation.

Post-processing of the analysed corpus
When each word in the corpus is analysed and the ambiguity resolved, the result can be manipulated in a number of ways. In dictionary work, we in fact need several kinds of modifications to the result. For the selection of dictionary entries, we need a frequency list according to the lemma. In order for the list to be correct, we need to remove the actual word-form and all such tags that describe inflection, as well as the codes of verbal extensions. By doing this, we may collapse the list in (7) above and get a single line as shown in (9). If verbal extensions are also counted as separate lexical entries as in (8) above, we get a list as shown in (10). Note, however, that if the list is sorted in frequency order, the extended forms will not be adjacent to each other.

(10) Counting verbal extensions
3 andika V SVOO 'write'
4 andikia V SVOO 'write' APPL
3 andikisha V SVOO 'write' CAUS

When we have the list of lemmas from the corpus that we want to include in the dictionary, we sort the list according to the lemmas. The result is the skeleton of the dictionary, and the headwords are arranged alphabetically. The top part of such a frequency list is shown in (11). We note that it is not merely a list of lemmas, because different functions of the same word cause them to be counted separately. For instance, the word na has four different functions, and thanks to the disambiguation program, we have four different frequencies for this word.
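The two ways of counting can be sketched as follows. The record layout is a hypothetical simplification in which each disambiguated reading carries both its base lemma and its extended stem; the counts are those given in (10).

```python
from collections import Counter

def lemma_frequencies(readings, keep_extensions=False):
    """Count readings per entry, with or without verbal extensions."""
    counts = Counter()
    for r in readings:
        if keep_extensions:
            key = (r["stem"], r["pos"], r["ext"])
        else:
            key = (r["base"], r["pos"], None)   # inflection and extension dropped
        counts[key] += 1
    return counts

def reading(base, stem, ext):
    return {"base": base, "stem": stem, "pos": "V", "ext": ext}

# Ten disambiguated verb tokens, distributed as in (10).
readings = ([reading("andika", "andika", None)] * 3
            + [reading("andika", "andikia", "APPL")] * 4
            + [reading("andika", "andikisha", "CAUS")] * 3)

collapsed = lemma_frequencies(readings)
split = lemma_frequencies(readings, keep_extensions=True)

# Frequency order for selecting entries, alphabetical order for the dictionary.
by_frequency = sorted(split.items(), key=lambda kv: (-kv[1], kv[0][0]))
alphabetical = sorted(split.items(), key=lambda kv: kv[0][0])
```

With extensions dropped, the ten tokens collapse into a single entry for andika; with extensions kept, the same tokens yield three entries whose order depends on the sorting key.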
The dictionary itself is ordered according to the headword, and for this reason we have to rearrange the data. We also want to retain information on the frequency of the words. Selected entries from the alphabetically arranged data, extracted from a small section of the news corpus, are shown in (12).

• • •
1 plastiki N 5a/6 'plastic (eng)'
3 plastiki N 9/10 'plastic (eng)'
20 pombe N 9/10 'local brew, beer'
21 ponda V SV 'pound, crush, mash; smash, crash'
17 posho N 9/10 '1 allowance. 2 food, ration'
13 potea V SV '1 be lost.

We see that some nouns are used in two different noun classes, and the frequencies of each usage are shown. Inflecting adjectives and non-inflecting adjectives have separate codes, which is necessary information for the dictionary user. Verb types are classified and marked with transitive (SVO) and intransitive (SV) tags. Etymological information, if applicable, is given at the end of the gloss.
In (13), we finally have a form where frequency information has been transformed into classes, the most frequent ones being marked with three dark dots, and the least frequent ones with no dots at all. Some further formatting has also been incorporated, all without manual intervention.
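The conversion of raw frequencies into classes can be sketched as a simple threshold function. The thresholds below and the placement of the dots after the gloss are assumptions made for illustration; the article does not state the class boundaries.

```python
def frequency_class(count, thresholds=(100, 20, 5)):
    """Return zero to three dots, one for each threshold that is reached."""
    dots = sum(1 for t in thresholds if count >= t)
    return "•" * dots

def format_entry(lemma, pos, gloss, count):
    """Lower-case the category code and replace the count with dots."""
    line = f"{lemma} {pos.lower()} '{gloss}' {frequency_class(count)}"
    return line.rstrip()              # no trailing space when there are no dots

entry = format_entry("pombe", "N 9/10", "local brew, beer", 20)
rare = format_entry("plastiki", "N 5a/6", "plastic (eng)", 1)
```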

(13) Dictionary entries with frequency classes
awali adv 'first, originally (ar)' •• awali n 9/10 '1 first. 2 origin, cause. 3 above (ar)' •• awamu n 9/10 'phase' azimio n 5a/6 'declaration' azma n 9/10 'intention; desire, purpose' baa n 9/10 'bar, pub.

In order to automate the process, we need a third kind of list where the lemmas (i.e. headwords) are attached to the actual word-forms in the corpus. Basically the production of such a list is simple, because it is the default format of the analysis result of SALAMA. The problem is that if we do a selection of lemmas according to frequency, it is not easy to delete the correct lemmas from the original list, because the frequency order there is completely different compared with the lemma list. The solution is to retrieve all such lines from the main list where the lemmas of our selection list occur. As a result, we have a list of only those words we intend to include in the dictionary, and the list also has accurate information on the actual word-forms we can use as key for retrieving examples of use in the corpus.
The search for examples of use can be performed in two ways. One possibility is interactive, where the dictionary compiler checks from the corpus the use of each lemma by employing one of several search programs or a more user-friendly interface. The other possibility is to retrieve the needed examples with a program. The resulting file will have all those words in context for which we want examples of use. By sorting such lines according to the lemma, we get a list of examples of use in the same order as in the dictionary. It is then fairly simple for the dictionary compiler to select and modify suitable examples of use to be included in the final dictionary. In (14), we have an extract from an alphabetically ordered list of the use of words in context. This list was produced by a program which used the word-form (not lemma) as search key.

vyama vingi nchini na kufanikiwa.
ingia: *hata_hivyo, wakazi hao wamemuomba *rais *benjamin *mkapa <aingilie> katika hatua hiyo kwa madai kuwa ni ya uonevu.
ingia: *mwenyekiti wa *chama cha *wananchi (*cuf) *profesa *ibrahim *lipumba, amewahimiza wafuasi wa chama hicho kujitokeza kwa wingi kwenye maandamano yaliyopangwa kufanyika nchi nzima *jumamosi ijayo na kwamba wawe imara kukabiliana na polisi pindi <watakapoingilia> maandamano hayo.
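The automatic retrieval of examples can be sketched as follows. The sketch assumes a mapping from word-forms to lemmas, as produced by the analysis, and a set of selected headwords; the function name and the simplified whitespace tokenisation are hypothetical. The output imitates the layout of the extract above, with the lemma in front and the matched word-form in angle brackets.

```python
def examples_of_use(sentences, form_to_lemma, selected):
    """Return 'lemma: context' lines, sorted by lemma, for selected headwords."""
    found = []
    for sentence in sentences:
        tokens = sentence.split()     # simplification: split on whitespace only
        for i, tok in enumerate(tokens):
            lemma = form_to_lemma.get(tok)
            if lemma in selected:
                marked = tokens[:i] + [f"<{tok}>"] + tokens[i + 1:]
                found.append((lemma, " ".join(marked)))
    found.sort(key=lambda pair: pair[0])   # same order as in the dictionary
    return [f"{lemma}: {context}" for lemma, context in found]

# Word-form -> lemma pairs as delivered by the analysis (toy selection).
form_to_lemma = {"anasoma": "soma", "aingilie": "ingia", "kitabu": "kitabu"}
selected = {"ingia", "soma"}          # headwords chosen from the frequency list

sentences = [
    "anasoma kitabu",
    "wakazi hao wamemuomba rais aingilie katika hatua hiyo",
]
lines = examples_of_use(sentences, form_to_lemma, selected)
```

Because the search key is the word-form and the sort key is the lemma, inflected and derived forms end up under the headword to which they belong.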

Conclusion
After a fairly long period of research and testing, computational lexicography has reached a stage where computers and corpora can be put into effective use. For many years, computers have been used for producing word lists with frequencies from a corpus, as well as for retrieving concordances of word use. This article has shown that the use of regular expressions can significantly increase the precision and recall of search. However, the inclusion of the full linguistic analysis in dictionary work brings the work to a level where precision and recall meet high standards. SALAMA, the working environment developed for Swahili, facilitates the testing of various phases in dictionary compilation based on extensive use of the computer. This article demonstrates that computer-based lexicography not only benefits greatly from the described approach; the approach is in fact a necessity in working with highly inflectional left-branching languages.
The system brings the automation of dictionary compilation to the point where the benefits of further automation become questionable. It accurately describes what can safely be described, and leaves ambiguous cases for human checking. Its great advantages are morphological accuracy and coverage, great speed, and ease of use.
The system can be developed still further, especially in the area of semantic disambiguation, so that correct senses of words in each context can also automatically be defined. Research is currently concentrating on the problems in this area.

1. There were also more realistic opinions that reflected the contemporary state-of-the-art in this field (Calzolari 1989; Wegera and Berg 1989).

2. By linguistic insight we here mean a kind of simulation of linguistic regularities, which a computer system utilizes and translates as 'linguistic rules'.

3. There has been discussion on the need for sufficient and systematic grammatical information in dictionaries (Salerno 1999). The approach discussed in this article effectively facilitates the inclusion of this feature.

4. The need for semantic information in dictionaries has increasingly been emphasized, whether in terms of frame semantics (Fontenelle 2000, 2000a) or in terms of some other semantic theory. Statistical methods have also been used for identifying such word clusters that seem to occur together. On the basis of such clusters it is possible to carry out cluster analysis (Watters 2002).

5. In SALAMA, the Swahili Language Manager, etymological information on words of non-Bantu origin has been included by means of specific tags (Hurskainen 1999).

7. In fact, according to a survey, the choice of headwords was considered the most difficult among the 13 tasks asked from the team working on the third edition of the Longman Dictionary of Contemporary English (Kilgarriff 1998

10. The strings we wanted to find are shown with √.

11. Alternative strings are separated with a vertical bar and all alternatives are enclosed in parentheses. The question mark (?) stands for optionality, and the plus sign (+) means that the preceding unit may occur one or more times. The set a-z within square brackets means any lower-case letter from a to z. The backslash (\) at the end of the line signifies that for the computer the same line continues.

12. The accuracy of the semantic glosses depends on how they were acquired in the analysis system. The most obvious way not requiring too much manual work is to use an electronic version of a good normal dictionary and include relevant parts of its entries in the dictionary of the analysis system. This was done in SALAMA, and the glosses produced are largely the same as those in the original dictionary, for good and bad. We should not, however, be content with these glosses, because they are just approximations of the various meanings of the lexemes and they should be checked and amended on the basis of the information available in the corpus. In addition to helping in the selection of headwords, the corpus is useful in identifying various meanings of the lexemes.