Compiling Dictionaries Using Semantic Domains *

The task of providing dictionaries for all the world's languages is prodigious, requiring efficient techniques. The text corpus method cannot be used for minority languages lacking texts. To meet the need, the author has constructed a list of 1 600 semantic domains, which he has successfully used to collect words. In a workshop setting, a group of speakers can collect as many as 17 000 words in ten days. This method results in a classified word list that can be efficiently expanded into a full dictionary. The method works because the mental lexicon is a giant web organized around key concepts. A semantic domain can be defined as an important concept together with the words directly related to it by lexical relations. A person can utilize the mental web to quickly jump from word to word within a domain. The author is developing a template for each domain to aid in collecting words and in describing their semantics. Investigating semantics within the context of a domain yields many insights. The method permits the production of both alphabetically and semantically organized dictionaries. The list of domains is intended to be universal in scope and applicability. Perhaps due to universals of human experience and universals of linguistic competence, there are striking similarities in various lists of semantic domains developed for languages around the world. Using a standardized list of domains to classify multiple dictionaries opens up possibilities for cross-linguistic research into semantic and lexical universals.


The problem (It's going to take forever)
The mental lexicon is far larger than either the grammatical component or the phonological component in a person's linguistic competence. Investigating and describing it is the largest and most time-consuming task in descriptive linguistics. With perhaps 6 000 languages in the world and perhaps 20 000 words in each, we need to collect and describe something on the order of 120 000 000 words. 2 The major languages of the world often have several large published dictionaries available to them. The major publishing companies can afford to hire scores of professional lexicographers to compile massive text corpora and do the research necessary to produce quality dictionaries. But for minority languages the picture is far bleaker. With few or no published texts, few or no professional lexicographers available to them, and little or no funding, the minority languages face a daunting challenge.
I have been involved in the production of dictionaries for minority languages since 1985 and have taught lexicography seminars to train others in the process. I estimate that linguists working in a language development project add words to their lexical database at the average rate of only 650 words per year, or about 2.5 words per working day. 3 At this rate it frequently takes 20 years to produce even a modest dictionary. For many years I have been concerned about this abysmal rate of progress and have attempted to find ways to make the process of compiling a dictionary simpler and more efficient. If we are ever going to finish the task of documenting the world's languages, we need a mass production technique.
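The scale of the problem can be checked with a few lines of arithmetic. The following is a minimal sketch using only the figures quoted above; the value of 260 working days per year is an assumption (52 weeks of 5 days) that is not stated in the text, chosen because it reproduces the quoted rate of 2.5 words per working day.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the text.
LANGUAGES = 6_000             # "perhaps 6 000 languages in the world"
WORDS_PER_LANGUAGE = 20_000   # "perhaps 20 000 words in each"
WORDS_PER_YEAR = 650          # estimated average rate for one project
WORKING_DAYS_PER_YEAR = 260   # ASSUMPTION: 52 weeks x 5 days, not from the text

# Total number of words to collect and describe worldwide.
total_words = LANGUAGES * WORDS_PER_LANGUAGE

# Daily rate implied by the yearly estimate.
words_per_day = WORDS_PER_YEAR / WORKING_DAYS_PER_YEAR

# Size of the "modest dictionary" reached after 20 years at this rate.
words_after_20_years = WORDS_PER_YEAR * 20

# Years needed to reach 20 000 words in a single language at this rate.
years_for_full_lexicon = WORDS_PER_LANGUAGE / WORDS_PER_YEAR

print(f"Words to describe worldwide: {total_words:,}")
print(f"Words per working day:       {words_per_day}")
print(f"Words after 20 years:        {words_after_20_years:,}")
print(f"Years to reach 20 000 words: {years_for_full_lexicon:.1f}")
```

On these figures, twenty years of work yields about 13 000 words, and a 20 000-word lexicon would take roughly three decades for a single language.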

The journey (Searching for a solution)
For several years colleagues within SIL, together with other interested scholars, have discussed ways in which we could leverage the linguistic similarities among the Bantu languages to facilitate linguistic investigation and language development within the Bantu family. We have called this movement the 'Bantu Initiative'. In September 2000 the Bantu Initiative asked me to begin work on a dictionary template, including the production of a list of semantic domains that could be used to classify Bantu language dictionaries. I was a bit sceptical, since I had heard from numerous sources that the semantic category systems of the world's languages were vastly different, and even varied from individual to individual. But since the Bantu languages are closely related, I thought it was worth a try.
In order to construct a list of domains for Bantu languages, I needed to know how Bantu peoples categorized the words of their languages. So in December 2000 and January 2001 I held two workshops 4 for Gikuyu and Lugwere 5 in which I asked 12 speakers of each language to sort and group a list of 1 000 words chosen from a wide variety of semantic domains. I was curious to see how non-westernized peoples would classify the words of their language. My expectation was that they would set up domains very different from those of an English speaker. They didn't. Their domains were strikingly similar to other lists of semantic domains that I had collected from around the world. As I compared the lists, it became apparent that the universality of human experience and some sort of universal linguistic competence resulted in similar classification systems. The differences came from minor differences of culture and the necessity to squash a multi-dimensional system of relationships into a two-dimensional list. So I decided (perhaps presumptuously) to attempt to compile a universal list of semantic domains.
The challenge was to compile an exhaustive list of domains that could be used for any language in the world. None of the lists I had were complete. All were designed for a particular language and purpose. For instance, the Outline of Cultural Materials (Murdock et al. 1987) presents a list of anthropological domains, but is missing many lexical domains. Roget's Thesaurus (Roget 1958) has 1 000 domains, but due to its purpose it also omits many domains. Newer editions of Roget's (e.g. Morehead 1985) contain 600 major domains and thousands of smaller entries. Neither presentation is suitable for our purpose. Louw and Nida (1989: xix) admit that their list is uneven due to the subject matter of the New Testament. Recent semantically organized dictionaries such as the Longman Language Activator (Summers 1993) and the Oxford Learner's Wordfinder Dictionary (Trappes-Lomax 1997) are highly selective in the domains they include. So I concluded that a new list was needed. I contrasted and compared all the lists at my disposal, ensuring that every domain in every list was covered by a domain in my list. As I studied the organization of the lists, more and more similarities began to emerge. There was a logic to the domains, and a logic to how they were organized.
I knew from the beginning that a list of semantic domains could be used to collect words. Eliciting vocabulary has been a topic of interest for some time, and the literature contains a wealth of practical suggestions, such as using lexical relations (Beekman 1968: 4), concording a text corpus (Naden 1977: 14), and using semantic domains (Newell 1986: 20). 6 I decided to try it out and see just how easy it would be. I took the semantic domain 'Bodies of water' and started listing words that belong to the domain (e.g. ocean, lake, river, shore, wave, etc.). In fifteen minutes I had collected and subcategorized 169 words. The rate for collecting words had just jumped from 2.5 words/day to 11 words/minute. I realized that all I needed was a list of semantic domains and I could collect the words of a language in a matter of days rather than years.
As I thought about how the list of domains could be used to collect words, I realized that a simple domain label, such as 'Bodies of water', would not be adequate. Three things were needed: (1) a simple statement of the central idea of the domain, (2) elicitation questions that would prompt a person to think of words that might belong to the domain, and (3) sample words from English. 7 I have tested the materials and method in three workshops. The first test, held in May 2001, used a beta version of the semantic domains list with a group of fifteen speakers of the Lugwere language. In ten days, the participants collected over 10 000 words and 1 000 example sentences. 8 In January 2002, 30 speakers of Lunyole used version one to collect 17 000 words in ten days. In February 2002, 12 speakers of Kitharaka 9 collected 12 000 words in eight days. In the months since the workshops, speakers of each language have been editing and glossing the word lists. As the result of a few months' work, we expect to have a classified dictionary in each language of over 10 000 words, including part of speech, noun class, the plural form of each noun, and a simple gloss. The chart below compares the historical average rate of progress with the results of the three workshops.
The field of semantics has yet to reach a consensus on the nature and validity of semantic domains and semantic primitives. 'Semantic domain' is just another way of saying 'area of meaning', but the notion that a meaning occupies an area is obviously figurative. Wierzbicka (1996: 170) comes close to endorsing the notion of universal semantic domains when she says: "The idea that words form more or less natural groupings, and that at least some of these groupings are non-arbitrary, is intuitively appealing, even irresistible" (emphasis added). She also indicates that domains vary in their nature from "self-contained fields of semantically related words" to "irregular and open-ended networks of interlacing networks". The question remains: just what is a semantic domain?
I envisioned that the list of domains would serve several purposes. It could be used to collect words, it could serve to classify a dictionary, and it could aid in semantic investigation. In order for it to be an effective tool in collecting words, I felt I should list sample words from English that belong to each domain. As I analyzed the words that I was listing under each domain, and compared them to the words others had included in the same domain, I began to see patterns. Some domains consisted of a generic term, such as 'Game', and a list of specifics: chess, checkers, charades, monopoly. Others were based on the Whole-Part lexical relation, such as 'Head' and eye, nose, mouth. Other domains included a variety of words related by different lexical relations, such as 'Wave' and tidal wave, crest, break, roar, surfboard.
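The three components of a domain entry (central idea, elicitation questions, sample words) and the three patterns just described can be pictured as a small data structure. The sketch below is only an illustration, not the format of the author's materials: the names `LexicalSet`, `DomainTemplate` and `alphabetical_index` are hypothetical, and the central ideas, elicitation questions and relation labels are invented for the example. Only the domain labels and sample words come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalSet:
    relation: str      # lexical relation linking the words to the central idea
    words: list[str]   # sample words (English here)


@dataclass
class DomainTemplate:
    label: str            # e.g. 'Wave'
    central_idea: str     # (1) simple statement of the central idea
    questions: list[str]  # (2) elicitation questions
    lexical_sets: list[LexicalSet] = field(default_factory=list)  # (3) sample words

    def sample_words(self) -> list[str]:
        """Return all sample words of the domain, in template order."""
        return [word for s in self.lexical_sets for word in s.words]


def alphabetical_index(domains: list[DomainTemplate]) -> dict[str, list[str]]:
    """Derive an alphabetical word list from the semantically organized one."""
    index: dict[str, list[str]] = {}
    for domain in domains:
        for word in domain.sample_words():
            index.setdefault(word, []).append(domain.label)
    return dict(sorted(index.items()))


# Sample words are from the text; ideas, questions and relation labels are invented.
game = DomainTemplate(
    "Game", "Words for games people play.", ["What kinds of games are there?"],
    [LexicalSet("generic-specific", ["chess", "checkers", "charades", "monopoly"])],
)
head = DomainTemplate(
    "Head", "Words for the head and its parts.", ["What are the parts of the head?"],
    [LexicalSet("whole-part", ["eye", "nose", "mouth"])],
)
wave = DomainTemplate(
    "Wave", "Words related to waves on water.", ["What does a wave do?"],
    [
        LexicalSet("kind of wave", ["tidal wave"]),
        LexicalSet("part of a wave", ["crest"]),
        LexicalSet("what a wave does", ["break", "roar"]),
        LexicalSet("thing used on a wave", ["surfboard"]),
    ],
)

DOMAINS = [game, head, wave]
INDEX = alphabetical_index(DOMAINS)
for word, labels in INDEX.items():
    print(f"{word}: {', '.join(labels)}")
```

Keeping the domain label on every word is what allows one collected word list to be presented either by domain or alphabetically.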

[Chart: Rate of Progress]
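The rates reported for the three workshops can be set against the historical average with a short calculation. This is a minimal sketch based only on the numbers given in the text; where the text says "over 10 000 words", the figure is treated as a lower bound. The `Workshop` class and its property names are hypothetical. Note that the historical rate describes a single project, whereas the workshop rates are for a whole group of speakers, so the speed-up factor is only a rough comparison.

```python
from dataclasses import dataclass

HISTORICAL_WORDS_PER_DAY = 2.5  # historical average quoted in the text


@dataclass
class Workshop:
    language: str
    speakers: int
    days: int
    words: int  # words collected (lower bound where the text says "over")

    @property
    def words_per_day(self) -> float:
        """Words collected per day by the whole group."""
        return self.words / self.days

    @property
    def words_per_speaker_day(self) -> float:
        """Words collected per participant per day."""
        return self.words / (self.days * self.speakers)


WORKSHOPS = [
    Workshop("Lugwere", speakers=15, days=10, words=10_000),
    Workshop("Lunyole", speakers=30, days=10, words=17_000),
    Workshop("Kitharaka", speakers=12, days=8, words=12_000),
]

for w in WORKSHOPS:
    speedup = w.words_per_day / HISTORICAL_WORDS_PER_DAY
    print(
        f"{w.language:<10} {w.words_per_day:>7.0f} words/day  "
        f"{w.words_per_speaker_day:>6.1f} words/speaker/day  "
        f"x{speedup:.0f} the historical rate"
    )
```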
It became apparent that a semantic domain was really some important concept and all the words directly related to it by some lexical relation. The words of a language are all linked together in the mind in a gigantic multidimensional web of relationships. But these mental links tend to cluster around a central nexus. A semantic domain isn't so much an area of the web as it is one of these central hubs. One of the intriguing questions about these hubs is: What is their relationship to semantic primitives? Many domains appear to be based on semantic primitives or a combination of two or three primitives (e.g. 'Bad behavior' = do + bad; 'Parts of things' = part (of) + something). Many are headed by high frequency words which constitute the core vocabulary of a language.
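The picture of the lexicon as a web with hubs can be made concrete as a graph in which words are nodes and lexical relations are links. The sketch below is purely illustrative and makes no claim about how the mental lexicon is actually stored: the links, the relation labels, the function names (`build_web`, `hubs`, `domain`) and the threshold of three links for a hub are all assumptions made for the example.

```python
from collections import defaultdict

# (word, relation, word) links. The words come from examples in the text;
# the relation labels and the choice of links are illustrative only.
LINKS = [
    ("wave", "kind", "tidal wave"),
    ("wave", "part", "crest"),
    ("wave", "action", "break"),
    ("wave", "action", "roar"),
    ("wave", "used with", "surfboard"),
    ("head", "part", "eye"),
    ("head", "part", "nose"),
    ("head", "part", "mouth"),
    ("ocean", "related", "wave"),
    ("ocean", "related", "shore"),
]


def build_web(links):
    """Build an undirected web: each word maps to the words linked to it."""
    web = defaultdict(set)
    for a, _relation, b in links:
        web[a].add(b)
        web[b].add(a)
    return web


def hubs(web, min_links=3):
    """Words with at least min_links direct links, most connected first."""
    found = [word for word, linked in web.items() if len(linked) >= min_links]
    return sorted(found, key=lambda word: (-len(web[word]), word))


def domain(web, hub):
    """A domain: the central concept plus every word directly linked to it."""
    return {hub} | web[hub]


web = build_web(LINKS)
print(hubs(web))
print(sorted(domain(web, "head")))
```

In this toy web, a word such as 'wave' is both a member of one domain (through its link to 'ocean') and the hub of another, which is one way a multidimensional web resists being flattened into a simple list.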
Several recently published dictionaries employ a "defining vocabulary". For instance, the Longman Language Activator (Summers 1993) lists the 2 222 words of its defining vocabulary in an appendix. When one excludes the functors (e.g. the, to, of), what is left is very similar to a list of domains. The notions of "semantic domain", "semantic primitive", "core vocabulary", and "defining vocabulary" seem to be converging.
As I developed the list, I began organizing the sample English words into lexical sets. I found that each lexical set was related to the central idea of the domain by a single lexical relation. I have already mentioned that lexicographers recommend that we employ lexical relations in collecting words. This seemed like a very useful idea in the light of what I was discovering. However, lexical relations are very hard to grasp in the abstract (e.g. Conv 13 (buy) = sell (Grimes 1987: 27)). Grimes (1994) has attempted to make lexical relations more user-friendly. But there are so many of them 10 that it is extremely inefficient to [...] fields" (1996: 183). I would agree, and add that the study of semantic fields is necessary for the study of semantic primitives and universals.
The existence of the International Phonetic Alphabet permits cross-linguistic comparisons of phonological systems. The existence of (fairly) standardized grammatical categories allows us to search for universals of grammar. Anthropology has the Outline of Cultural Materials (Murdock et al. 1987). Chemistry has the periodic table. What does semantics have? I suggest that we cooperate to produce a standardized list of semantic domains. Such a list would enable us to do cross-linguistic comparisons and search for linguistic universals in the field of semantics, just as our colleagues are doing in the fields of phonology and grammar. What I have done is only a poor first attempt in this direction, but I hope it will lead to productive avenues of research. 11

Endnotes

1.
SIL International (the Summer Institute of Linguistics International) is an organization of volunteers, devoted to the promotion and development of minority languages. SIL International works in over 50 countries and over 1 000 languages.

2.
In the interests of simplicity and naturalness, if not accuracy, this article employs the term 'word' to refer to lexical items of all sorts, including roots, derivatives, compounds, idioms, and phrases.

3.
This estimate is based on observation of the number of years it has taken to produce published dictionaries, both within and outside of SIL, and has been confirmed by numerous SIL colleagues.

4.
Thanks are due the Bantu Initiative for funding these workshops.

5.
Both languages are Bantu. Gikuyu is spoken in Kenya, and Lugwere in Uganda. Dr. Mary Muchiri of Daystar University organized the Gikuyu workshop, and Dr. Ruth Mukama of Makerere University the Lugwere workshop.

6.
Ideally lexicographic research should utilize both semantic domains and a concordance. However, unless a computerized text corpus running into the millions of words is available, using a list of domains is the only effective way of collecting words. If no corpus is available, it would be good to begin collecting or producing one.

7.
These materials are currently being translated into Swahili, and plans are to have them translated into French, Spanish, Chinese, and other major languages of the world.

8.
By comparison many bilingual dictionaries are published with only 3 000-5 000 entries.

9.
All three languages are Bantu. Lugwere and Lunyole are spoken in Uganda, and Kitharaka in Kenya.

10.
In fact, there are far more than the literature would suggest. It is apparent that lexical relations are not all the same sort of thing. I believe that lexical relations are based on similarities of meaning, and are as varied as the meanings of words.

11.
Copies of the author's list of semantic domains and related materials are available from him via email at ron_moe@sil.org. The materials are also available in Swahili.

http://lexikos.journals.ac.za