Corpus-based Lexicography for Lesser-resourced Languages — Maximizing the Limited Corpus

D.J. Prinsloo


This article focuses on lesser-resourced languages for which only very limited corpora are available, and on how such relatively small, often unbalanced raw corpora can be maximally utilized for lexicographic purposes to obtain results similar to those obtained from bigger corpora. Sepedi and Afrikaans will be studied in this regard. The aim is to determine to what extent enlarging a corpus, e.g. from one million to 10 million words and from 10 million to 100 million words, enhances its potential for (a) macrostructure compilation, (b) sourcing information on the most important microstructural aspects and (c) the creation of lexicographic tools. It will be argued that valuable and even sufficient data for the compilation of a specific dictionary can be extracted from a relatively small corpus of approximately one million words, but that "bigger" in some instances indeed means "better".


Keywords: corpus-based lexicography; lesser-resourced languages; limited corpora; corpus tools; lexicographic tools


ISSN 2224-0039 (online); ISSN 1684-4904 (print)

Creative Commons License CC BY 4.0

Powered by OJS and hosted by Stellenbosch University Library and Information Service since 2011.


