Divergent Approaches to Corpus Processing: The Need for Standardisation

Esau Mangoya

doi:10.5788/19-1-177

Esau Mangoya, African Languages Research Institute, University of Zimbabwe, Harare, Zimbabwe

Abstract

This article discusses some problems encountered in the processing of the Shona corpus. Most of the problems deal with the handling of adoptives, punctuation and individuals' idiolects. It also discusses the problem ensuing from an attempt to standardise the formats used in the handling of the corpus. The way a corpus is processed is critical in determining its quality. This article aims to show how the different linguistic backgrounds of the processors affect the appreciation of some vital aspects of the corpus. One of the acclaimed advantages of a corpus is that it allows research to be done on natural language. An ideal corpus should be a body of texts combined in a principled way to become a reliable language bank from which researchers retrieve data for various research purposes. A good corpus provides an authoritative body of linguistic evidence which can support generalisations and against which hypotheses can be tested. Given this invaluable status of a corpus, the article assesses the processing of the Shona corpus and discusses how some aspects of the processing may impact negatively on its quality.

Published

2009-11-30

How to Cite

Mangoya, E. (2009). Divergent Approaches to Corpus Processing: The Need for Standardisation. Lexikos, 19(1). https://doi.org/10.5788/19-1-177

Issue

Vol. 19 No. 1 (2009): Lexikos 19 Supplement

Section

Artikels/Articles

Copyright of all material published in Lexikos will be vested in the Board of Directors of the Woordeboek van die Afrikaanse Taal. Authors are free, however, to use their material elsewhere provided that Lexikos (AFRILEX Series) is acknowledged as the original publication source.

Creative Commons License CC BY 4.0