Semi-automatic Term Extraction for the African Languages, with Special Reference to Northern Sotho

  • Elsabé Taljard, Department of African Languages, University of Pretoria, Pretoria, Republic of South Africa
  • Gilles-Maurice de Schryver, Department of African Languages and Cultures, Ghent University, Ghent, Belgium, and Department of African Languages, University of Pretoria, Pretoria, Republic of South Africa
Keywords: TERMINOLOGY, TERMINOGRAPHY, MANUAL EXCERPTION, READING AND MARKING, SEMI-AUTOMATIC TERM EXTRACTION, RETRIEVAL, AFRICAN LANGUAGES, NORTHERN SOTHO (SEPEDI), RAW CORPORA, PRETORIA SEPEDI CORPUS (PSC), WORDSMITH TOOLS, WEIRDNESS RATIO, KEY WORD, LOG-LIKELIHOOD, RECALL, PRECISION, MOTHER TERM, SINGLE-WORD TERM, MULTI-WORD TERM, STEM, ROOT, KEY-WORD-IN-CONTEXT (KWIC), COLLOCATION, COLLOCATE, LEXICAL GAP, CLUSTER, LINGUISTICS TERMINOLOGY LEXICON

Abstract

Abstract: Worldwide, semi-automatically extracting terms from corpora is becoming the norm for the compilation of terminology lists, term banks or dictionaries for special purposes. If African-language terminologists are willing to take their rightful place in the new millennium, they must not only take cognisance of this trend but also be ready to implement the new technology. In this article it is advocated that the best way to do both at this stage is to opt for computationally straightforward alternatives (i.e. use 'raw corpora') and to make use of widely available software tools (e.g. WordSmith Tools). The main aim is therefore to discover whether or not the semi-automatic extraction of terminology from untagged and unmarked running text by means of basic corpus query software is feasible for the African languages. In order to answer this question a full-blown case study revolving around Northern Sotho linguistic texts is discussed in great detail. The computational results are compared throughout with the outcome of a manual excerption, and vice versa. Attention is given to the concepts 'recall' and 'precision'; different approaches are suggested for the treatment of single-word terms versus multi-word terms; and the various findings are summarised in a Linguistics Terminology lexicon presented as an Appendix.

Senaganwa: Go ntšhwa ga mareo ka tirišo ya seripa sa semotšhene malebana le maleme a Afrika, šedi ye kgolo e lego Sesotho sa Leboa (Sepedi).
Go ntšhwa ga mareo ka tirišo ya seripa sa semotšhene go tšwa ka gare ga dikhophase go thomile go ba setlwaedi go hlangweng ga mananeo a mareo, dipanka tša mareo goba dipukuntšu mererong yeo e itšego lefaseng ka bophara. Ge e le gore boramareo ba maleme a Afrika ba ikemišeditše go tšea madulo a bona mo mileneamong wo mofsa, ga ba swanela go hlokomela fela tsela ye, eupša ba swanetše gape ke go ikemišetša go diriša theknolotši ye mphsa. Mo taodišwaneng ye go hlalošwa gore mo nakong ye, tsela ye kaone ya go dira dilo tše pedi tše go boletšwego ka tšona ke go kgetha ditlhamolo tša thwii tšeo di dirišago khomphutha (se se ra gore tšhomišo ya khophase) le go šomiša ditlabakelo tša software (bj.k. WordSmith Tools) tšeo di lego gona gohle. Ka fao maikemišetšo a magolo ke go humana ge e ka ba go ntšhwa ga mareo ka seripa sa semotšhene go tšwa ka gare ga khophase yeo e se nago ditlaleletšo tšeo di tseneletšego ka mašakaneng, tša go hlahla, go ka dirišwa malemeng a Afrika goba aowa. Gore re kgone go araba potšišo ye, go hlalošitšwe ka tsinkelo mohlala wa taba ya go nyakišišwa yeo e amanego le diteng tša thutapolelo tša Sesotho sa Leboa. Dipoelo tšeo di humanwego ka go diriša khomphutha di bapetšwa ka gohle le dipoelo tšeo di humanwego ge go dirišwa kgetho ya mantšu ka matsogo. Šedi e fiwa dikgopolo tša kgakologelo (recall) le nepagalo (precision); mekgwa yeo e fapafapanego e a akanywa gore e kgone go hlatholla mareo a lentšu le tee ge a bapetšwa le mareo a mantšu a mantši; gomme dikhumano tšeo di fapanego di akaretšwa ka gare ga pukuntšu ya Mareo a Thutapolelo yeo e tšweletšwago bjalo ka Mamatletšo.

Mantšu a bohlokwa: MAREO, MONGWALO WA MAREO, KGETHO YA MANTŠU KA MATSOGO, GO BALA LE GO SWAYA, GO NTŠHWA GA MAREO KA SERIPA SA SEMOTŠHENE, GO HWETŠA GAPE, MALEME A AFRIKA, SESOTHO SA LEBOA (SEPEDI), DIŠEGONTŠU (DIKHOPHASE), KHOPHASE YA SESOTHO SA LEBOA YA TSHWANE (KST), WORDSMITH TOOLS, WEIRDNESS RATIO, LENTŠU LA BOHLOKWA, LOG-LIKELIHOOD, KGAKOLOGELO, NEPAGALO, LEREO LA MOTHEO, LEREO LA LENTŠU LE TEE, LEREO LA MANTŠU A MANTŠI, KUTU, MODU, LENTŠU LA BOHLOKWA KA GARE GA KAMANO (LBGK), PEAKANYO, BEAKANYA, TLHOKEGO YA LEREO, SEHLOPHA, PUKUNTŠU YA MAREO A THUTAPOLELO
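
The abstract names log-likelihood key words and the concepts 'recall' and 'precision' for comparing computational output with a manual excerption. The following is only a minimal sketch of how these measures relate. It assumes a common two-corpus log-likelihood formulation and makes no claim to reproduce the exact computation in WordSmith Tools or the procedure of the article. All function names, word counts, corpus sizes and the threshold are hypothetical, invented for illustration.

```python
import math


def log_likelihood(a, b, c, d):
    """Two-corpus log-likelihood for one word.

    a: frequency in the study (special-field) corpus of c tokens
    b: frequency in the reference (general) corpus of d tokens
    """
    total = a + b
    expected_study = c * total / (c + d)
    expected_ref = d * total / (c + d)
    ll = 0.0
    if a > 0:  # treat 0 * log(0) as 0
        ll += a * math.log(a / expected_study)
    if b > 0:
        ll += b * math.log(b / expected_ref)
    return 2 * ll


def key_words(study, reference, study_size, ref_size, threshold=6.63):
    """Words relatively more frequent in the study corpus whose
    log-likelihood reaches the (assumed) threshold."""
    keys = set()
    for word, a in study.items():
        b = reference.get(word, 0)
        overused = a / study_size > b / ref_size
        if overused and log_likelihood(a, b, study_size, ref_size) >= threshold:
            keys.add(word)
    return keys


def precision_recall(candidates, manual):
    """Precision and recall of extracted candidates against a manual list."""
    candidates, manual = set(candidates), set(manual)
    hits = candidates & manual
    precision = len(hits) / len(candidates) if candidates else 0.0
    recall = len(hits) / len(manual) if manual else 0.0
    return precision, recall


# Toy data only: invented counts, not taken from any real corpus.
study_counts = {"lereo": 40, "kutu": 25, "puku": 30, "modu": 1, "le": 300}
reference_counts = {"lereo": 5, "kutu": 50, "puku": 100, "modu": 60, "le": 30000}
STUDY_SIZE, REF_SIZE = 10_000, 1_000_000

candidates = key_words(study_counts, reference_counts, STUDY_SIZE, REF_SIZE)
manual_terms = {"lereo", "kutu", "modu"}  # hypothetical manual excerption
precision, recall = precision_recall(candidates, manual_terms)
print(sorted(candidates), round(precision, 2), round(recall, 2))
```

In this toy run a frequent function word with the same relative frequency in both corpora scores zero and is never proposed, while a rare term is missed (lowering recall) and a frequent non-term is proposed (lowering precision).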
How to Cite
Taljard, E., & de Schryver, G.-M. (2002). Semi-automatic Term Extraction for the African Languages, with Special Reference to Northern Sotho. Lexikos, 12. https://doi.org/10.5788/12-0-760
Section
Navorsingsartikels / Research Articles