Semi-automatic Term Extraction for the African Languages, with Special Reference to Northern Sotho

  • Elsabé Taljard Department of African Languages, University of Pretoria, Pretoria, Republic of South Africa
  • Gilles-Maurice de Schryver Department of African Languages and Cultures, Ghent University, Ghent, Belgium and Department of African Languages, University of Pretoria, Pretoria, Republic of South Africa

Abstract

Abstract: Worldwide, semi-automatically extracting terms from corpora is becoming the norm for the compilation of terminology lists, term banks or dictionaries for special purposes. If African-language terminologists are willing to take their rightful place in the new millennium, they must not only take cognisance of this trend but also be ready to implement the new technology. This article argues that the best way to do both at this stage is to opt for computationally straightforward alternatives (i.e. use 'raw corpora') and to make use of widely available software tools (e.g. WordSmith Tools). The main aim is therefore to discover whether or not the semi-automatic extraction of terminology from untagged and unmarked running text by means of basic corpus query software is feasible for the African languages. In order to answer this question, a full-blown case study revolving around Northern Sotho linguistic texts is discussed in great detail. The computational results are compared throughout with the outcome of a manual excerption, and vice versa. Attention is given to the concepts 'recall' and 'precision'; different approaches are suggested for the treatment of single-word terms versus multi-word terms; and the various findings are summarised in a Linguistics Terminology lexicon presented as an Appendix.
Keywords: TERMINOLOGY, TERMINOGRAPHY, MANUAL EXCERPTION, READING AND MARKING, SEMI-AUTOMATIC TERM EXTRACTION, RETRIEVAL, AFRICAN LANGUAGES, NORTHERN SOTHO (SEPEDI), RAW CORPORA, PRETORIA SEPEDI CORPUS (PSC), WORDSMITH TOOLS, WEIRDNESS RATIO, KEY WORD, LOG-LIKELIHOOD, RECALL, PRECISION, MOTHER TERM, SINGLE-WORD TERM, MULTI-WORD TERM, STEM, ROOT, KEY-WORD-IN-CONTEXT (KWIC), COLLOCATION, COLLOCATE, LEXICAL GAP, CLUSTER, LINGUISTICS TERMINOLOGY LEXICON
Senaganwa: Go ntšhwa ga mareo ka tirišo ya seripa sa semotšhene malebana le maleme a Afrika, šedi ye kgolo e lego Sesotho sa Leboa (Sepedi).
Go ntšhwa ga mareo ka tirišo ya seripa sa semotšhene go tšwa ka gare ga dikhophase go thomile go ba setlwaedi go hlangweng ga mananeo a mareo, dipanka tša mareo goba dipukuntšu mererong yeo e itšego lefaseng ka bophara. Ge e le gore boramareo ba maleme a Afrika ba ikemišeditše go tšea madulo a bona mo mileneamong wo mofsa, ga ba swanela go hlokomela fela tsela ye, eupša ba swanetše gape ke go ikemišetša go diriša theknolotši ye mphsa. Mo taodišwaneng ye go hlalošwa gore mo nakong ye, tsela ye kaone ya go dira dilo tše pedi tše go boletšwego ka tšona ke go kgetha ditlhamolo tša thwii tšeo di dirišago khomphutha (se se ra gore tšhomišo ya khophase) le go šomiša ditlabakelo tša software (bj.k. WordSmith Tools) tšeo di lego gona gohle. Ka fao maikemišetšo a magolo ke go humana ge e ka ba go ntšhwa ga mareo ka seripa sa semotšhene go tšwa ka gare ga khophase yeo e se nago ditlaleletšo tšeo di tseneletšego ka mašakaneng, tša go hlahla, go ka dirišwa malemeng a Afrika goba aowa. Gore re kgone go araba potšišo ye, go hlalošitšwe ka tsinkelo mohlala wa taba ya go nyakišišwa yeo e amanego le diteng tša thutapolelo tša Sesotho sa Leboa. Dipoelo tšeo di humanwego ka go diriša khomphutha di bapetšwa ka gohle le dipoelo tšeo di humanwego ge go dirišwa kgetho ya mantšu ka matsogo. Šedi e fiwa dikgopolo tša kgakologelo (recall) le nepagalo (precision); mekgwa yeo e fapafapanego e a akanywa gore e kgone go hlatholla mareo a lentšu le tee ge a bapetšwa le mareo a mantšu a mantši; gomme dikhumano tšeo di fapanego di akaretšwa ka gare ga pukuntšu ya Mareo a Thutapolelo yeo e tšweletšwago bjalo ka Mamatletšo.
Mantšu a bohlokwa: MAREO, MONGWALO WA MAREO, KGETHO YA MANTŠU KA MATSOGO, GO BALA LE GO SWAYA, GO NTŠHWA GA MAREO KA SERIPA SA SEMOTŠHENE, GO HWETŠA GAPE, MALEME A AFRIKA, SESOTHO SA LEBOA (SEPEDI), DIŠEGONTŠU (DIKHOPHASE), KHOPHASE YA SESOTHO SA LEBOA YA TSHWANE (KST), WORDSMITH TOOLS, WEIRDNESS RATIO, LENTŠU LA BOHLOKWA, LOG-LIKELIHOOD, KGAKOLOGELO, NEPAGALO, LEREO LA MOTHEO, LEREO LA LENTŠU LE TEE, LEREO LA MANTŠU A MANTŠI, KUTU, MODU, LENTŠU LA BOHLOKWA KA GARE GA KAMANO (LBGK), PEAKANYO, BEAKANYA, TLHOKEGO YA LEREO, SEHLOPHA, PUKUNTŠU YA MAREO A THUTAPOLELO
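The abstract describes comparing the computational results with a manual excerption in terms of 'recall' and 'precision'. As a minimal sketch of how such a comparison could be scored, assuming both outputs are available as plain word lists, the Python below uses the standard definitions: recall is the share of manually excerpted terms that the software also found, and precision is the share of software candidates that are genuine terms. The function name `recall_precision` and both sample lists are hypothetical illustrations, not data or code from the article.

```python
def recall_precision(extracted, gold):
    """Return (recall, precision) of extracted candidates against a gold list.

    extracted: candidate terms proposed by the corpus software (assumed input)
    gold:      terms found by manual excerption, treated as the reference
    """
    # Compare case-insensitively and ignore duplicates.
    extracted_set = {term.lower() for term in extracted}
    gold_set = {term.lower() for term in gold}
    hits = extracted_set & gold_set

    recall = len(hits) / len(gold_set) if gold_set else 0.0
    precision = len(hits) / len(extracted_set) if extracted_set else 0.0
    return recall, precision


# Hypothetical sample lists, for illustration only.
manual_terms = ["kutu", "modu", "lereo", "sehlopha"]
software_candidates = ["kutu", "modu", "gomme", "lereo", "fela"]

recall, precision = recall_precision(software_candidates, manual_terms)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy case three of the four manually excerpted terms are retrieved, and three of the five candidates are genuine terms.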
Published
2002-11-30
How to Cite
Taljard, E., & de Schryver, G.-M. (2002). Semi-automatic Term Extraction for the African Languages, with Special Reference to Northern Sotho. Lexikos, 12. https://doi.org/10.5788/12-0-760
Section
Navorsingsartikels / Research Articles