New Advances in Corpus-based Lexicography

Arvi Hurskainen

Abstract: This article presents various approaches used in corpus-based computational lexicography. It argues that, for computational lexicography to be efficient, precise and comprehensive, it should use a method in which the corpus text is first analysed and the results of this analysis are then processed further to meet the needs of a dictionary. This method has several advantages, including high precision and recall, as well as the possibility of automating the process much further than with more traditional computational methods. The frequency list obtained by using the lemma (the equivalent of the headword) as the basis helps in selecting the words to be included in the dictionary. The approach is demonstrated through its various phases by applying SALAMA (the Swahili Language Manager) to the process. Manual work is needed in the phase when examples of use are selected from the corpus and possibly modified. However, the list of examples of use, arranged alphabetically according to the corresponding headword, can also be produced automatically. Thus the alphabetical list of headwords with examples of use is the material on which the lexicographer works manually. The article deals with problems encountered in compiling traditional printed dictionaries, and it excludes electronic dictionaries and thesauri.

Keywords: LEXICOGRAPHY, DICTIONARY, LANGUAGE TECHNOLOGY, COMPUTATIONAL LINGUISTICS, AUTOMATIC COMPILATION, DICTIONARY TESTING, INFORMATION RETRIEVAL, MORPHOLOGICAL ANALYSIS, SEMANTIC ANALYSIS, DISAMBIGUATION, HEURISTICS

Opsomming: Nuwe ontwikkelinge in korpusgebaseerde leksikografie. Hierdie artikel beskryf verskillende benaderings wat in korpusgebaseerde rekenaarleksikografie gebruik word. Daar word aangevoer dat vir rekenaarleksikografie om doelmatig, noukeurig en omvattend te wees, dit die metode behoort te gebruik waarby die korpusteks eers ontleed word, en die resultaat van hierdie ontleding dan verder verwerk word om te voldoen aan die behoeftes van 'n woordeboek. Hierdie metode het verskillende voordele, insluitende 'n hoë mate van noukeurigheid en herwinning, sowel as die moontlikheid om die proses baie verder as met meer tradisionele rekenaarmetodes te outomatiseer. Die frekwensielys verkry deur die lemma (die ekwivalent van die trefwoord) as basis te gebruik, help met die keuse van woorde vir insluiting in die woordeboek. Die benadering word geïllustreer deur verskillende fases van die aanwending van SALAMA (die Swahili Language Manager) in die proses. Werk met die hand sal nodig wees gedurende die stadium wanneer gebruiksvoorbeelde uit die korpus gekies en moontlik aangepas word. Die lys gebruiksvoorbeelde, alfabeties gerangskik volgens die ooreenstemmende trefwoord, kan egter ook outomaties voortgebring word. Die artikel behandel probleme wat teëgekom word by die samestelling van 'n tradisionele gedrukte woordeboek, en dit sluit elektroniese woordeboeke en tesourusse uit.

Sleutelwoorde: LEKSIKOGRAFIE, WOORDEBOEK, TAALTEGNOLOGIE, REKENAARLINGUISTIEK, OUTOMATIESE SAMESTELLING, WOORDEBOEKTOETSING, INLIGTINGSHERWINNING, MORFOLOGIESE ONTLEDING, SEMANTIESE ONTLEDING, ONDUBBELSINNIGMAKING, HEURISTIEK
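
The analyse-first pipeline outlined in the abstract (lemma-based frequency list, headword selection, examples of use grouped alphabetically by headword) might be sketched roughly as follows. This is only an illustration under stated assumptions: the toy corpus and its lemma annotations are hand-written and are not SALAMA output, and the function names and the min_count threshold are hypothetical, not taken from the article.

```python
from collections import Counter, defaultdict

# Hypothetical pre-analysed corpus: each sentence is a list of
# (surface form, lemma) pairs. Hand-written for illustration only;
# this is NOT output from SALAMA.
CORPUS = [
    [("mtoto", "mtoto"), ("anasoma", "soma"), ("kitabu", "kitabu")],
    [("watoto", "mtoto"), ("walisoma", "soma"), ("vitabu", "kitabu")],
    [("mtoto", "mtoto"), ("anapenda", "penda"), ("vitabu", "kitabu")],
]


def lemma_frequencies(corpus):
    """Count occurrences per lemma, so inflected forms are pooled."""
    return Counter(lemma for sentence in corpus for _, lemma in sentence)


def select_headwords(freqs, min_count):
    """Keep lemmas at or above an (assumed) frequency threshold, sorted A-Z."""
    return sorted(lemma for lemma, n in freqs.items() if n >= min_count)


def examples_by_headword(corpus, headwords):
    """Attach every sentence containing a headword's lemma to that headword."""
    wanted = set(headwords)
    examples = defaultdict(list)
    for sentence in corpus:
        text = " ".join(surface for surface, _ in sentence)
        # Use a set so a sentence is listed once per headword,
        # even if the lemma occurs in it more than once.
        for lemma in {lemma for _, lemma in sentence} & wanted:
            examples[lemma].append(text)
    # Return entries in alphabetical headword order.
    return {hw: examples[hw] for hw in sorted(wanted)}


freqs = lemma_frequencies(CORPUS)
headwords = select_headwords(freqs, min_count=2)
entries = examples_by_headword(CORPUS, headwords)

for headword, sentences in entries.items():
    print(headword, freqs[headword], sentences)
```

In this sketch the automatically produced list of headwords with candidate examples is the point where, as the abstract notes, the lexicographer's manual selection and editing would begin.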




DOI: https://doi.org/10.5788/13-0-725


ISSN 2224-0039 (online); ISSN 1684-4904 (print)

Creative Commons License CC BY 4.0


