Semi-automating the Reading Programme for a Historical Dictionary Project

Tim van Niekerk; Johannes Schäfer; Ulrich Heid

doi:10.5788/28-1-1468

Tim van Niekerk, Dictionary Unit for South African English, Rhodes University, Grahamstown, South Africa
Johannes Schäfer, Department of Information Science and Natural Language Processing, University of Hildesheim, Hildesheim, Germany
Ulrich Heid, Department of Information Science and Natural Language Processing, University of Hildesheim, Hildesheim, Germany

Abstract

This paper describes the resources and software procedures used or developed in a major enabling step towards the revision of the scholarly reference work A Dictionary of South African English on Historical Principles (DSAE, Silva et al. 1996): the semi-automatic generation of a digitally sourced lexical database on which new and updated dictionary entries will be based, together with the parallel addition of a new corpus of South African English (SAE) to the project. Drawing on online data sources and an extensive list of known SAE word forms, we have developed a software toolchain to gather, encode, annotate and collate textual sources, producing: (i) a 3.1-billion-word, part-of-speech-annotated corpus of South African English; (ii) a lexical database of illustrative quotations for over 20,000 known SAE word forms, available for selection at the entry-revision stage; and (iii) a list of potential new variant spellings and headword inclusion candidates. For recent electronic sources, these steps replace the mechanical aspects of quotation gathering, which is normally undertaken manually through a reading programme that requires years of teamwork to achieve sufficient coverage (cf. Hicks 2010).
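
As a rough illustration of the collation step described in the abstract, the following minimal Python sketch maps known word forms to the sentences in which they occur. It is not the authors' toolchain: the function name `collect_quotations`, the `(source, text)` input format, the naive sentence splitter and the simple tokenisation are all assumptions made for this example, and part-of-speech annotation is left out entirely.

```python
import re
from collections import defaultdict


def collect_quotations(documents, word_forms):
    """Map each known word form to the (source, sentence) pairs containing it.

    `documents` is an iterable of (source_label, text) pairs and `word_forms`
    an iterable of known word forms. Both are hypothetical input formats
    chosen for this sketch, not the project's actual data model.
    """
    known = {form.lower() for form in word_forms}
    quotations = defaultdict(list)
    for source, text in documents:
        # Naive sentence split on terminal punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            # Lower-cased tokens, allowing internal hyphens and apostrophes.
            tokens = set(re.findall(r"[a-z]+(?:[-'][a-z]+)*", sentence.lower()))
            for form in sorted(tokens & known):
                quotations[form].append((source, sentence))
    return dict(quotations)


# Invented sample data, used only to show the shape of the result.
documents = [
    ("news-2017", "We had a braai on Sunday. The bakkie broke down!"),
    ("blog-2018", "Nothing of interest here."),
]
result = collect_quotations(documents, ["braai", "bakkie", "lekker"])
```

In this sketch a word form with no attestation, such as "lekker" in the sample data, is simply absent from the result. A real pipeline would also need to handle multi-word forms, variant spellings and source metadata such as dates.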

Published

2018-12-17

How to Cite

van Niekerk, T., Schäfer, J., & Heid, U. (2018). Semi-automating the Reading Programme for a Historical Dictionary Project. Lexikos, 28(1). https://doi.org/10.5788/28-1-1468

Issue

Vol. 28 (2018)

Section

Artikels/Articles

Copyright of all material published in Lexikos will be vested in the Board of Directors of the Woordeboek van die Afrikaanse Taal. Authors are free, however, to use their material elsewhere provided that Lexikos (AFRILEX Series) is acknowledged as the original publication source.

Creative Commons License CC BY 4.0