Corpus-based Headword Selection Procedures for LSP Word Lists and LSP Dictionaries

In compiling both Language for Specific Purposes (LSP) word lists for foreign language learners and LSP dictionaries, the headword-selection process is of paramount importance. LSP word lists and LSP dictionaries will function effectively if they contain appropriate terms and register items, i.e. the lexical items that end users need. In this paper, we first present corpus-based LSP word lists, with special emphasis on how they were compiled. In the process, the make-up and size of the specialised corpus are important, as is the choice of the headword selection methods used. Among the possible criteria are word frequency, keyness, specialised occurrence, range, and dispersion, as well as some non-corpus linguistic methods that are more rarely applied. A greater variety of methods is used for compiling headword lists for LSP dictionaries, and of the corpus linguistic methods, frequency is typically the only one applied. The article compares headword selection procedures for LSP word lists and LSP dictionaries before discussing how they can mutually inform one another.


Introduction
Word lists have many purposes in the process of teaching and learning a foreign language: they can be used as resources for vocabulary learning (Khani and Tazik 2013; Yang 2015), guidelines for designing curricula and courses, as well as for selecting reading and listening materials (Wang, Liang and Ge 2008; Jin et al. 2013), and guidelines for teachers in organising their explicit vocabulary teaching (Khani and Tazik 2013). The selection of headwords for inclusion in certain word lists has become an important strand of applied research in the field of foreign language teaching and learning in general, and language for specific purposes (LSP) in particular. As the vocabulary sizes attained by native speakers are never attained by the vast majority of foreign language learners, the rationale guiding this type of research is to produce word lists of a size that is manageable for them to learn. Word lists should provide language learners with the most useful words they need for the particular language function they are pursuing, for instance, attending university studies in a foreign language or reading research articles from a particular specialist field in a foreign language. Some of these functions are related to LSP contexts and for them, consequently, LSP word lists are produced. Most of them are, in fact, English for Specific Purposes (ESP) word lists, given that English is the language which is most widely taught as a foreign language around the world.
In the past, both general and LSP word lists used to be compiled manually, typically relying on the compiler's intuition and, more rarely, on an authentic corpus of a very limited size by today's standards (West 1953; cf. Gilner 2011). However, over the past two decades, they have principally been derived from vast authentic corpora of general or specialised texts, which are carefully constructed with particular types of foreign language learners in mind, and then scanned for words meeting certain criteria or a combination of criteria, such as the frequency of occurrence, distribution, range, or keyness (Coxhead 2000; Coxhead and Hirsh 2007; Brezina and Gablasova 2013; Browne et al. 2013a; Gardner and Davies 2014, etc.). The choice of the criteria and the related "cutoff" points (for instance, how frequent a word has to be to be included in a certain word list) are informed by the target users' needs and involve a number of decisions during the compilation of the list. As corpora and software solutions evolve, so do the different methods for selecting those words. In this paper we will discuss various word lists intended for LSP learning, with a focus on how they were compiled.
Selection of headwords for any dictionary, including specialised dictionaries, is also governed by the needs of its end users (Fuertes-Olivera and Arribas-Baño 2008), i.e. what should be taken into account are different types of users, user situations and user needs (Tarp 2008), according to the theory of lexicographic functions (Bergenholtz and Tarp 1995; Tarp 2008). In principle, there are four main methods of selecting headwords for dictionaries: relying on existing dictionaries, on grammar and etymology, on canonical literary texts, or on corpora (Esandi-Baztan and Fuertes-Olivera 2020). The fourth method, compiling headword lists based on corpora, has been an option for the past few decades and is now widely used in the process of making general dictionaries. However, as Bowker (2010: 166) notes, the use of corpus linguistic methods has been rather slow to take hold in the creation of specialised dictionaries. When it comes to the methods and procedures of compiling corpora for the purpose of creating LSP dictionaries as a type of specialised dictionary, one may only rarely find detailed accounts of this issue (cf. Khumalo 2015; Đurović 2021; Kruse and Heid 2021). Typically, few details are presented about the corpus-linguistic procedures employed in selecting headwords from specialised corpora either: most studies only briefly note that it is the frequency criterion that was applied (cf. Rundell and Kilgarriff 2011), without delving into the type of details that are provided by various specialised word-list compilers (cf. Lei and Liu 2016; Todd 2017; Dang 2018, etc.). In addition, further corpus-linguistic procedures for headword selection beyond simple frequency are only sometimes mentioned in LSP dictionary research and projects (cf. Khumalo 2015; Đurović 2021; Kruse and Heid 2021).
In this paper we compare corpus-based headword selection procedures used for producing LSP word lists and LSP dictionaries, bearing in mind that there are some similarities (although, also, important differences) between these two types of lexicographic products. We focus on the steps in headword selection that are based on corpora, given the important place that corpora currently have in the creation of both products. The premise from which we depart is that the two fields can mutually inform and contribute to one another in terms of the corpus-based headword selection procedures.
We will first present an overview of word lists, with a special focus on LSP word lists and how they are produced (section 2), after which we discuss LSP dictionaries and how headwords are selected for them (section 3). Section 4 compares headword selection for LSP word lists and LSP dictionaries.

Word lists
This section first provides a brief overview of general and academic word lists, after which the focus is narrowed down to discipline-specific or LSP word lists. Reviews of word lists used for the purposes of foreign language teaching and learning typically start by presenting West's General Service List (GSL) (1953) (cf. Coxhead 2000; Coxhead and Hirsh 2007; Gardner and Davies 2014; Dang and Webb 2016; Dang, Coxhead and Webb 2017; Dang 2018; McQuillan 2020, etc.). Although West's list was not generated using computer software, it was based on an authentic corpus of 5 million words representing General English. About 2,000 word families were manually extracted and suggested to be the first words to be learned by any English language learner (they were mostly chosen according to the frequency criterion). This word list was very influential in English Language Teaching (ELT) and was used widely for decades (Nation 2013; Coxhead 2018). The emergence of computer solutions providing data on a word's frequency and coverage in a corpus showed why: it turned out that West's list covered about 80% of the words used in most general English texts, or 4 in every 5 words. As English has about 70,000 word families (Nagy and Anderson 1984; Nation 2013), this word list proved to be a very useful resource (Coxhead 2000; Nation 2013).
In the ensuing decades, other English word lists were built too (for instance, Campion and Elley 1971; Praninskas 1972; Lynn 1973; Ghadessy 1979; Xue and Nation 1984, etc.). However, the next word list to match the influence of the GSL, the Academic Word List (AWL), came only in 2000 (Coxhead 2000). Its influence lies not only in how widely it was used in ELT, but also in the fact that the methodology of its compilation set standards for many of the ensuing word lists (among them, Fraser 2007; Konstantikis 2007; Wang, Liang and Ge 2008; Khani and Tazik 2013; Valipouri and Nassaji 2013; Hsu 2013; Minshall 2013; Hsu 2014; Liu and Han 2015; Yang 2015; Lei and Liu 2016, etc.). The AWL contains 570 word families which are common in academic writing. To produce the list, Coxhead compiled a corpus of 3.5 million words of academic texts. The words were extracted according to the following criteria: (1) specialised occurrence (the words had to be outside high-frequency general words (outside the GSL in this case)), (2) frequency, (3) dispersion (the words had to occur in all the corpus's subsections while featuring a certain frequency in all of them, and they also had to occur in at least half of the academic disciplines involved in the corpus) (Coxhead 2000). These carefully weighed and strict criteria ensured that the word list would have a substantial coverage in any academic corpus, not just in the one it was derived from (Coxhead 2000). Indeed, the AWL's coverage of 10% in the corpus of its origin held strongly in many other academic corpora compiled later: for instance, it reached 10.07% in the academic medical corpus (Chen and Ge 2007), 11.17% in the academic applied linguistics corpus (Vongpumivitch et al. 2009), 9.96% in the academic chemistry corpus (Valipouri and Nassaji 2013), etc. These impressive results confirmed that any future word list would have to be carefully made, so as to be as useful as possible in a variety of similar language contexts.
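To make the notion of coverage concrete, the following minimal Python sketch computes the share of running words in a text that belong to a given list. It is an illustration under simplifying assumptions, not the procedure used in the cited studies: the function name `coverage`, the toy sentence and the toy list are invented here, and matching is done on exact word forms, whereas the AWL counts all members of a word family.

```python
def coverage(tokens, word_list):
    """Return the share of running words (tokens) that belong to word_list."""
    if not tokens:
        return 0.0
    listed = set(word_list)
    hits = sum(1 for token in tokens if token in listed)
    return hits / len(tokens)


# Toy data, invented for illustration only.
text = "the data indicate that the method is significant for the analysis".split()
toy_list = {"data", "indicate", "method", "significant", "analysis"}

# 5 of the 11 running words are on the toy list.
print(f"{coverage(text, toy_list):.1%}")
```

Run on an independent corpus rather than on the corpus a list was derived from, the same calculation corresponds to the kind of validation check reported for the AWL above.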
One of the rare objections that may be raised against the AWL is the relatively small corpus it was derived from, given that it aims to be a general academic word list, an issue which the ensuing general lists have been trying to overcome. The dated GSL needed to be replaced, and two new GSLs were offered for both research and instructional purposes in 2013. Brezina and Gablasova (2013) based their New GSL, containing about 2,500 lemmas, on a combined corpus of samples from 4 different corpora, together making 12 billion words. The lemmas from each of the 4 corpora were selected based on the criterion of the Average Reduced Frequency (this measure is obtained from the absolute frequency of the word and its distribution in the corpus (Savický and Hlavácová 2002)), and then the 4 lists were compared for overlaps, with the shared items entering the New GSL. The same year, Browne, Culligan and Phillips (2013a) used a 273-million-word section of the Cambridge English Corpus to derive their list of about 2,800 lemmas based on the frequency criterion. Both lists outperform the old GSL in modern corpora, typically by a few percentage points.
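The idea behind Average Reduced Frequency can be illustrated with a short sketch. It follows the commonly cited formulation of the measure, in which the corpus is treated as circular and each gap between successive occurrences of a word is capped at the average gap N/f (corpus size divided by the word's frequency). The function name, the argument names and the toy positions are hypothetical, and the cited lists were not necessarily computed with this exact code.

```python
def average_reduced_frequency(positions, corpus_size):
    """ARF of one word, given the token positions at which it occurs.

    positions: 0-based token indices of the word's occurrences.
    corpus_size: total number of tokens in the corpus (N).
    """
    f = len(positions)
    if f == 0:
        return 0.0
    pos = sorted(positions)
    v = corpus_size / f  # average gap between occurrences
    # The first distance wraps around the end of the corpus (circular view).
    distances = [pos[0] + corpus_size - pos[-1]]
    distances += [b - a for a, b in zip(pos, pos[1:])]
    return sum(min(d, v) for d in distances) / v


# Toy corpus of 100 tokens: the same raw frequency (4), different distributions.
evenly_spread = [0, 25, 50, 75]
clustered = [0, 1, 2, 3]
print(average_reduced_frequency(evenly_spread, 100))  # equals the raw frequency
print(average_reduced_frequency(clustered, 100))      # close to 1
```

A word that is spread evenly keeps its raw frequency, while a word that occurs in a single burst is reduced towards 1, which is why the measure combines frequency and distribution in one value.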
Browne, Culligan and Phillips (2013b) also created the New AWL, containing 963 lemmas, by excluding the words already contained in the NGSL. Another replacement for the AWL was offered by Gardner and Davies (2014), who used a 120-million-word corpus (an academic subsection from COCA), to produce a list of about 3,000 lemmas (the Academic Vocabulary List, or the AVL). They did not exclude any group of words, but employed the keyness criterion solely: the authors took into account the ratio of words in their academic corpus, compared to a non-academic corpus. Newman (2016) and Hernandez (2017) found that the AVL outperforms the old AWL, while not much data is available on how the NAWL performs against other similar lists.
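A ratio-based comparison of this kind can be sketched as follows. This is a simplified illustration rather than the exact procedure of the cited study: the function names, the per-million normalisation and the toy counts are assumptions introduced here, and any cut-off applied to the resulting ratio would be the compiler's own decision.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million tokens."""
    return count / corpus_size * 1_000_000


def frequency_ratio(count_target, size_target, count_ref, size_ref):
    """Ratio of a word's normalised frequency in a target (e.g. academic)
    corpus to its normalised frequency in a reference corpus."""
    ref_rate = per_million(count_ref, size_ref)
    if ref_rate == 0:
        return float("inf")  # the word never occurs in the reference corpus
    return per_million(count_target, size_target) / ref_rate


# Toy counts, invented for illustration: 300 hits in a 1-million-word
# academic corpus versus 100 hits in a 2-million-word reference corpus.
print(round(frequency_ratio(300, 1_000_000, 100, 2_000_000), 2))
```

Normalising before dividing matters because the two corpora are usually of different sizes, so raw counts cannot be compared directly.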
Other researchers have investigated whether lists such as the AWL might be created for other languages. Cobb and Horst (2004) studied the vocabulary profile of French and determined that the high-frequency vocabulary of this language is in fact more frequent than the high-frequency vocabulary of English (the 2,000 most frequent French words reach a 90% coverage in most texts they examined), which removes the need for creating additional lists for learners, as these would reach very small coverages. Such results for French did not, however, discourage other researchers from creating corpus-based academic word lists for other languages. A Nordic joint-research project resulted in the creation of academic word lists for Swedish, Norwegian and Danish (Kokkinakis et al. 2012; Jansson et al. 2012; Ribeck et al. 2014; Johannessen et al. 2016). Two more independent lists have also been created for Danish: a word list of general, high-frequency items (2,000 words), as well as a word list of academic vocabulary (402 words) (Jakobsen et al. 2018). An Academic Vocabulary List in Russian has also been compiled recently (Talalakina et al. 2020). The development of all these word lists relied heavily on the English word-list projects presented above.
The word lists mentioned so far include general and non-discipline specific academic word lists. Unlike these, other word lists are much more specialised and these are the focus of this paper. They and the methods used for compiling them will be presented in the following section.

Corpus-based headword selection procedures for LSP word lists
Realising the importance of the role of the communicative contexts in which certain foreign language learners will typically find themselves (Miller 2014: 305), teaching LSP began to be strongly differentiated from teaching General Foreign Language in the 1960s. LSP teachers and researchers realised that taking the learners' specific needs into account, particularly their vocabulary needs, led to more effective teaching of the specialised language that they needed. With the rise of the ICT industry, corpus-based discipline-specific word lists, produced with the use of computers and from vast corpora, began to emerge at the turn of this century. An overview of recent LSP word lists, along with the details of the corpora from which they were derived and the methods used for their creation, is given in the Appendix (while not entirely exhaustive, the table presents most of the word lists which have been described in scholarly papers). As was the case with general and academic word lists, the field of researching and compiling LSP word lists is almost exclusively related to the English language and, consequently, English word lists dominate the literature (as can be seen in the Appendix). Many of these lists follow in the AWL's footsteps given that they rely or build on the criteria used by Coxhead (2000) (see Section 2). Here we will provide a generalised description of the corpora and methods typically used to create LSP word lists.
The texts for LSP corpora are chosen bearing the LSP word list's target users in mind. The corpora from which word lists are produced are typically custom-made, which makes their creation challenging and time-consuming. They also need to be of a relevant size. The corpora from which the LSP word lists were made vary widely in terms of their size, with most of the word lists having been developed from a specialised corpus of 1-2 million words. The LSP word-list compilers who intend to apply the word selection criteria of range and dispersion need to think carefully about the make-up of their corpora, as they generally need to have equal subsections of texts from various subfields. These corpora thus need to be well-structured and balanced. Even though this is a challenging task, some researchers were able to produce significantly large and at the same time well-structured corpora: one example is the English Hard Science Spoken Corpus of 6.5 million words, produced by Dang (2018), which features 12 subsections representing 12 hard science disciplines. This size is all the more impressive bearing in mind that this is a corpus of spoken language.
The sizes of LSP word lists also vary widely -from 92 (Martínez, Beck and Panza 2009) to 1,595 headwords (Dang 2018) and, again, the needs of the end users are taken into account when determining the list's size, as is the case with dictionaries.
The criteria used for the selection of words for various recent LSP word lists can be summarised as follows:
1. frequency (the threshold is set depending on how large a list is wanted);
2. specialised occurrence (being outside the most frequent 2,000 or 3,000 words, so as to avoid general high-frequency words; additionally, being outside the most frequent academic words, as represented by a chosen academic word list; finally, this also assumes the exclusion of proper nouns, symbols, abbreviations, numbers, non-words, etc.);
3. dispersion (typically, occurrence in at least half the disciplines/subsections which make up the corpus, or being below some dispersion value, for which different calculation methods are available);
4. keyness (being found in the specialised corpus more frequently than in a reference corpus);
5. expert opinion (experts use rating scales and assign more points to more technical words);
6. cross-comparison with specialised dictionaries.
The first four are purely corpus-linguistic methods and assume automatic extraction of words based on the word-list compiler's decisions regarding the thresholds applied, while the last two depend on consulting either experts or specialised dictionaries, and are much more time-consuming. The final two steps have generally been avoided in developing most LSP word lists; having applied several corpus-linguistic filters, the word-list compilers found them unnecessary. Experts and dictionaries were consulted in the creation of just four out of the twenty-four LSP word lists presented in the Appendix (Wang, Liang and Ge 2008; Valipouri and Nassaji 2013; Jin et al. 2013; Tongpoon-Patanasorn 2018).
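How the first three criteria can be chained as automatic filters is shown in the following minimal sketch. It assumes a corpus that has already been tokenised and split into subsections; the function name `select_headwords`, the thresholds `min_freq` and `min_range`, and the toy maritime data are all hypothetical and chosen only for illustration. Real projects additionally lemmatise or group words into families and set the thresholds according to the target users' needs.

```python
from collections import Counter


def select_headwords(subcorpora, general_words, min_freq, min_range):
    """Filter candidate headwords by frequency, specialised occurrence and range.

    subcorpora: dict mapping a subsection name to its list of tokens.
    general_words: set of high-frequency general (or academic) words to exclude.
    min_freq: minimum total frequency in the whole corpus.
    min_range: minimum number of subsections in which the word must occur.
    """
    counts = {name: Counter(tokens) for name, tokens in subcorpora.items()}
    total = Counter()
    for section_counts in counts.values():
        total.update(section_counts)

    selected = []
    for word, freq in total.items():
        if freq < min_freq:  # criterion 1: frequency
            continue
        if word in general_words or not word.isalpha():  # criterion 2
            continue
        sections = sum(1 for c in counts.values() if c[word] > 0)
        if sections < min_range:  # criterion 3: range across subsections
            continue
        selected.append(word)
    # Most frequent first; ties broken alphabetically.
    return sorted(selected, key=lambda w: (-total[w], w))


# Toy corpus with three subsections, invented for illustration only.
subcorpora = {
    "engines": "the valve and the piston drive the shaft valve".split(),
    "navigation": "the radar and the valve chart 2 the radar".split(),
    "cargo": "the cargo and the radar hold the cargo valve".split(),
}
general = {"the", "and"}
print(select_headwords(subcorpora, general, min_freq=3, min_range=2))
```

In this toy run only `valve` and `radar` survive: `cargo` fails the frequency threshold, while `the` and `and` are removed as general words despite their high frequency.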
It should be added that the finalised LSP word lists are also typically validated in one or several independent corpora (following Coxhead 2000) and, if their expected coverages hold in new corpora, such word lists are assumed to be truly representative.
Few studies, typically those early ones or those using a vast corpus, used just one word-selection criterion (typically, frequency or keyness) (Mudraya 2006), while most of the studies employed a combined approach by using several of the methods -most often, following Coxhead's method (2000) (the first three steps above). None of the studies applied all the six methods combined.
As can be seen, the field of producing and investigating word lists developed as part of applied linguistics by Anglo-Saxon scholars, who, despite the fact that there are now many authors in it who are not Anglo-Saxon, still dominate it to a large extent. Most of the word lists are in fact English word lists. The creation of word lists is guided by pragmatic principles and the field remains atheoretical. So far, in the literature, there have not been any proposals to introduce a theory which would support the field.

LSP dictionaries
As Bowker (2010) explains, LSP dictionaries belong to specialised dictionaries, i.e. dictionaries which treat specialised fields. They are also seen as a type of restricted dictionaries (Burkhanov 1998), where the term restricted does not imply their smaller size but reflects the fact that they focus on specific and precise vocabulary (Mihindou 2004). LSP dictionaries exist in many fields of knowledge (Landau 2001), while developing the metalexicography related to them is in full swing (Fuertes-Olivera and Arribas-Baño 2008). While the Anglo-Saxon strand in lexicography is mostly atheoretical (as was the case with the field of compiling word lists), the strand influenced by German and Nordic scholars advocates for developing lexicographical theories for guiding dictionary research and compilation (Fuertes-Olivera et al. 2013). As mentioned earlier, what is taken into account in the process of compiling any dictionary, including a specialised one, are the different types of users, user situations and user needs related to them, in line with the theory of lexicographic functions (Bergenholtz and Tarp 1995;Tarp 2008). This is one of the lexicographic theories which is very influential in pedagogical lexicography, including specialised pedagogical lexicography.
As for users, specialised dictionaries have a more limited target audience than general dictionaries. According to Bergenholtz and Tarp (1995), their user type is decided based on the user's mother tongue, level of encyclopedic knowledge, and native- and foreign-language competence. Applying these criteria, the authors identify four major user types for specialised dictionaries: experts with a high level of encyclopedic and foreign language competence, experts with a high level of encyclopedic competence and a low level of foreign language competence, laypersons with a low level of encyclopedic competence and foreign language competence, and laypersons with a low level of encyclopedic competence and a high level of foreign language competence. Some more types are added by Fuertes-Olivera and Arribas-Baño (2008), who identify the following: experts from the specific field, semi-experts, experts from related or other fields, interested laypeople who would like to read some books or periodicals from the field, LSP students, translators, interpreters, etc.
Tarp (2010) argues that there are many situations in which learners can benefit from specialised dictionaries: cognitive situations include systematic study of the specialised subject field and of problems related to the translation of specialised texts; communicative situations include reception and production of specialised texts in the mother tongue and in a foreign language, as well as translation of specialised texts; practical situations refer to various operative and interpretive situations.
These user types have different needs in these different types of situations. The needs can be primary or function-related needs, which are the needs for information necessary to gain knowledge or solve a problem through using a dictionary, or they can be secondary or usage-related needs, which include the need to know something about a specific dictionary and to know how to use it (Tarp 2008).
There are different classifications of LSP dictionaries, but we will briefly mention two which are relevant for our paper. Based on their size, there are two basic types: maximising LSP dictionaries, which attempt to cover as much of a field's terminology as possible, and minimising LSP dictionaries, in which a portion of the terminology is covered, typically only the most frequent items (Bergenholtz and Tarp 1995). Another possible classification recognises LSP dictionaries containing field-specific terms only, as opposed to general words, and hybrid LSP dictionaries, which combine both specialist and general words (Campoy Cubillo 2002; Bowker 2010).
LSP dictionaries for learners are a subtype of specialised dictionaries which are intended to assist users in learning about the terms and concepts used in a specific field, in one or more languages (Bowker 2010). Their purpose is to serve as auxiliary tools in the process of teaching and learning the language for specific purposes (Fuertes-Olivera and Arribas-Baño 2008). According to the mentioned theory of lexicographic functions, they are utility tools which assist learners in the process of learning LSP.

Corpus-based headword selection procedures for LSP dictionaries
The process of headword selection is central in learner's lexicography (Xue and Tarp 2018), given that "dictionaries only function if they contain appropriate data" (Nielsen 2018: 79). In this process, the three main questions that need to be posed refer to the size of the headword list, the criteria and principles guiding the selection of headwords, and the empirical basis that their selection relies on (Tarp 2008). Tarp (2008) further suggests that headwords can be selected based on three sources, i.e. by means of introspection, using available descriptions in various publications (dictionaries, textbooks, etc.), and based on corpora. Building corpora as part of the preparatory stage for headword selection for LSP dictionaries is significant (Nkomo 2008: 105). Having compared corpus-based and intuition-based approaches, Verlinde and Selva (2001: 597) argue that it is corpus-based lexicography that gives the "strong and necessary empirical evidence to the lexicographer's personal intuition", but they also note that intuition still remains helpful in filling in the gaps in cases when corpora are not balanced. As said earlier, Bowker (2010: 166) argues that the use of corpus linguistic methods has been rather slow to take hold in the creation of specialised dictionaries, on account of the fact that not many specialised corpora are available. Specialised corpora used for making dictionaries also tend to be relatively small, especially in comparison with the mega-corpora used for producing general dictionaries. Bowker (2010) cites the specialised dictionary Dictionnaire d'apprentissage du français des affaires (DAFA) as a commendable example, given that it was based on a corpus of 25 million words. Given the latest technological developments, the compilation of such relatively large corpora has recently become much less of an issue.
The theory of lexicographic functions (Bergenholtz and Tarp 1995; Tarp 2008) suggests that headwords should be selected according to users' needs. When selecting headwords based on corpora, this means in practice, among other things, that it is the user needs which govern the selection of the texts which will enter such corpora. To illustrate how this can work in practice, we will briefly note how headwords for a Spanish accounting dictionary were selected (Fuertes-Olivera et al. 2013). Following the function theory and the principle of relevance, the authors created a list of around 6,000 accounting texts, based on which three experts in accounting and one lexicographer derived a stock of around 3,000 terms. As for the corpus-linguistic methods applied in processing the corpus, the authors calculated the word frequencies in their corpus to inform their decisions on which terms to include in their specialised dictionary. They also used the Internet as a corpus and performed Google searches using particular word strings to extract an additional 1,000 terms. Finally, 2,000 more terms were added through intensive reading of basic accounting texts. Such a hybrid approach was applied so as to ensure that the principle of relevance is adhered to. The authors add that future updates of the term stock will be done by additionally analysing the log-files related to the online use of this dictionary (Fuertes-Olivera et al. 2013).
Other authors, too, mention applying the principle of frequency as one of the key steps taken in the process of selecting headwords for dictionaries (cf. Campoy Cubillo 2002; Hanks 2012; Rundell and Kilgarriff 2011). This criterion provides "solid empirical evidence for the occurrence of a word in actual language" (Xue and Tarp 2018). At the same time, it has also been argued that frequency may be misleading in some specialised fields which are updated constantly, such as accounting (Fuertes-Olivera and Nielsen 2011). Rundell and Kilgarriff (2011) rightly mention the fact that frequency is not a good selection criterion for extracting multiword items as candidates for headword lists. Likewise, Nielsen (2018: 81-82) suggests that frequency alone cannot guarantee that all relevant words will be selected, but that it should be used as a basis for the further selection process.
In some LSP dictionary compilation projects, similar to the methodology used in the production of LSP word lists, frequency is combined with additional corpus-linguistic methods -thus, for instance, Khumalo (2015) and Đurović (2021) also use keyness; however, they do not ensure that the corpus contains equal shares of various subdisciplines of the field which it represents and, consequently, they do not apply the range filter. Some LSP dictionary compilers additionally use a more innovative, pattern-based approach (Kruse and Heid 2021).
Frequency and relevance are also suggested as the two major criteria by Xue and Tarp (2018). However, Tarp (2008) warns against the exalted status given to corpora and corpus-linguistic methods by certain lexicographers, arguing that corpora, however large they may be, can still be unrepresentative, and that the criteria of relevance and systematicity also need to be taken into account. What may be deduced from these various accounts is that corpora play an important role when selecting headwords for specialised dictionaries, and that word frequencies in a corpus can significantly inform the process of headword selection.

Comparison of corpus-based headword selection procedures for LSP word lists and LSP dictionaries
As we have seen, headword selection procedures for both LSP word lists and LSP dictionaries are guided by the needs of their users. The chief users of LSP word lists are LSP learners. LSP word lists are also used by LSP teachers and LSP material developers but, again, to the benefit of their end users -LSP learners. When it comes to the users of LSP dictionaries, as noted earlier, LSP learners make up an important category among them, however, many more categories of users are possible as well (e.g. translators, semi-experts, experts from other fields, etc.). This basic distinction in the types of users of the two products -LSP word lists and LSP dictionaries, has implications for how headwords are selected as part of their compilation procedures. When comparing corpus-based headword selection procedures for LSP word lists and LSP dictionaries, we can see that the former are compiled using corpus-linguistics methods almost exclusively, whereas a greater complexity of methods is used for the latter. A significant part of this difference may be explained by the respective homogeneity and heterogeneity of the end users of the two products, as explained above.
The corpora from which LSP word lists are derived are rather large and typically well-structured and balanced, as we have seen. The details regarding their make-up are usually presented very precisely and transparently in the scholarly papers on LSP word lists, as well as given central prominence in them. On the other hand, the descriptions of corpora used for developing headword lists for LSP dictionaries are usually not presented in such details and, typically, in the papers describing these projects relatively little space is devoted to the process of term extraction. In addition, equal representation of various subfields is rarely ensured in them. LSP word lists compilers argue that this is a good practice which allows that the frequencies of the terms obtained to reflect all subfields equally, and we tend to agree here. An implication from this comparison is that LSP lexicographers might invest this type of effort into compiling corpora from which they intend to extract terms. Moreover, given that many useful and balanced corpora have already been produced as part of LSP word-list research, some of these could be used for making LSP dictionaries as well.
Both compilers of LSP word lists and compilers LSP dictionaries use frequency as a major criterion for deciding which words should enter their products. In the process of producing LSP word lists, compilers typically either follow the cut-off points used in seminal research (such as Coxhead 2000) or, more frequently nowadays, the cut-off points are governed by the coverage achieved with the obtained word list, a coverage that allows for a certain threshold of reading or listening comprehension to be met.
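A coverage-driven cut-off of this kind can be sketched as follows: words are added in descending order of frequency until the list reaches a target share of the running words. This is a deliberately simplified illustration; the function name, the `target` parameter and the toy tokens are hypothetical, and actual projects usually add the specialised list on top of general high-frequency lists and take the coverage target from comprehension research rather than from a toy value.

```python
from collections import Counter


def smallest_list_for_coverage(tokens, target):
    """Return the shortest frequency-ranked word list whose coverage of
    `tokens` reaches `target` (a share between 0 and 1), with that coverage."""
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return [], 0.0

    chosen = []
    covered = 0
    # Most frequent words first; ties broken alphabetically.
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered / total >= target:
            break
        chosen.append(word)
        covered += freq
    return chosen, covered / total


# Toy data, invented for illustration: 10 running words, 4 word types.
tokens = "a a a a b b b c c d".split()
word_list, reached = smallest_list_for_coverage(tokens, target=0.7)
print(word_list, reached)
```

With this toy input the loop stops after two words, since the two most frequent types already account for 7 of the 10 running words.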
As for LSP dictionaries, we have not encountered detailed arguments in the literature for the thresholds chosen. The size of an LSP dictionary should, in theory, be governed by user needs, even though LSP dictionary projects are always subject to practical and financial constraints (Tarp 2008). However, no method of quantifying these needs has been developed so far (and one might never be, given the complexities involved).
Research and projects involving LSP dictionaries frequently mention that frequency cannot be the sole criterion for selecting headwords, usually citing relevance as another major criterion to be applied, which, however, is much more difficult to define and employ. Likewise, as we have seen, simple frequency is never applied as the sole criterion in LSP word-list research. Additional criteria are applied as well, although these are also based on frequency to some extent. Thus, an important criterion for selecting headwords for LSP word lists is that of specialised occurrence, as presented earlier, applied by excluding words which are highly frequent in general reference corpora (typically the 2,000 to 3,000 most frequent words in the case of English). Academic words can also be excluded, to ensure more technicality. Another criterion is that of range: applying this filter ensures that a word appears in a sufficient number of a discipline's subfields, so that it is equally valuable across that discipline, and not more valuable for some subspecialisations and less valuable for others. To apply this criterion, however, one needs a corpus with equal subsections from the various subfields, as argued above. If the required structure of the corpus is not achieved, various dispersion thresholds can be applied. These criteria for guiding term extraction are rarely used when compiling headword lists for LSP dictionaries.
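The filters discussed in this paragraph (specialised occurrence, minimum frequency, range and dispersion) can be combined as in the following sketch. It is an illustration under stated assumptions, not the procedure of any particular study: the subcorpora are assumed to be tokenised and of roughly equal size, dispersion is computed as Juilland's D on raw subsection frequencies, and the function names and all threshold parameters are hypothetical.

```python
from collections import Counter
from math import sqrt


def juilland_d(part_freqs):
    """Juilland's D for one word across n equal-sized corpus parts."""
    n = len(part_freqs)
    if n < 2:
        raise ValueError("need at least two corpus parts")
    mean = sum(part_freqs) / n
    if mean == 0:
        return 0.0  # the word never occurs, so treat it as undispersed
    # Population standard deviation of the per-part frequencies.
    sd = sqrt(sum((f - mean) ** 2 for f in part_freqs) / n)
    return 1 - (sd / mean) / sqrt(n - 1)


def select_headwords(subcorpora, general_words, min_freq, min_range, min_d):
    """Filter candidate headwords from a dict {subfield: list of tokens}.

    general_words: set of high-frequency general words to exclude
    (the specialised-occurrence filter). The three thresholds are
    illustrative parameters chosen by the compiler.
    """
    parts = [Counter(tokens) for tokens in subcorpora.values()]
    total = Counter()
    for counts in parts:
        total.update(counts)
    selected = []
    for word, freq in total.items():
        if word in general_words or freq < min_freq:
            continue
        # Range: number of subfields in which the word occurs at all.
        word_range = sum(1 for counts in parts if counts[word] > 0)
        if word_range < min_range:
            continue
        if juilland_d([counts[word] for counts in parts]) < min_d:
            continue
        selected.append(word)
    return sorted(selected)
```

Because D approaches 1 for evenly spread words and 0 for words concentrated in a single subsection, raising `min_d` removes items that are frequent overall but confined to one subfield.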
One more criterion frequently mentioned when compiling LSP word lists is that of keyness, which is relatively easy to apply as no special make-up of the corpus is needed for it. As explained earlier, the frequency of the words in a specialised corpus is compared against that featured in a reference corpus and so the words found to be much more frequent in that specialised corpus are identified as terms. As we have seen, this criterion is sometimes used when extracting terms for LSP dictionaries as well.
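As a rough illustration of the comparison just described, the sketch below computes keyness as a simple ratio of relative frequencies in the specialised and reference corpora. This is an assumption made for brevity: published studies often use statistical measures such as log-likelihood instead. The function name `key_words` and the `min_ratio` parameter are hypothetical.

```python
from collections import Counter


def key_words(specialised_tokens, reference_tokens, min_ratio):
    """Return {word: ratio} for words over-represented in the specialised corpus.

    The ratio is the word's relative frequency in the specialised corpus
    divided by its relative frequency in the reference corpus.
    """
    spec = Counter(specialised_tokens)
    ref = Counter(reference_tokens)
    spec_total = sum(spec.values())
    ref_total = sum(ref.values())
    result = {}
    for word, freq in spec.items():
        spec_rel = freq / spec_total
        ref_rel = ref[word] / ref_total if ref_total else 0.0
        # Words absent from the reference corpus get an infinite ratio.
        ratio = spec_rel / ref_rel if ref_rel > 0 else float("inf")
        if ratio >= min_ratio:
            result[word] = ratio
    return result
```

A `min_ratio` of 1.5 would keep only words at least 50% more frequent in the specialised corpus, relative to corpus size, than in the reference corpus. Treating words that are absent from the reference corpus as maximally key is a design choice, and a smoothed count could be used in its place.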
Very often, the additional criteria mentioned above are used in combination when compiling LSP word lists. LSP word-list compilers argue that applying them, in addition to simple frequency, ensures that the headwords selected are indeed relevant. The notion of relevance is more difficult to define for a product such as an LSP dictionary, given its rather heterogeneous target audience; however, applying at least some of the aforementioned filters could help facilitate and automate that process.
The mentioned filters used for obtaining LSP word lists have been found deficient, however, when it comes to extracting multi-word units and collocations and, in fact, none of the word lists presented here contain such items. This is a major drawback to LSP word lists in general and a limitation that should be borne in mind if one were to apply some of the said methods for selecting preliminary headword lists for LSP dictionaries. Still, the ease with which most of the presented filters can be applied certainly recommends them for use in combination with other methods.
Once an LSP word list is obtained via corpus linguistic methods, the work of the LSP word list compiler is either completed or almost completed in most cases, whereas much more work remains for a lexicographer compiling a headword list for their LSP dictionary.
The principle of systematicity is hardly ever applied to LSP word lists obtained via corpus-linguistic methods. For instance, the Science List (Coxhead and Hirsh 2007) contains the names of some common chemical elements (such as oxygen and potassium), while the names of other common elements (such as sulfur) are not included; it is debatable whether the word sulfur is less useful to a science student learning English than the word potassium. Moreover, the Science List includes the word chloride but not chlorine, the name of the chemical element whose negatively charged ionic form it denotes. Thus, in general, word-list makers rely, perhaps too much, on automated procedures and avoid discussing these types of issues. By contrast, in LSP dictionary research and projects, systematicity is one of the central principles guiding the creation of headword lists. Observing this principle when developing LSP word lists, we argue, could improve them, as illogicalities of the types exemplified above typically stem from the imperfections of the corpus (in this case, the over-representation of texts mentioning particular chemical elements) and ought to be corrected when noticed. We would argue that, however large, well-structured and balanced a corpus may be, it will always suffer from some imperfections and cannot be trusted entirely.
When finalised, LSP word lists are sometimes subjected to validation in additional corpora (not the ones they were derived from), to test how much coverage they would have in new texts. Validation, although effort- and time-consuming, is a commendable step, in our opinion. The frequencies of the items on preliminary, candidate headword lists for LSP dictionaries could also be checked in additional specialised corpora, so as to, perhaps, rule out candidate terms whose frequencies in the validation corpora are significantly lower than in the first corpus.
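Both validation checks mentioned here can be sketched in a few lines. The example assumes tokenised corpora whose tokens have already been mapped to headwords; the function names and the `min_retained` parameter (the share of the original relative frequency a candidate must keep in the validation corpus) are hypothetical.

```python
from collections import Counter


def coverage(word_list, tokens):
    """Proportion of running words in tokens that appear on word_list."""
    if not tokens:
        return 0.0
    words = set(word_list)
    return sum(1 for token in tokens if token in words) / len(tokens)


def unstable_candidates(candidates, source_tokens, validation_tokens, min_retained):
    """Flag candidates whose relative frequency drops sharply on validation.

    A candidate is flagged when its relative frequency in the validation
    corpus is below min_retained times its relative frequency in the
    source corpus.
    """
    src = Counter(source_tokens)
    val = Counter(validation_tokens)
    src_total = len(source_tokens)
    val_total = len(validation_tokens)
    flagged = []
    for word in candidates:
        src_rel = src[word] / src_total if src_total else 0.0
        val_rel = val[word] / val_total if val_total else 0.0
        if val_rel < src_rel * min_retained:
            flagged.append(word)
    return flagged
```

The flagged items are only candidates for removal. Whether a term that is rare in the validation corpus should actually be dropped remains a lexicographic decision.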
Experts from the specialist fields are almost never involved in developing LSP word lists, whereas they are always involved in compiling LSP dictionaries. Expert consultation is usually skipped in the making of modern word lists on the grounds that several automatic corpus-linguistic filters have already been applied. Although it is a demanding step, involving experts in the creation of any LSP product is advisable.

Conclusion
In this paper, we presented most modern LSP word lists and commented on how they were created. We also discussed corpus-based headword selection procedures for LSP dictionaries. A number of similarities and differences were found between the two selection procedures, and it was noted that each could, in some ways, benefit from being informed by the other. On the one hand, more effort could be invested in the creation of LSP corpora, in terms of their size, make-up and balance, and more corpus-linguistic selection procedures could be applied when compiling headword lists for LSP dictionaries than is currently the case, to facilitate the process. More transparency and precision when reporting on the corpora used and the corpus-linguistic methods applied for compiling headword lists for LSP dictionaries are also advised. The lists obtained should also be validated in additional corpora, when possible.
On the other hand, the creation of LSP word lists could be improved by applying additional non-corpus linguistic methods in their compilation, which is necessary to eliminate the illogicalities stemming from imperfectly balanced corpora, as well as to add the necessary multi-word units to them.
Another observation that emerges from the comparison made in this paper is that the compilation and study of word lists remain atheoretical, while at least one strand of LSP dictionary research has strong theoretical foundations. As we conclude this paper, we ask the reader and ourselves whether, perhaps, the moment has arrived for the field of word-list compilation and research to be supported by a theory similar to the theory of lexicographic functions.

1. A word family includes the headword with all its inflected and derived forms (for instance, suggest, suggests, suggested, suggesting, suggestion, suggestions).

- frequency of 28.57 per 1 million words
- ratio of at least 1.5 (at least 50% higher frequency in the academic corpus than in a non-academic corpus)
- occurrence of 20% of the expected frequency in at least half the subsections
- dispersion of at least 0.5 (Juilland's D)
- no lemma should occur more than 3 times the expected frequency in more than any 3 out of 21 subsections
- special meaning criterion checked via 2 medical dictionaries

Moini and Islamizadeh

Hard science spoken word list; 1,595 word families; 6.5 million words of spoken language from 12 disciplines
- occurrence in at least half the disciplines and both subsections of the corpus
- frequency of at least 175 in the corpus
- dispersion (DP value below 0.6)

Tongpoon-Patanasorn

- keyness (1.5 times more frequent in the economics corpus than in other corpora)
- degree of dispersion over 0.25
- minimum frequency of 10 in the corpus