Corpus Linguistics Methods for Building ESP Word Lists, Glossaries and Dictionaries on the Example of a Marine Engineering Word List

Beyond the general English knowledge required for nearly any occupation today, vocabulary research has increasingly sought to keep pace with the growing range of Englishes for Specific Purposes. Owing to the possibilities offered by contemporary software solutions, corpus linguistics has been able to answer some specific questions on the vocabulary demand of texts, as well as to provide concentrated vocabulary lists according to their frequency in real-life texts (corpora). Aiming to provide our target learners of English for marine engineering purposes with a practical vocabulary tool to help them reach an adequate reading comprehension text coverage of 95%, we developed a marine engineering word list of 337 word families, accompanied by a list of 73 transparent compounds, which were derived from a corpus of marine engineering instruction books with 1,769,821 running words. The list can be studied in university classes or training courses for seafarers, through various types of vocabulary exercises, but it might also assist in building technical glossaries and dictionaries. The methodology used and procedures applied in the paper should hopefully be of assistance to other authors and language instructors working on other areas of technical English.


Introduction
Although certain traces of language examination conducted upon collections of written texts can be found even from the age of antiquity, the notion of corpus linguistics in the modern sense has been related to the appearance of electronic corpora and other computing resources beginning in the 1970s (Nation 2016). Interestingly, the accelerated rise and application of this type of linguistic research has overlapped with the boosting of lexicography, all as a consequence of renewed interest in vocabulary and the new possibilities afforded by information technologies. The overall impact on language teachers has been two-fold. On the one hand, the more and more technical and demanding areas of English for Specific Purposes (ESP), accompanied by the massive development of information technologies, have imposed significant new challenges in terms of teacher competencies and course design. At the same time, new areas of research and methodologies have offered enormous opportunities in terms of the detailed computational analysis of abundant authentic material, as well as the testing and comparison of the obtained results, leading to more effective and learner-oriented language course material.

Theoretical Background
The main ideas behind this renewed interest in vocabulary, perceived now as being in the "limelight of foreign language teaching and learning" (Gao and Liu 2020), are that "vocabulary is central to understanding and using a language at any level" (Hirsch and Coxhead 2009: 5), but that it also represents one of the major problems and most complex difficulties of any practical foreign language programme (Twaddel 1973: 61). Modern methodologies have provided researchers with the opportunity to "measure" vocabulary types and loads, as well as learning objectives. The central question to start with, therefore, concerns the amount of vocabulary needed for adequate reading comprehension.

Reading comprehension
http://lexikos.journals.ac.za; https://doi.org/10.5788/31-1-1647 (Article)
The quantitative vocabulary analysis of real-life texts (corpora) aims to answer the question of the number of words needed to reach adequate reading comprehension, typically set at the levels of 95% and 98%. A "word" as a unit of measurement in this type of research most frequently stands for a word family, comprising the headword with all its derivatives and inflected forms. For example, read, reads, reading, reader, readable, readability, and so on make up one single word family, taking into account the learning burden, i.e. the amount of morphological knowledge the learners are expected to have (Nation 2016: 9). The threshold of 98% was determined by Hu and Nation (2000) and also recommended by Carver (1994) and Kurnia (2003), who claim that 98% of known vocabulary (coverage of the text), or one unknown word out of 50, would be acceptable in order to understand a text adequately. Certain other authors, such as Laufer (1989), Laufer and Ravenhorst-Kalovski (2010) and Van Zeeland and Schmitt (2012), have advocated 95% as sufficient coverage of known words in a text in order to understand it correctly without additional assets and aids. Generally, authors have been guided by the recommendation summarized by Laufer and Ravenhorst-Kalovski (2010), by which optimal reading comprehension would anticipate the ideal 98% lexical coverage of a text, which is usually achieved at a level of about 8,000 words, while the minimal adequate comprehension would be that of 95%, which we would expect to find at the level of 4,000-5,000 known word families, including proper nouns. The remaining 2-5% are expected to be guessed from the relevant context.
Here, the type of vocabulary should be carefully considered, since, as has been shown by numerous studies involving ESPs, the desired thresholds in professional texts are barely reachable, or sometimes not reachable at all, even with all the available general and other lists of English words; this, then, is the reason for the constant development of new specialized word lists (WL).
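The notion of lexical coverage described above can be illustrated with a minimal sketch. It assumes a tiny, hand-made table of word families (the `FAMILIES` dictionary and the sample sentence are hypothetical, not taken from the corpus) and a very simple tokenizer; a real analysis would rely on full word-family lists and a profiling programme.

```python
import re

# Hypothetical mini word-family table: headword -> family members.
# A real study would load complete lists (e.g. the BNC/COCA families).
FAMILIES = {
    "the": {"the"},
    "read": {"read", "reads", "reading", "reader", "readable", "readability"},
    "manual": {"manual", "manuals"},
    "before": {"before"},
    "open": {"open", "opens", "opened", "opening"},
}

def tokenize(text):
    """Lowercase the text and keep alphabetic strings only (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())

def coverage(tokens, families):
    """Percentage of running words (tokens) that belong to a known word family."""
    if not tokens:
        return 0.0
    known_forms = set().union(*families.values())
    known = sum(1 for token in tokens if token in known_forms)
    return 100.0 * known / len(tokens)

sample = "The engineer reads the manual before opening the valve."
tokens = tokenize(sample)
print(f"{coverage(tokens, FAMILIES):.2f}% of {len(tokens)} running words are known")
```

In this toy case seven of the nine running words are covered (about 78%), which is far below the 95% threshold; the two unknown words, engineer and valve, are exactly the kind of items a specialized list is meant to supply.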

Word lists
Considering that both native and non-native speakers acquire vocabulary in the order of its frequency (Nation 2006), word lists have been used in language teaching for a long time, both as stand-alone lists and as a part of textbooks and other teaching materials (Folse 2004; Nation 2006). General English (GE) lists provide the learners with the English words that are most frequently found in a variety of texts. Contemporary research into "specialized or technical vocabulary has focused primarily on producing word list of technical vocabulary in professional fields of expertise in English for Specific Purposes" (Coxhead and Demecheleer 2018: 84). Since the focus of our attention is a highly technical branch of ESP, we are following the general tendency and recommendation of upgrading the first 2,000 or 3,000 words with specialized vocabulary lists, which aim to reach the adequate reading comprehension threshold in the most efficient way. For our specific research, we used the findings of previous research in which the target corpus was tested against the available general and relevant specialized word lists, both for their coverage ratio and in order to determine the lexical profile of the texts. These findings are explained in more detail in Section 4.1 Previous research findings.

English (vocabulary) for marine engineering purposes
Aiming to respond to the vocabulary needs of our target English language learners, that is, undergraduate students and trainees on Marine Engineering courses, we were led by their most practical needs in terms of language skills. Although English has become the lingua franca of almost all areas of international activity, this has been especially true in terms of maritime affairs, with English formally operating as its official language, as of the establishment of the International Maritime Organization in London in 1948. Consequently, in targeting our research objectives, we followed expert advice and extensive teaching experience in the area. Even more importantly, we also followed the official requirements and recommendations made by the IMO's International Convention on the Standards of Training, Certification and Watchkeeping for Seafarers (STCW, Part 2.2) and the Model Course 3.17 - Maritime English, notably the part on Specialized Maritime English dedicated to marine engineering courses of English. Apart from general communication skills in terms of using internal communication systems, the majority of the language skills requirements (about 90% of the anticipated course and self-study hours) are dedicated to "Adequate knowledge of the English language to use engineering publications" (IMO Model Course 3.17 2015: 153). Led by these clear instructions, the area of our interest has been the reading comprehension of marine engineering publications, specifically instruction books.

The Corpus
Instruction books have become an indispensable "tool" for the everyday onboard activities of marine engineers, being used in contexts ranging from familiarization with the ship's systems and machinery to regular maintenance activities and repairs. Therefore, an adequate reading comprehension of these publications is of utmost importance for the majority of their scope of activities, as well as for the entire shipping industry. Following certain expert advice, primarily that of Chief Engineers, we sought to create a relevant selection of engineering instruction material for a tanker ship, a container ship, a cruise ship and an off-shore vessel. Additional material was added for the purpose of diversity and to cover up-to-date technologies, for example, in regard to propulsion, where we added instruction books for dual propulsion (both fuel oil and gas) and electric-driven engines. These additional instruction books and manuals were added for tanker ships, which are deemed the most numerous in worldwide fleets and are also very similar to some other vessel types, e.g. container ships, especially as regards propulsion. Other material included technical manuals for various essential onboard machinery and systems, as presented in Table 1. In order to avoid any possibility of copyright or commerciality issues, we will not state the names of the manufacturers or the vessels. The final corpus material comprises thousands of pages of electronic material of items varying in length, converted to accommodate the software requirements. "The painstaking process" (Nation 2016: 224) of the additional cleaning of material (in relation to tables, references, brands, typos and the like) was applied to the best of the author's abilities, considering the huge amount of the material, originally found mostly in scanned formats. The prepared Corpus of Ship Instruction Books (CSIB) was finalized with 1,769,821 running words (tokens).
Bearing in mind that we have a very technical and discipline-specific genre in question, we may say that our corpus is of representative size and content, so as to guarantee the validity of results and conclusions produced.

Methodology
The initial method we used for our research is called Lexical Frequency Profiling (Laufer and Nation 1995). It provides authors and language teachers with the opportunity to analyse corpora in terms of vocabulary types and quantities and to test the coverage of available word lists, as well as to create new ones. The software that has recently been found most useful and convenient for this purpose is AntWordProfiler 1.4.0w, developed by Laurence Anthony (2014), as an upgraded version of the previously used RANGE programme (Nation and Heatley 1994). As a starting point, we used the results of previous research on the same corpus (Đurović, Vuković Stamatović and Vukičević 2021) in terms of coverage by the existing general service and engineering word lists, aiming to prove the lexical demand of the corpus, as well as the need to build a specific marine engineering word list (MEWL). For the latter purpose, the same programme (AntWordProfiler 1.4.0w) was used for several reasons. First, it has proved itself to be one of the most appropriate programmes for building specialized vocabulary lists (or glossaries or dictionaries) since, in contrast to some other programmes used for building vocabulary lists, it can exclude from further analysis the most frequent general vocabulary, which is assumed to be known by LSP learners. Another reason is the value of having comparable results and being able to make reference to relevant findings and word lists built upon the same methodology. Finally, Anthony's software is readily available, quite simple to use for any language and offered free of charge.
In setting the cut-off point in the frequency count, we were guided by the goal of reaching the desired level of reading comprehension (of ship instruction books), which anticipated a minimum of 95% of known vocabulary. Provided that the list remains of a size that represents an attainable task for language learners, given the time available for learning, reaching 95% corpus coverage (with the most frequent GE words and the MEWL) would at the same time serve as a positive evaluation of the list (Dang and Webb 2016: 133).
For formatting the lists into headwords only, lemmas or expanding them into all-family-members form, we used the Familizer + Lemmatizer programme (Cobb 2018). For corpus preparation and converting it into "plain text" format, as required by the software, we used AntFileConverter (Anthony 2017).
Additionally, for detecting the concordances of a certain word within the corpus, the AntConc software (Anthony 2019) can be used. Owing to software programmes of this kind, each word from our list can be checked for concordances and n-grams to be included in a glossary or dictionary.
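To make the idea of a concordance more concrete, the following is a minimal keyword-in-context (KWIC) sketch. It is not how AntConc works internally; the function name `kwic`, the window width and the second sample sentence are assumptions introduced for illustration only.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of keyword."""
    tokens = re.findall(r"[A-Za-z-]+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

# The first sentence is quoted in this paper; the second is a made-up example.
sample = ("Adjust the oil pressure to a suitable level on the bypass valve. "
          "Close the valve before starting the pump.")

for left, word, right in kwic(sample, "valve", width=2):
    print(f"{left:>15} [{word}] {right}")
```

Sorting such lines by the word immediately to the left of the keyword is one simple way to bring recurring noun phrases (for instance, bypass valve) to the surface as glossary candidates.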

Previous research findings
In examining the target corpus, we first analysed its vocabulary load and types in terms of reaching the desired reading comprehension. Aiming to provide precise answers as to the quantity and types of vocabulary needed for the purpose and following the methodologies used by recognized authors in the field, these questions were the focus of our previous research (Đurović, Vuković Stamatović and Vukičević 2021). As such, we here briefly present the lexical profiling methodology and the results, since they clearly point to the need for a specialized marine engineering word list and provide solid justification for our current research. In particular, in testing our corpus against the General Service List (West 1953) and the Academic Word List (Coxhead 2000), which are usually applied together in this kind of research, the cumulative coverage amounted to 79.46% (71.39% + 8.07%, respectively), which is lower than the average of 86.1% found for academic texts (Nation 2000: 27). Considering general English only, the coverage level of 71.39% is significantly below the usual coverage of 78-98%, as reported for various types of written texts (e.g. Nation and Waring 1997). This generally means that, knowing only the first 2,000 English words, even together with the most common academic words, about every fifth word of the text remains unknown (20.54%), which would make both reading and understanding the text very difficult.
When calculating the total amount of general English words needed for adequate comprehension, using Nation's more contemporary word lists extracted from the huge corpora of the British National Corpus and the Corpus of Contemporary American English (BNC/COCA, available at https://www.wgtn.ac.nz/lals/resources/paul-nations-resources/vocabulary-lists), a coverage of 95% was reached only with the 12,000 most frequent English words, whilst 98% coverage is not reachable even with all the available 25,000 English words, including the additional four lists of proper nouns, abbreviations, transparent compounds and marginal words. This means that a clear understanding of this kind of text is almost unattainable, even for a native speaker, without an engineering background.
In taking into consideration the available word lists from adjacent fields and genres, the only two available lists which we found to be relevant and appropriate to this kind of research were Ward's Basic Engineering English Word List (BEEWL) (Ward 2009), and Hsu's Engineering English Word List (EEWL) (Hsu 2014). Both lists were extracted from undergraduate textbooks appropriate to various engineering fields. In addition, both of them showed lower coverage in our target corpus (13.53% compared to 16.4%, and 10.11% compared to 14.3%, respectively). These results are also understandable considering the very technical nature of our target corpus, when compared to that of textbooks, which are more narrative in nature.
Bearing all these findings in mind, we sought to create a specialized marine engineering word list, aiming to reach the adequate reading comprehension level for marine engineering instruction books. Taking into consideration that this kind of genre is abundant in both diagrams and tables, as well as supplemental explanations and abbreviations accompanying these materials, we opted to place the adequate reading comprehension threshold at 95%.

The Marine Engineering Word List
Carefully examining the best way to apply the methodology stated above in our research so as to be relevant, efficient and justified, we took several points into account. Firstly, our target learners are either university-level students or active seafarers undertaking professional training. Furthermore, bearing in mind the fact that English is learned in many countries from an early age, we deemed that it would be reasonable to expect them to be competent in reading and understanding at least 3,000 basic English words (BNC/COCA), especially considering that this refers to receptive knowledge and not necessarily to productive language skills. This follows certain recent trends in expanding the high-frequency list from 2,000 (West 1953; Nation 2001) to the first 3,000 English words (Schmitt and Schmitt 2014; Nation 2016). Additionally, adequate English proficiency is mostly required and tested by shipping companies during their recruitment procedures. Finally, since in many countries officers' training courses do not need to be organized through university courses, but can be organised through any certified training centre, and taking into account the fact that this type of text is more technical than academic (although the AWL list coverage of 8.07% is not to be overlooked), we decided to head directly to the marine engineering word list (MEWL) for obtaining early specialization (Coxhead and Hirsch 2007) in English for Marine Engineering Purposes (EMEP).

Data analysis and results
In determining the frequency threshold for our list, we took into account the final objective of 95% coverage, the relevant lists to test against (and those to exclude from the count) and the reasonable list size. In the end, we opted for a frequency of at least 50, so the final list was formed of 337 headwords (see Addendum 1) with 73 transparent compounds (see Addendum 2). We did not apply the range and dispersion criteria here, since they are more relevant to huge corpora and extracting common vocabulary for various professional areas (e.g. Coxhead and Hirsch 2007) than one specific professional profile (e.g. Coxhead and Demecheleer 2018). In our research, the corpus comprises a variety of technical materials in terms of different machinery and ship's systems, all nearly equally important to the single occupation of marine engineers.
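The selection logic described above (count word families, leave out those already covered by the excluded lists, keep what reaches the frequency threshold) can be sketched as follows. This is an illustration under stated assumptions, not the AntWordProfiler procedure itself: `build_word_list`, the toy tokens, the form-to-headword mapping and the excluded set are all hypothetical.

```python
from collections import Counter

def build_word_list(tokens, form_to_headword, excluded_headwords, min_freq=50):
    """Count tokens per headword, skip excluded headwords, apply a cut-off."""
    counts = Counter()
    for token in tokens:
        # Forms missing from the mapping are treated as their own headword.
        headword = form_to_headword.get(token, token)
        if headword not in excluded_headwords:
            counts[headword] += 1
    return [(head, n) for head, n in counts.most_common() if n >= min_freq]

# Toy data standing in for a tokenized corpus (hypothetical frequencies).
tokens = ["valve"] * 60 + ["valves"] * 5 + ["the"] * 200 + ["gasket"] * 10 + ["piston"] * 50
form_to_headword = {"valves": "valve"}
excluded = {"the"}  # stands in for the general English and additional lists

word_list = build_word_list(tokens, form_to_headword, excluded, min_freq=50)
print(word_list)
```

With these toy counts, valve (65 occurrences across two forms) and piston (50) pass the threshold of 50, whereas gasket (10) does not and the excluded general English item is never counted.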
Aiming to obtain the most precise results possible, in analysing the corpus, we extracted the most frequent abbreviations and marginal words (the latter mostly typos and conversion errors), and added them to Nation's original lists, respectively. This selection process is never easy or perfect, since, for example, some abbreviations are recognizable (e.g. cyl for cylinder, hfo for heavy fuel oil), but sometimes it is difficult to distinguish between an abbreviation and a typo or conversion error. Fortunately, they were not numerous, owing to the carefully conducted cleaning process of the corpus, and, once added to one of Nation's additional lists, cumulatively, they do not affect the final results, apart from enabling the production of a "purer" word list.
In addition, in spite of our efforts to initially remove as many proper names as possible, so as to avoid the commerciality of the data, what remained was also added to Nation's list of proper nouns. Taking into account that all the words considered above are expected to be easily understood and recognized from the context, thus not bearing a learning burden of significance, we excluded them from our further frequency analysis together with the most frequent GE words.
After having a close look at Nation's list of compounds, we deemed it too abundant for our target learners and significantly more difficult to learn compared to the most frequent general English words, and thus we did not exclude it from our initial count. Instead, we decided to make a separate list of transparent compounds (Addendum 2) derived from the corpus itself. In that, we followed the recommendation of having a separate list of this type of vocabulary (Nation 2016: 70). Some of the items overlap with those from Nation's list (e.g. setpoint, standstill) and the rest originate from the corpus (e.g. sootblower, crankthrow). Hyphenated forms are not frequent in this type of text, and they were initially eliminated by dividing their constituents into separate words. In addition, those compounds that could not be easily understood according to their constituents were left as part of the initial list, since they cannot be deemed transparent (e.g. bulkhead). Here we must note that, although presented separately, they should be taken as one with the initial list of words, since they are of equal significance to both vocabulary skills and course designs and materials.
The final results are presented in Table 2. As we can see, the coverage of the first 3,000 most frequent English words (with the additional three lists, as usually excluded first in the analysis) amounts to 87.41%. If we compare this to other findings, we can see that the general English coverage is lower in our target corpus than, for example, in business research articles, where 3,000 words with proper nouns cover as much as 90.84% of the material, in business textbooks, where the coverage is 94.15% (Hsu 2014: 251), in popular science books, where 3,000 words with Nation's four additional lists cover 92.65% (Vuković Stamatović 2020), or in various types of texts where 3,000-word coverage is of the order of 90-93% (Nation 2006). The coverage is lower even compared to the coverage of the first 2,000-word BNC/COCA lists in some types of texts, which ranges from about 79% (in various academic texts) to 89% in school journals and novels (Nation 2006, 2016; Fraser 2007). This clearly points to the complexity of our target vocabulary in terms of its technicality. It also speaks loudly in favour of the need for a specialized word list to accommodate adequate coverage of the target genre. For the purpose of illustration, we can use another option of the programme to see the corpus text coloured as per the word lists (Figure 1), where (in the extract presented), for example, pink stands for the first 3,000 BNC/COCA words, orange for MEWL, blue for the abbreviations, while the words outside the lists are left black.

Fig. 1: Level lists presented in colours
We can also present MEWL with compounds only using the same option, where green marks the MEWL words and transparent compounds are given in red, while the remaining words are left black.

Fig. 2: MEWL and compounds' list presented in colours
This option could also be used for text glossing, which is often helpful in terms of reading technically demanding texts such as ours (Nation 2013). In this way, in addition to the glossaries frequently added to the textbooks, the learning materials can be compiled out of authentic texts (whether adjusted or not) with detected and glossed technical terms, followed by a definition, translation, or similar. The coverage of the "pure" marine engineering word list of 337 word families is 7.41%, i.e. 8.13% adding the list of 73 transparent compounds. The purity of the final list is attained by additional analysis of the list and detecting additional members to add to Nation's original 3,000 families (e.g. purifier, cleanable, abnormal). Moreover, during the conversions, some technical words are not recognized, thus additional attention should be paid to "unclassified" words, e.g. bunker (bunkering), alignment (misalignment), and similar terms, which were subsequently added to the list.
In total, together with the first 3,000 GE words, proper nouns, abbreviations and marginal words, a level of 95.54% was reached, thus fulfilling our goal of attaining the adequate reading comprehension threshold. Taking into consideration that, with general English words alone, the desired level could be attained with no fewer than 12,000 words, our final results perfectly fit the findings of Laufer and Ravenhorst-Kalovski (2010), by which the threshold of 95% is expected to be reached through the use of 4,000-5,000 word families.
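The colour-coding and glossing idea can be approximated in a few lines, with text labels standing in for colours. The sketch below is only an illustration: the level names and the contents of `LEVELS` are toy assumptions and do not reflect the actual membership of the BNC/COCA lists, the MEWL or the compound list.

```python
import re

# Toy level lists in priority order (hypothetical contents).
LEVELS = [
    ("GE-3000", {"check", "the", "and"}),
    ("MEWL", {"valve", "crosshead"}),
    ("COMPOUND", {"sootblower"}),
    ("ABBR", {"hfo", "cyl"}),
]

def tag_tokens(text):
    """Label each token with the first level list that contains it."""
    tagged = []
    for token in re.findall(r"[a-z]+", text.lower()):
        label = next((name for name, words in LEVELS if token in words), "OFF-LIST")
        tagged.append((token, label))
    return tagged

def gloss_candidates(text, labels=("MEWL", "COMPOUND")):
    """Unique tokens, in order of appearance, that would receive a gloss."""
    seen = []
    for token, label in tag_tokens(text):
        if label in labels and token not in seen:
            seen.append(token)
    return seen

sample = "Check the HFO valve and the sootblower flange."
print(tag_tokens(sample))
print(gloss_candidates(sample))
```

A materials writer could then attach a definition or translation to each gloss candidate, while off-list items (here, flange) flag words that may need to be checked manually.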
The remaining share of the unknown vocabulary in our corpus would here form below 5% (4.46%), which means that the adequate reading comprehension threshold is comfortably reached. Adding the introductions, professional knowledge and abundance of schema and diagrams, the understanding of this kind of professional text should therefore no longer present any significant difficulties. However, an aid such as a glossary or dictionary for instruction books would certainly be welcomed by both (future) marine engineers and teachers of this challenging branch of ESP.
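The coverage figures quoted in this section can be cross-checked with simple arithmetic. The sketch below uses only percentages stated in the text; the assumption that they are additive is inferred from the reported total of 95.54%, since Table 2 itself is not reproduced here, and the variable names are introduced for illustration.

```python
# Percentages quoted in the text of this section.
ge_3000_with_additional_lists = 87.41   # first 3,000 GE words plus additional lists
mewl_only = 7.41                        # 337 word families
mewl_with_compounds = 8.13              # MEWL plus 73 transparent compounds

compounds_only = round(mewl_with_compounds - mewl_only, 2)
total = round(ge_3000_with_additional_lists + mewl_with_compounds, 2)
remaining = round(100 - total, 2)

print(f"compounds alone: {compounds_only}%")
print(f"cumulative coverage: {total}%")
print(f"unknown share: {remaining}%")
```

Under this reading, the transparent compounds contribute 0.72 percentage points, the cumulative coverage is 95.54% and the unknown share is 4.46%, which agrees with the figures reported above.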

Glossaries, dictionaries and pedagogical implications
Having a ready-made list of the most frequent vocabulary extracted from marine engineering instruction books, the course material designers can then choose whether to use them as additional material, compile new textbooks and exercises with excerpts and examples taken from authentic instruction books, or else use the list in any other way they find suitable. Given the size of the list, it might form a convenient and achievable task for (a part of) university studies, where about 800 new words are considered a practical learning goal for each two years of study (Dang and Webb 2016: 174). Moreover, additional corpus linguistics methods can be introduced to students as well, and thus they could be required to find concordances, n-grams or full examples for the word use from the corpus itself. In addition, the list would be a more than useful tool for English courses held at seafarer training centres, since the attendees would generally have some experience of using onboard instruction books. Furthermore, any marine engineer would be able to make good use of having such a concentrated list of specialized vocabulary at hand during their onboard service. The list, therefore, could help in creating a monolingual or bilingual glossary, with or without pronunciation transcription.
Maritime English textbooks, like those of many other areas of ESPs, are often supplemented with a glossary, frequently a bilingual one (e.g. Spinčić and Pritchard 2009: 180-242) which covers the textbook syllabus. For this purpose, it would be sensible to use authentic texts, such as excerpts from various instruction books, and the word list might be divided into sub-lists or learning chunks in order to suit the organization of classes and exams. Frequently, textbooks cover more than one semester of studies or are printed in several volumes, which can be covered by a common glossary. Additionally, the list might also be added to existing glossaries to cover combined class material.
Considering stand-alone glossaries, which are frequently found online as useful links provided by various maritime organizations, institutions and companies, we first noticed the scarcity of specific marine engineering glossaries, in contrast to nautical or (general) maritime ones. Most maritime glossaries are generally very simple in form. They most frequently offer just the target term or phrase with a simple context explanation (monolingual) or translation (bilingual), occasionally including additional details such as the phonetic transcription (sometimes offering both British and American variants) or phrases and collocations.
The extraction of the most frequent words is also applicable to a more comprehensive and demanding project, such as a marine engineering dictionary, whether that be creating a new one or adding to an existing one to make it more contemporary and comprehensive. Almost any dictionary has been built upon a corpus (Nation 2016: 176), and using the possibility of measuring the (lemmatized) vocabulary frequency range within it would certainly be an advantage in this regard (De Schryver and Nabirye 2018). Compared to word lists, which are intended to be learned for the purpose of unaided reading, the cut-off point for dictionaries can be moved to lower frequencies, so as to be more inclusive in terms of technical (low-frequency) vocabulary. A further advantage of the proposed methodology for building a technical glossary or dictionary is that, upon compiling a corpus of adequate composition and size, the building of the macrostructure would be solidly founded upon the frequency of the vocabulary beyond the most frequent GE words, the lists of which can be excluded from further analysis.
This kind of dictionary might also be supplemented by illustrations (e.g. Carić 2011) and/or examples from the corpus. This might prove especially useful, since using dictionaries to explore aspects of target words has been shown to be one of the most efficient ways of learning about vocabulary within the language awareness strand of learning strategies (Hirsch and Coxhead 2009;Macalister and Nation 2011). Here we must note that a technical glossary or dictionary would mostly not contain general English meanings and examples. For example, the word pinion would be here referred to as a type of gearwheel. However, when deemed necessary to include both, the notation GE (general English) or ME (marine engineering) would follow (e.g. average (GE) and average (ME)). More considerations on polysemous words in marine engineering are given in Section 8. Limitations of the study.
The lists of most common abbreviations should be added to the dictionary, which might also include more relevant details, such as the marked part of speech and further examples, or perhaps illustrations. The words can be followed by the frequency mark, such as, for example, in Macmillan's Dictionary (2012), bearing in mind the specific corpus it was derived from.
For the purpose of generality and following the organization of various dictionaries, we offer a monolingual simplified example for the most frequent word (appearing 10,871 times) in the instruction books:
(1) valve /vælv/ n. - a device that opens and closes to control the flow of a liquid or gas
Following the main term and definition, AntConc (Anthony 2019) (or another programme offering a similar function) provides an excellent opportunity to check the concordances of certain terms from the corpus. Noun phrases and certain collocations containing the target term usually follow the head definition.
(2) valve /vælv/ n.
In addition to collocations, dictionaries often contain authentic examples of the use of the words within phrases and sentences, which can easily be found using the same programme. For example:
(6) by-pass valve - ex. Adjust the oil pressure to a suitable level on the bypass valve.
Comparing the means of presentation, word families are generally used for word lists for practical reasons, whilst in practice they are expanded, usually to include all family members when used for the programme analyses presented above. For dictionaries (and glossaries) it is more convenient and useful to use lemmas, so that different parts of speech can be presented as separate items (for example actuate v., actuation n., actuator n.). For this purpose, again Familizer + Lemmatizer (Cobb 2018) might be used. For example, the single word family bear comprises 12 family members (word forms), whilst, for instance, in a bilingual maritime dictionary (Rapovac 2002) we can find the meanings and examples of bear, bearer and bearing, in various collocations and contexts. Based on the proposed software being used, we should note here that the selection should rely upon the frequency of tokens converted directly into lemmas (and accompanied by compounds and phrases) in order to reduce the interference of the software in the selection process during the list expansion and reduction processes. Further details on the organization of dictionaries vary according to their intended purposes and the choices made by the authors. Compared to the painstaking endeavour of finding relevant examples and the collection and processing of abundant material manually, such as, for example, was the case with the quotation slips used for the first Oxford English Dictionary, the idea of this paper was to present some of the available IT-supported methodologies of corpus linguistics, which make the process incomparably easier and faster, but also much more accurate, thus providing legitimacy to the use of this system.
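The difference between counting by word family and counting by lemma can be shown with a small sketch. The `FORMS` table below is a hand-made, hypothetical fragment built around the actuate example mentioned above; it does not reproduce the output of Familizer + Lemmatizer, and the token sample is invented.

```python
from collections import Counter

# Hypothetical table: word form -> (lemma, part of speech, family headword).
FORMS = {
    "actuate": ("actuate", "v.", "actuate"),
    "actuated": ("actuate", "v.", "actuate"),
    "actuation": ("actuation", "n.", "actuate"),
    "actuator": ("actuator", "n.", "actuate"),
    "actuators": ("actuator", "n.", "actuate"),
}

def count_by_family(tokens):
    """One count per family headword, as in a word list."""
    return Counter(FORMS[t][2] for t in tokens if t in FORMS)

def count_by_lemma(tokens):
    """One count per (lemma, part of speech), as in dictionary entries."""
    return Counter(FORMS[t][:2] for t in tokens if t in FORMS)

tokens = ["actuator", "actuators", "actuated", "actuation", "actuator"]
print(count_by_family(tokens))
print(count_by_lemma(tokens))
```

The five tokens collapse into a single family entry (actuate) but into three lemma entries (actuator n., actuate v., actuation n.), which is why a frequency-based selection for a dictionary can differ from one for a word list.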

Limitations of the study
Despite the best efforts of the author to apply the most accurate methodology and make the best possible and reasonable decisions along the way, we accept that neither the methodology nor the results are ever perfect (Nation 2016: 182). For example, the conversion of large instruction books, some running to over 800 pages of scanned material, resulted in many typos and conversion errors. These inflated the number of words outside the lists, whereas a number of them should have added to the frequencies within the lists. The programme itself does not recognize different spelling options (e.g. authorized and authorised), and thus counts them as separate words with separate frequencies, where the frequency should in fact be cumulative. Moreover, multiword units are always a point of issue in this kind of research (Nation 2016). For example, the same unit can be written as separate words (e.g. cam shaft) or as a single word (camshaft), which makes the frequency statistics not entirely precise. Additionally, careful attention should be paid to the words "unclassified" by the programmes, which are often the "missing" members of the general English word families or unrecognized technical words. In particular, general programmes such as Familizer + Lemmatizer do not recognize certain words from the marine engineering register, such as crosshead or bedplate, and thus we had to retrieve many of them from the "unclassified" word types and manually add them to the lists used or produced by the programme.
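One possible way of mitigating the spelling and compound issues described above is to normalize the tokens before counting them, so that variant forms share a cumulative frequency. The Python sketch below is a minimal illustration under stated assumptions, not the procedure applied in this study: the two mapping tables are hypothetical, hand-made examples, and a real project would need much fuller lists compiled by an expert.

```python
import re
from collections import Counter

# Hypothetical, hand-made mappings used only for this illustration.
SPELLING_VARIANTS = {"authorised": "authorized"}
OPEN_COMPOUNDS = {("cam", "shaft"): "camshaft"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def normalize(tokens):
    """Join known open compounds and map spelling variants to one form."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in OPEN_COMPOUNDS:
            # Two tokens such as "cam" + "shaft" become one type.
            merged.append(OPEN_COMPOUNDS[pair])
            i += 2
        else:
            merged.append(SPELLING_VARIANTS.get(tokens[i], tokens[i]))
            i += 1
    return merged

def frequencies(text):
    """Count normalized tokens so that variants share one frequency."""
    return Counter(normalize(tokenize(text)))

# Invented sample sentences, not taken from the corpus.
sample = ("Only authorised staff may remove the cam shaft. "
          "Authorized personnel inspect the camshaft daily.")
counts = frequencies(sample)
print(counts["authorized"], counts["camshaft"])
```

In this sketch both spellings and both compound forms are counted under a single entry. Joining open compounds automatically carries a risk of false matches, which is one more reason why such mappings would have to be checked manually.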
In a branch of ESP as technical as marine engineering English, additional attention should be paid to semantic issues. In particular, a certain portion of the words could be classified as belonging to Step 3 of a four-step rating scale for technical vocabulary (Chung and Nation 2003: 104), being referred to as polysemous or cryptotechnical words (Fraser 2009). This means that some of the words classified as belonging to the most frequent general English words often have a completely new and technical meaning in marine engineering contexts, either individually (e.g. wear, draught, average) or in collocations (e.g. jacket water cooling system, guide shoe). In the subsequent table, we present some examples from the first 3,000 GE words which have different meanings in marine engineering English, either individually or in collocations. The definitions are somewhat shortened and simplified for practical reasons of presentation. Here again, we have to mention the conditionality of classification, since many of the technical terms are shared among various fields, in this case especially with general engineering and other branches of engineering. Overlaps are not a rarity in the produced word lists, which is why some authors have decided to build common core vocabulary word lists, such as the Academic Word List (Coxhead 2000), the Science-Specific Word List (Coxhead and Hirsch 2007), and other similar lists.
When it comes to bilingual formats, things get even more complex, depending on the lexicographical features of the local language in relation to the maritime sector, which most usually involves the colloquial use of numerous Anglicisms and words borrowed from Italian or Spanish, for example. For all the reasons mentioned above, this kind of analysis cannot be left to pure statistical processing; close human (expert) intervention is evidently required every step of the way.
Considering the corpus itself, for university students, for example, it can be broadened by the use of textbooks and possibly scientific articles, although the latter can make a separate academic corpus to examine in terms of lexis. The reason for doing so can be seen in the contents of the list we produced. We can easily spot words that come from outside the marine engineering field and are related to, for example, physics (e.g. amplify), medicine (e.g. diaphragm), electrical engineering (e.g. coil), and so on. Therefore, the list as it stands could be named more precisely as "the word list of ship instruction books", and new and additional marine engineering word lists might be developed from a wider or different corpus.
In order to overcome the possible shortcomings of the overall process, we invested our best efforts in following and taking into consideration various previous findings and recommendations. Among other things, we tried to explain the application of the methodology and the decisions made along the way. This provides justification and makes the results clearer, but might also help other authors in building relevant specialized lists for other technical areas and, indeed, for other ESP learners.

Conclusion
Analysing the most practical language needs of our target language learners, in this case future and active marine engineers following English for Marine Engineering Purposes courses, we embarked on the ambitious project of collecting, selecting and analysing their key corpus of marine engineering instruction books in terms of both vocabulary types and the overall lexical burden. Following certain previous research findings, we applied the lexical profiling methodology and some of the most up-to-date software to create a specialized marine engineering word list. The final list comprises 337 word families plus 73 compounds and cumulatively covers 8.13% of the target corpus. Together with the 3,000 most frequent English words (BNC/COCA) and additional, broadened lists of proper nouns, marginal words and abbreviations, we succeeded in reaching the desired coverage level of 95%, more precisely, 95.54%. This means that, with a knowledge of the 3,000 most frequent English words, familiarity with a range of the most frequent proper names and abbreviations, and MEWL, fewer than 5 out of 100 words of this demanding type of technical material should remain unfamiliar to the reader, a percentage which should not significantly affect the comprehension of the text. Compared with the approximately 12,000 general English words initially needed to achieve a similar result, the amount of vocabulary required is greatly reduced, and as such, the list we have produced should certainly provide for early specialization in English for Marine Engineering Purposes. We have also put additional efforts into explaining the methodology and its application, and providing explanations and justifications for the decisions made during the (replicable) process and the final cut-off points, in order to make our own modest contribution to future research in the field and the enhancement of the available methodology.
Moreover, we sought to provide practical suggestions for building (marine engineering) glossaries and dictionaries based upon a similar methodology, which will, hopefully, be a matter of interest for other ESP teachers and lexicographers, just as it is for the author of this work.
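The coverage figures reported in this conclusion can be cross-checked with simple arithmetic. The Python sketch below uses only numbers stated in the paper (a corpus of 1,769,821 running words, 8.13% coverage by the marine engineering word list and 95.54% total coverage). The function and key names are hypothetical, and the share attributed to the other lists is merely the difference between the two reported percentages, not a separately reported result.

```python
# Figures taken from the paper.
CORPUS_TOKENS = 1_769_821   # running words in the corpus
MEWL_COVERAGE = 8.13        # % of tokens covered by the specialized list
TOTAL_COVERAGE = 95.54      # % covered by all lists combined

def coverage_summary(corpus_tokens, mewl_pct, total_pct):
    """Derive simple coverage statistics from the reported percentages."""
    return {
        # Approximate number of running words covered by the specialized list.
        "mewl_tokens": round(corpus_tokens * mewl_pct / 100),
        # Share left for the 3,000 general English words and supplementary
        # lists, obtained here only by subtraction.
        "other_lists_pct": round(total_pct - mewl_pct, 2),
        # Words per 100 running words that remain outside all lists.
        "unknown_per_100": round(100 - total_pct, 2),
    }

summary = coverage_summary(CORPUS_TOKENS, MEWL_COVERAGE, TOTAL_COVERAGE)
for name, value in summary.items():
    print(f"{name}: {value}")
```

On these figures, the specialized list accounts for roughly 143,886 running words, and about 4.46 words in every 100 fall outside all lists, which is consistent with the "fewer than 5 out of 100" statement above.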