Using Learner Corpora for L2 Lexicography: Information on Collocational Errors for EFL Learners

In this paper, we describe an on-going project on a corpus of EFL (English as a Foreign Language) learners in Japan and its application to pedagogical dictionary compilation. We focus in particular on the learners' errors in verb collocation patterns and describe how a learner's dictionary can benefit from error information drawn from learner corpora.


Introduction
The recent development of corpus linguistics and actual corpora has been remarkable. Many recently published dictionaries have made use of large corpora in some way or another; for example, the COBUILD English Dictionary used the Bank of English, a corpus of 20 million words of contemporary English developed at the University of Birmingham. The Longman Dictionary of Contemporary English and the Oxford Advanced Learner's Dictionary of Current English used the British National Corpus, produced by an academic and industrial consortium consisting of Oxford University Press, Longman, Chambers Harrap, Oxford University Computing Services, Lancaster University's Unit for Computer Research on the English Language, and the British Library. Older research corpora were gathered by ICAME (International Computer Archive of Modern English), an international organization of linguists and information scientists working with English machine-readable texts. The aim of the organization is to collect and distribute information about English language material available for computer processing, and about linguistic research on this material, completed or in progress, in order to compile an archive of English text corpora in machine-readable form and to make material available to research institutions.
" . .Even: though many different kinds of corpora have been av~able internationally, very few researchers have yet built up' a corpus of the language learner.There may be a couple of'reasons for this; first, the data collected froin language learners is in.most cases erroneous.The primary interest of'the corpus builders at present is to describe the status quo of native speakers' language, so they are basically' not interested in collecting learner langUage data.Secondly, and related to the first, most of the' researchers in applied liriguistics or TESL (Teaching English as a Second Language) / TEFL (Teaching English as a Foreign Language) are not sufficiently informed about the expertise ofcorpus linguistics.They are either more or less classroom~riented researchers or theoreticians like UG-based SLA researchers (Universal Granun.ar-basedSecond Langu~ge Acquisition• researchers), who stress' the intuitionaf' the' native speaker, rather than the collection of a large text.
More and more attention has been paid, however, to building corpora of the language learner. To date, the International Corpus of Learner English (ICLE) has been one of the largest and most systematic corpus development projects in the world (Granger 1994). Longman has been developing a Learners' Corpus for its dictionary project.1 John Milton of Hong Kong University of Science and Technology has already collected about 8 million words of the writing of Chinese students of English.2 In Japan, Asao and others held a symposium on EFL learner corpora and SLA (Asao et al. 1995). Our project is also one of the few attempts to develop learner corpora.
In this paper, we will first describe our learner corpus project, then show the application of learner corpus data to English pedagogical dictionary-making as an example, and examine the potential which learner corpora have for future L2 lexicography.

TGU Learner Corpus Project
Tokyo Gakugei University, a national teacher training college in Tokyo, Japan, launched a project on EFL writing instruction in 1988. The primary research interest was focussed on the effect of teacher feedback on EFL writing quantity and quality (see Hatori et al. 1990; Kanatani et al. 1993; Tono and Kanatani 1995 for more details). Throughout the data-collection procedures, we collected free composition data in English from subjects of different academic backgrounds (eighth grade through twelfth grade) and accumulated the data in machine-readable form.
Table 1 shows the framework of the corpus. The data collection procedure has been largely dependent on the research design of the original writing project. Therefore, the learner profile does not seem to be entirely systematic. For instance, the data for third-year senior high school students was obtained for the first project in 1989, when the number of subjects was 280, but the following project in 1993 only allowed for 120 subjects for each grade. The data for SH1 was not obtained at the time of the second project in 1993, because the primary focus of our original project was to see whether teacher feedback on writing made a difference as the academic grades increased. This is also the reason why we did not obtain the data for SH1 which would enable us to see the differences between the JH groups and the SH groups clearly. We will have to fill the gap by collecting data for SH1 in the near future. We should also note that the data available for the present study was limited to the subcorpora other than SH3: because of learner profile database management problems, we could not use the SH3 data for the analysis of collocation errors.
The size of the whole corpus is about 0.7 million words. As can be seen in Table 1, the size becomes larger for the upper-grade groups because more advanced students wrote longer essays. We will have to collect more data for the lower-grade samples to create a balance in size among the different academic grade sub-corpora. This corpus is one of the largest learner corpora available in Japan and probably one of the first attempts in the world to collect interlanguage data from different developmental stages.

Collocation errors of English basic verbs
In order to see how the learner corpora contribute to L2 dictionary-making, let us look at the actual data taken from the corpora and discuss its application for lexicographical description.Since it was impossible to examine all the lexical items in the corpus, we chose basic verbs and their collocations for analysis.
Table 2 shows the list of the verbs used for the study. As Sinclair says, in order to study the behavior of words in texts, we need to have quite a large number of occurrences available: "About half of the vocabulary of a text - even a very long text - consists of words that have occurred once only in that text" (Sinclair 1991: 18). In order to get statistically meaningful results, we have to obtain enough observations for each lexical item. In this sense, it was difficult to deal with lexical items whose frequencies were relatively low. Since the size of our learner corpus was around 0.7 million words, it was almost impossible to have enough occurrences of each of the basic verbs listed in Table 2.
An alternative was to choose high frequency words such as the, of, and, to, a, in, that, I, it, and so forth. These forms occur so frequently that there is no problem in applying statistical procedures to them. However, as can be seen, most of them are so-called function words, and the behavior of these words is rather fixed. We thought that it would be more interesting and of more central importance to include basic verbs in our scope. This does not necessarily mean that the study of those function words is unimportant; we would like to deal with those items in future research.

4.1 The basic procedure of text processing

The basic procedure of text processing is shown below.3 In our analysis, we did not have to preprocess the data as such, because it was not taken from electronic sources or OCR. The problem, however, is that we transcribed the compositions in a Japanese word-processed format, so all the characters were typed as Japanese text. This made it difficult to compare our data with the data from the Bank of English using PC-DOS programs such as LEXA: the Japanese characters were simply unrecognizable to the program.
Tagging and low-level parsing are necessary steps for investigating the syntactic behavior of words in depth, but in this instance we could not use these procedures. The main reason was that a normal tagger or parser does not work correctly on erroneous texts. Therefore, if you are serious about tagging learner data, you have to do it manually, or first run an automatic tagger and then correct the text manually. This will be one of the biggest obstacles for further research in this area.4 Very few studies have been done on how to systematically tag erroneous texts (see, however, Meunier forthcoming). We believe that we will have to overcome this problem in order to fully appreciate the benefit of learner corpora.
We obtained three different types of statistics for each verb lemma. Here we have to clarify the use of the terms. A lemma is what we normally mean by a 'word.' Many words in English have several actual word-forms; for example, the verb to give has the forms give, gives, given, gave, giving, and to give. In this text, the composite set of word-forms is called the lemma. This definition is based upon Sinclair (1991: 173). The three statistics are the frequency score, the MI-score and the t-score. Let us take a closer look at each of these.
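The grouping of word-forms under a lemma can be sketched in a few lines of Python; the hand-built form set and the function name below are illustrative only, not part of the original study:

```python
# The lemma GIVE, in Sinclair's sense, is the composite set of its
# word-forms. The set below is hand-built for illustration; a real
# lemmatiser would cover the whole vocabulary.
GIVE_FORMS = {"give", "gives", "given", "gave", "giving"}

def occurrences_of_lemma(tokens, forms=GIVE_FORMS):
    """Count every token that realises the lemma, whatever its form."""
    return sum(1 for t in tokens if t.lower() in forms)
```

Counting over the lemma rather than over a single inflected form is what allows statistics to be computed per verb instead of per inflection.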

Frequency scores
The simplest way to look at corpus data is to get a frequency list, i.e. a list of how often each different word-form occurs in the text. There are a couple of ways to arrange the list. Sinclair (1991: 30-31) has described three: first, turning the text into a list of the word-forms in the order of their first occurrence, noting the frequency of each; second, sorting it in alphabetical order; third, sorting it in frequency order. In any case, it is very easy to compare the relative occurrences of each word.
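The three orderings Sinclair describes can be sketched with Python's collections.Counter; the toy sentence is invented purely for illustration:

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat slept".split()
counts = Counter(tokens)

# 1. word-forms in order of first occurrence, with their frequencies
#    (dicts preserve insertion order, so duplicates collapse in place)
by_occurrence = [(w, counts[w]) for w in dict.fromkeys(tokens)]

# 2. alphabetical order
alphabetical = sorted(counts.items())

# 3. descending frequency order
by_frequency = counts.most_common()
```

All three views are derived from the same counts; only the sort key differs.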
Let us look at an example.Table 3 shows a part of the frequency list of the verb bring in our learner corpus.This data simply tells us that the most frequently occurring words with the lemma bring are my (69 times), out (62 times), a (26 times), the (12 times) and so forth.It indicates that the learners use this verb with noun phrases and phrasal verbs such as bring out.

Mutual Information Statistic
The mutual information statistic was first introduced for corpus analysis by Church and Hanks (1990). It basically works as a tool for identifying interesting associations among words in a corpus. Suppose that we saw the sequence "bring a" showing up a number of times in the concordances to BRING and wanted to know if there might be a linguistically interesting pattern. Some sequences in the concordances are interesting (e.g. bring out), but others such as bring a are not, even though they may be quite frequent. Mutual information can help distinguish the more interesting sequences from the less interesting ones by comparing the joint probability of the sequence with chance. Pairs of words with high mutual information scores are likely to be interesting to a researcher (for more details, see Church et al. 1991; 1994). Table 4 shows the mutual information statistic for bring.
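As a sketch of the statistic itself (not of the tooling used in this project), Church and Hanks's mutual information for a word pair can be computed from raw corpus frequencies, where f_x and f_y are the frequencies of the two words, f_xy the frequency of the pair, and n the corpus size:

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information (Church and Hanks 1990): log2 of
    the ratio between the observed joint probability of the pair and
    the probability expected if the two words were independent."""
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))
```

A pair that co-occurs far more often than chance predicts (like bring out) gets a high positive score; a frequent but unremarkable pair such as bring a scores much lower.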

t-scores
The t-score compares the probabilities that "a third word co-occurs with either of two words" (Grefenstette 1995: 61). Suppose, for example, that we are interested in whether it is more common to say powerful tea or strong tea. The t-score statistically examines which words are significantly more likely to appear after strong than after powerful (Church et al. 1991: 125). Table 5 shows the t-scores of the verb bring.
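The t-score for a single word pair can likewise be sketched from raw frequencies (again an illustration, not the project's actual tooling): the observed co-occurrence count is compared with the count expected under independence, scaled by the square root of the observed count as an estimate of the standard deviation (Church et al. 1991):

```python
import math

def t_score(f_xy, f_x, f_y, n):
    """t-score for a word pair: (observed - expected) / sqrt(observed),
    where 'expected' is the co-occurrence count predicted if the two
    words were distributed independently in a corpus of n words."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)
```

Because the t-score grows with raw counts, it favours frequent, reliable collocates, while mutual information favours rare but strongly associated ones; the two statistics are therefore complementary.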

Procedures for collocation data analysis
After choosing the basic verbs, we first obtained the frequency lists of each kind of verb form. Next we picked out collocation errors from the lists. Since we had not yet tagged all the texts, we could not pick out errors according to part-of-speech information. Instead, we identified the errors by looking at the first words which immediately followed the node words (in this case, verb lemmas).
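The extraction step just described, collecting the first word immediately following each occurrence of a node verb, can be sketched as follows; the function name and whitespace tokenisation are simplifying assumptions:

```python
def words_after_node(lines, node_forms):
    """For each tokenised line, collect the word immediately following
    any word-form of the node verb; the resulting list can then be
    scanned by hand for collocation errors."""
    followers = []
    for line in lines:
        tokens = line.lower().split()
        for i, tok in enumerate(tokens[:-1]):
            if tok in node_forms:
                followers.append(tokens[i + 1])
    return followers
```

Run over a learner sentence such as "I went concert with my friend" (an invented paraphrase of the corpus examples), the verb go would yield the follower concert, exposing the missing preposition.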

The results of verb collocation analysis
Table 6 indicates the relative frequency of the basic verbs selected for our study. The verb selection was made according to the frequency data of an English learner's dictionary. The frequency list indicates that even though we chose 70 different verbs, more than half of them could not actually be used for our study.5 For example, the verb carry occurred only 15 times in the whole learner corpus. It is very unlikely that any interesting error pattern would appear in such small samples. If we try to generalize any particular pattern by statistically judging its probability, then we need an expected frequency of more than 10 in each cell.6 In our case in Table 6, only a small number of verbs such as become, bring, come, go, get, have, make, play, see, take, think, and want meet this condition.
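The expected-frequency condition can be checked mechanically. A minimal sketch, assuming a simple two-way contingency table of observed counts (the function names are ours, not the study's):

```python
def expected_frequencies(observed):
    """Expected cell counts under independence for a two-way table:
    row total * column total / grand total, for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def usable_for_chi_square(observed, threshold=10):
    """Apply the rule of thumb from the text: every expected cell
    frequency must exceed the threshold before a chi-square test of
    the collocation pattern is trusted."""
    return all(cell > threshold
               for row in expected_frequencies(observed)
               for cell in row)
```

A verb like carry, with only 15 occurrences spread over several patterns, fails this check, which is why so many of the 70 verbs had to be excluded.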
Table 7 shows the list of verb collocation errors. The number of learner errors obtained from the individual composition tasks was rather limited. The main reason for this is that for our free composition tasks we did not use a multiple-draft design in which the subjects were asked to rewrite the same drafts again and again. Instead, we used different topics for each writing task. Therefore, it was more difficult to collect data on the same error patterns or corrected forms of the same verbs across different compositions. In spite of the difficulties in data collection, the table still indicates some interesting error patterns in basic verb collocations. We will now discuss the results and their implications for L2 dictionary-making.

6. Integrating the error information into lexicographical description

The learner corpus data shows that the learners have fixed error patterns in their use of verb collocations. There are many possible sources of errors, such as interlingual errors (overgeneralisation from L1 structures or semantic or lexical structures) or intralingual errors (overapplication of L2 rules, etc.). Whatever the sources, it would be useful for learners of English to find information on frequently occurring error patterns. Let us look at some of the common error patterns for EFL learners in Japan and how we could integrate such error information into the dictionary design.

Errors of verb meanings
The results show that the learners had a tendency to use wrong verbs which were quite similar in meaning. *Become to do, for instance, is a literal translation of the Japanese phrase "suru youni naru". Learners usually learn the meaning of become as "naru" and come as "kuru". For Japanese learners of English, the word become is more strongly associated with the phrase "suru youni naru" (come to do) than the word come itself. This kind of error is caused by L1 transfer of verb meanings. The same type of error was observed in phrases such as *look a dream (in Japanese, the verb miru (look; watch) is used for "have a dream") or *take concert (which means "have a concert").
In L2 lexicography, therefore, it is very important to provide usage notes on frequently occurring errors such as *become to do under the entry come or become. Such learner errors have been ignored in describing a lexical entry, but if a dictionary is designed for language learners, it should contain such information in areas that are problematical for them.

Errors of verb patterns and collocations
The data also shows that learners make quite systematic errors in the use of prepositions or particles after verbs. For instance, many subjects dropped the prepositions in phrases such as "come to ...", "come back to ...", "go to ...", "look at ...", "think about ..." and so on. In Japanese, no prepositions are needed for these verb expressions, so this might be another case of L1 interference. It is also quite confusing for Japanese learners of English that some of these verbs can be used without a preposition if the following element is an adverbial (e.g. Come here. Come back here. Go home.). Therefore, knowing which prepositions or particles should follow a verb is another problematical area for Japanese learners of English.
Another common error is the use of wrong verb patterns, for example *go to shopping instead of go shopping, or *want do for want to do. These grammatical patterns are very complicated for Japanese learners, who have to learn the behavior of each verb one by one. Currently, most bilingual dictionaries in Japan and monolingual learners' dictionaries such as LDOCE or COBUILD provide useful grammatical codes for these verb patterns. The information on the verb patterns most difficult for certain groups of learners, however, has not been fully investigated and described in a dictionary.
In pedagogical dictionaries, more information on these collocation errors of "verb + preposition / particle" or other verb patterns should be systematically provided. In particular, learners should be warned not only about possible errors but also about frequent errors, which requires collecting more data on learner English. For advanced learners, collocation information for the verbs or nouns at the 5000 to 7000 word levels is very important, but not many dictionaries offer useful information in a systematic way for this level of lexical item.

Conclusion
So far we have seen how a learner corpus can contribute to the systematic analysis of learner errors and how those errors should be dealt with in dictionaries. The effect of negative evidence (i.e. information that 'something is not possible') in dictionaries has yet to be empirically tested, but it is worth noting that information on L1-related errors or the most frequently occurring errors can provide L2 dictionary users with useful guidelines for correct usage. Some bilingual (English-Japanese) dictionaries in Japan contain this kind of negative information, but there are still many editors and lexicographers who have reservations about presenting "incorrect" usage in a dictionary. This question, however, is worth investigating empirically, and more attempts should be made to improve the design of pedagogical dictionaries in order to best suit the needs of language learners.

Notes
1. Longman is said to have about 8 million words in its learner's corpus (P. Scholfield, personal communication).
2.

3. Some researchers classify tagging and parsing as one of the preprocessing stages (for instance, Church et al. 1991). This stage was based upon the lecture given by Gregory Grefenstette in the Seminar on Computational Lexicography at Kossuth Lajos University in Debrecen, Hungary, from Nov. 27 to Dec. 1, 1995.
4. I would like to thank Fanny Meunier for her helpful comments on the problem of error tagging.
5. These frequency scores were based upon the data from JH-2, JH-3, and SH-2 because of the technical problems we had at the time of data analysis in 1995. This is why the frequency scores were rather small in size.
6. Brown (1988: 190). Usually the occurrences of certain lexical patterns are regarded as frequencies of certain categorical variables. Therefore, non-parametric analysis such as chi-square or the phi-coefficient is suitable.

make NP to do | 1 | Who made *people to eat three meals?
play (no prep) NP | 2 | I used OTOSHIDAMA for buy the clothes and play *city.
be run by NP | 3 | I was being run *by a bad man.
say NP + NP | 1 | I want to say *my classmate "Thank you."
show to NP | 1 | I want show *to everyone as many as I could.
sit (no prep) NP | 4 | I can sit *the seat in train.
speak (no prep) NP | 1 | He spoke *many people about his adventure.
stand (no prep) NP | 1 | I hope that I will stand *the stage.
stop (no prep) NP | 1 | That trains don't stop *station that I was there.
take (wrong) NP | 1 | A American rock band come to Japan and take *concert.
be written NP | 2 | A paper on my desk is written *"look behind you."
Note: The column n indicates the number of subjects who made these errors.

look a dream | 1 | I don't like to look *a dream

talk NP | 2 | He wouldn't talk *anyone else about his story
tell NP NP | 1 | I may not tell *others it.
think (no prep) NP | 15 | I can't think *an things
think VP | 8 | I thought *buy new word-processer
turn back + Adj | 1 | he tried to turn *back young,
turn NP | 3 | After he turned *old man,
want VP | 15 | I want *be tall
wish if ... | 1 | I wish *if there were some.

Table 1: Framework of the TGU Learner Corpus Project

This project was originally conducted to compare the junior high school groups and the senior high school groups in terms of the effect of teacher feedback on composition drafts. The data for SH1 was not taken because of the technical problems of our original research design. Urashima Taro is a traditional Japanese folktale: Urashima saved a turtle, and it took him to the Sea Paradise, where he received a beautiful gift box. When he was back on shore, he opened the box and became an old man. The students were asked to write what happened to Urashima after that.

Table 7: The list of collocational errors of the basic verbs

change NP | ? | It became bigger and bigger and changed *Doraemon
be changed NP | 2 | He was changed *his seikaku
come (no prep) NP | 9 | Why don't you come *my school festival?
come back NP | 8 | So he was glad to come *back his house.
get use to NP | 1 | we got *use to it when we were going ...
go (no prep) NP | 13 | I went *concert this group with my friend.
help ... | ? | In the band, I help *making posters.
hold (be held) | 7 | The festival *held for three days.
keep ... | ? | If you were to keep *to take no breakfast.
live ... | ? | living *a wonderous room.