Text+ User Story

Compile your own learner corpus – with Weblicht

Christian Mair (Albert-Ludwigs-Universität Freiburg)

DFG subject area: 104 Linguistics

Text+ data domain: Collections


My ‘user story’ is about digital infrastructure in applied linguistics, specifically in second-language acquisition (SLA) research. One of the ‘classic’ linguistic foundations of this type of research was provided by contrastive linguistics, which tended to emphasize mother-tongue ‘interference’ as a major source of errors in the foreign language. This approach is still useful to the extent that it accounts for a fair part of teachers’ practical experience in the classroom, but it has long been discredited as a stand-alone, comprehensive theoretical framework for explaining degrees of difficulty in foreign-language learning.

In this situation, the advent of large machine-readable corpora of learner language (English in my case, but increasingly available for many other widely taught foreign languages) has been a welcome improvement. It has advanced research on errors by going beyond traditional ‘error analysis’ and making it possible to study features of learner language that are not directly related to errors, such as over-representation and under-representation of structures, stylistic monotony, and unidiomatic uses of lexical items and phraseological chunks. Beyond their use in research, these digital corpora, archives and collections of learner language have enormous potential in teacher training, both during the student phase and in later in-service training.
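To make the notion of over- and under-representation concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions: the function names are hypothetical, the tokenizer is a crude regular expression, and real learner-corpus studies would use properly tokenized corpora and a significance measure rather than a bare ratio.

```python
import re
from collections import Counter


def relative_frequencies(text, per=1000):
    """Return each word's frequency per `per` tokens (crude regex tokenizer)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n * per / total for word, n in counts.items()}


def representation_ratio(word, learner_text, reference_text):
    """Compare normalized frequencies of `word` in two texts.

    A ratio above 1 suggests over-representation in the learner text,
    a ratio below 1 under-representation. Returns None when the word
    does not occur in the reference text (the ratio is undefined).
    """
    learner = relative_frequencies(learner_text).get(word, 0.0)
    reference = relative_frequencies(reference_text).get(word, 0.0)
    if reference == 0.0:
        return None
    return learner / reference


# Invented toy data, for illustration only.
learner = "very good very nice very big house"
reference = "very good and quite nice and rather big house"
ratio_very = representation_ratio("very", learner, reference)
ratio_quite = representation_ratio("quite", learner, reference)
```

In this toy example the ratio for "very" is well above 1 and the ratio for "quite" is 0, which is the kind of pattern (heavy reliance on one intensifier, avoidance of alternatives) that is invisible to an analysis restricted to errors.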

Working as a linguist in an English department offering a major teacher-training programme, I would like to report on small-scale experiments illustrating how existing infrastructure can inspire customized further infrastructure development that meets the needs of specific communities. I see them as pilots for more long-term and sustainable infrastructure development.


The main objective behind the activities I report in this user story was to change passive users of existing digital infrastructure into active agents, with an understanding of the relevant principles of infrastructure development and an ability to (i) reflect critically on the potential and limitations of existing infrastructure and (ii) articulate their needs and requirements for further infrastructure development.

In linguistics classes aimed at future teachers I introduced state-of-the-art collections of learner language: the International Corpus of Learner English (ICLE), which contains written essays by undergraduate students, and the Louvain International Database of Spoken English Interaction (LINDSEI), which contains conversations between native-speaker interviewers and learners. In a second step, I asked students to collect digital samples of their own foreign-language writing and develop them into personalized learner corpora using Weblicht, a robust and easy-to-use general-purpose corpus-linguistic tool that was developed as part of the CLARIN‑D infrastructure and will be maintained and developed further in Text+. Students generally used existing corpora skillfully, although somewhat uncritically. This changed immediately once they turned to their own data and Weblicht. While the absolute amounts of data thus processed remained relatively small, typically ranging between 50,000 and 100,000 words of text, students showed an extremely steep learning curve regarding all stages of the corpus-linguistic workflow, from data collection and mark-up through tokenization, lemmatization, part-of-speech tagging and, in some cases, parsing.
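The first stages of the workflow just described can be sketched in a few lines of Python. This is a toy, standard-library illustration and not Weblicht's own interface: the function names are hypothetical, the sentence splitter and tokenizer are deliberately naive, and lemmatization, part-of-speech tagging and parsing would require dedicated tools.

```python
import re


def split_sentences(text):
    """Naively split on sentence-final punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)


def to_vertical(text):
    """Return one token per line, with a blank line between sentences."""
    return "\n\n".join("\n".join(tokenize(s)) for s in split_sentences(text))


# Invented learner sentence, for illustration only.
sample = "I goed home. It's late!"
vertical = to_vertical(sample)
```

The one-token-per-line layout is a common input format for later annotation steps. Even at this stage students meet real design decisions, for example whether a contraction such as "It's" counts as one token or two.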


Using the Weblicht tool in the way described helps future teachers understand learner language much better than is possible through traditional methods. More importantly, it transforms them from passive users of existing digital infrastructure to active developers, who can combine existing components, tools and services to serve context- and task-specific needs as and when they arise – both during their studies and later, in their professional lives. 


I would assume that the particular challenge illustrated in my user story can be generalized. Whether the data are learner language, parliamentary speeches, any other kind of discourse or even multimodal (‘Text+’), it will always be beneficial for users of digital infrastructure to develop a deeper understanding of how such infrastructure is developed. My pilot activities in this area started as a personal initiative. I did not consider long-term sustainability, systematic dissemination, or coordinated maintenance and expansion of the infrastructure itself. This changed when I joined a CLARIN‑D discipline-specific working group, whose responsibilities included liaising between users and developers. In my view, the further development of resources such as Weblicht (and numerous others) in the context of Text+ and the NFDI, and in constant dialogue between funders, planners, developers and users, will ensure precisely this type of sustainability.

A large-scale ‘roll out’ of this approach, reaching the entire teacher-trainer and teaching communities, is desirable, but requires the stabilizing framework of an NFDI consortium such as Text+ for long-term success and sustainability. Offhand, I can think of numerous worthwhile national learner-corpus initiatives. Some questions, for example which languages should be covered by learner corpora in the German context, need coordination but are unlikely to arouse controversy. Others are potential political ‘grenades’, for example longitudinal studies of learner competence in foreign languages over time or regional comparisons of proficiency at Abitur level in the 16 ‘Länder’. In both cases the organizing support and institutional clout of a national initiative within NFDI will be sorely needed.

Review by community 

Yes, I would be happy to help review services provided by Text+.