Faculty Interview: Nina Vyatkina
Interviewed by Brendan Allen, November 2012
Would you like to introduce yourself?
My name is Nina Vyatkina, and I’m an Assistant Professor of German Applied Linguistics in the department of Germanic Languages and Literatures at KU.
Just to start off, what would you say is your specialized research field?
The major focus of my research program is the longitudinal development of foreign language abilities in college students. Basically, I study how students at KU learn German. I’m interested in tracking them from the very incipient levels of proficiency. I’m particularly interested in the following research questions: How do learners’ foreign language abilities develop over time? Do they acquire patterns of the target language abruptly or gradually? How do specific methods of instruction and educational practices influence learner progress?
I seek to answer these questions by collecting and analyzing a large database of samples of learner production, namely writing. Obviously, in foreign language classes, students produce essays, even from the beginning levels of study. I’m collecting these samples over time, and at dense time intervals. I’m basically following the instructional progression. I’m not taking students to a laboratory or conducting experiments with them. Everything I study is embedded in the curriculum. In this sense, I can say that I use authentic tasks for my research – in the instructional, curricular context.
When you say “dense time intervals,” what exactly do you mean by that?
That’s a good question. In longitudinal research in the social sciences, data collection points are typically spread over several months. But for my research, it’s three to five weeks, because students will write something, say, according to a task in a specific chapter they are studying. I’m currently tracking students over their first four semesters of study, so that’s a lot of data.
Which classes are you studying?
I’m actually coordinating the German language proficiency sequence for students who are taking German for their language requirements. So these are beginners – they start with zero language skills. Of course, we also have “false beginners,” so to speak, who have had some German but still enroll in first semester German because of their placement, or because they didn’t get much out of their previous studies and forgot it, or they just want a fresh start. But many of them are zero proficiency students, and I track them over four semesters of study.
I imagine you have some sort of system to compile so much written data. How exactly does that work?
I’m using FileMaker Pro; it’s a data archiving software. It allows you to store primary data, which, for me, is these writing pieces, along with metadata. I need a lot of metadata – time of creation, level of the student, number of the student. It’s all anonymous; we assign students numbers, codes, or pseudonyms.
It’s all stored in relational tables, so you can retrieve any data from FileMaker Pro to extract whatever you need. For example, say you want to look at just data for essay number one in the first semester – you can just enter your metadata into the search engine and extract all the essays that you need.
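The retrieval step described here can be sketched in plain relational terms. The snippet below uses Python’s built-in sqlite3 module with an invented schema and invented sample rows, purely to illustrate filtering essays by metadata; FileMaker Pro itself has its own interface, and none of these table or column names come from the actual project.

```python
import sqlite3

# Hypothetical schema mirroring the corpus idea: each essay row carries
# its metadata (anonymized student code, semester, essay number, date)
# alongside the primary data, the essay text itself.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE essays (
        student_code TEXT,    -- anonymized code, e.g. "S001"
        semester     INTEGER,
        essay_no     INTEGER,
        written_on   TEXT,
        body         TEXT
    )
""")
conn.executemany(
    "INSERT INTO essays VALUES (?, ?, ?, ?, ?)",
    [
        ("S001", 1, 1, "2011-09-05", "Ich heisse Anna..."),
        ("S001", 1, 2, "2011-10-10", "Meine Familie ist..."),
        ("S002", 1, 1, "2011-09-05", "Ich komme aus..."),
    ],
)

# "Essay number one in the first semester": filter on the metadata columns.
rows = conn.execute(
    "SELECT student_code, body FROM essays WHERE semester = ? AND essay_no = ?",
    (1, 1),
).fetchall()
for code, body in rows:
    print(code, body)
```

Because the metadata lives in ordinary columns, any combination of student, semester, and task can be pulled out with one query.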
In terms of the actual data, have you found any specific patterns in how students are picking up the language? Or is it too early to ask?
It is a little early. With this particular project, for which I got the seed grant from Digital Humanities, my task was to provide data with annotations by tagging the corpus. So far, I’ve only told you about the raw learner language corpus, just the text of the essays. But we need target features, so I’m really looking at complexity development in learner production.
Maybe I should take a step back. One of the most developed directions in longitudinal language learning research is studying complexity, accuracy, and fluency – three dimensions of language production. As you can see, these are interrelated but somewhat different. For example, we want learners to start using more and more subordinate clauses, complex sentences, complex vocabulary – but this is a different dimension from accuracy. They may interact in complex ways, obviously – if you start using more difficult language, then your accuracy decreases, and so forth. Fluency is another separate dimension – the amount of language you can produce in a unit of time.
How have you used this “tagging” system to annotate your raw data?
I first started looking at complexity, while ignoring the errors altogether. For example, how does the syntax develop? To help find that, you need to tag your data for subordinate clauses. So – my question was, can computers and software help us with that? Because if you deal with such massive amounts of data, you need help.
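As a toy illustration of what “tagging for subordinate clauses” means at the simplest level – not the approach the project actually uses – one could count clauses introduced by common German subordinating conjunctions. The conjunction list and sample sentence below are invented for the example.

```python
import re

# Naive heuristic, for illustration only: count clauses introduced by a
# handful of common German subordinating conjunctions. Real syntactic
# parsers use full grammatical models; this just shows the simplest
# possible automatic approach to the same question.
SUBORDINATORS = {"dass", "weil", "wenn", "obwohl"}

def count_subordinate_clauses(text: str) -> int:
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for tok in tokens if tok in SUBORDINATORS)

sample = "Ich bleibe zu Hause, weil es regnet. Er sagt, dass er kommt."
print(count_subordinate_clauses(sample))  # finds "weil" and "dass"
```

A heuristic like this breaks down quickly on learner language, which is exactly why the project turned to more robust tools and human annotators.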
And there are some digital tools out there: There are part-of-speech taggers, as well as syntactic dependency parsers, which work well on native language data. Such parsers exist for both German and English. So we have such parsers, but how do they perform for a learner language, where there are errors?
Now the question is, how do you tag learner data, and what does it mean? A learner can attempt to produce a relative clause, but what comes out may be something completely different and unrecognizable. So I partnered with a computational linguistics team at the Humboldt University of Berlin, led by Dr. Anke Lüdeling. They have a well-developed approach to tagging learner production data, including learner German. However, they have dealt with more advanced learners so far – I’m dealing with beginners, where it’s even more complicated. But they have experience, and they have already developed a database that is publicly available and searchable. I contacted them and said that I was interested in adding my learner corpus to their database, and they were very much interested.
The project is still ongoing, but we’ve pretty much finished the first stage. We are applying a combination of automatic and manual tagging. Basically, what I needed this seed grant money for was human annotators – machines are unreliable for annotating learner language. I employ two graduate students in the German department who tag the data manually – for different linguistic categories, but also for errors.
We employed a system suggested by our German colleagues that has multiple layers – you can add an indefinite number of layers to the data. It’s divided into “tokens,” which basically means words and punctuation marks. Each essay is a stretch of tokens. They are basically spread into a line and then divided into segments, and under each segment you can put your annotation. So first my human annotators created target hypotheses. They’re also experienced German teachers, working as graduate teaching assistants, so they knew the tasks that we gave the students and were in a position to guess what students intended to say. They basically corrected learner errors, and underneath each token they put their own corrections.
The next layer was differences – they marked what [the annotators] actually did: changed something, moved something, inserted something. We have special tags for the changes that the annotators make. And then, when studying complexity, you can run an automatic tagger on the corrected data, which can find your relative clauses or subordinate clauses – whatever you’re interested in. In this way you can study complexity independently from accuracy.
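The layered idea – a learner token line, an annotator’s target hypothesis, and a difference layer recording changes, insertions, and deletions – can be sketched with Python’s standard difflib. The sentence pair and tag names below are invented for illustration and are not the project’s actual tagset or tooling.

```python
from difflib import SequenceMatcher

# Invented example: a learner sentence with auxiliary and word-order
# errors, and an annotator's target hypothesis. The difference layer is
# derived automatically by aligning the two token lines.
learner = ["ich", "habe", "gegangen", "nach", "Hause"]
target  = ["ich", "bin", "nach", "Hause", "gegangen"]

diff_layer = []
for op, i1, i2, j1, j2 in SequenceMatcher(a=learner, b=target).get_opcodes():
    # op is one of "equal", "replace", "delete", "insert"
    diff_layer.append((op.upper(), learner[i1:i2], target[j1:j2]))

for entry in diff_layer:
    print(entry)
```

Each tuple pairs a stretch of learner tokens with the corresponding stretch of the target hypothesis, which is the essence of annotating “what the annotators actually did.”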
So why use two human annotators? In order to calculate the reliability of our coding. When you deal with humans, they also make mistakes or may have different opinions. Target hypotheses can differ. They tagged the same set of data independently, and we compared the results – if there were disagreements, we discussed them along with our German colleagues, because they have a very detailed manual for such annotations. Then we improved the procedure – in the end we had almost 100% agreement on the annotations.
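A standard way to quantify agreement between two annotators is Cohen’s kappa, which corrects raw percent agreement for the agreement expected by chance. The interview does not name the exact statistic the team used, so the sketch below is illustrative, with invented labels.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items:
    observed agreement corrected for chance agreement.
    (Undefined when expected agreement is exactly 1.)"""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators tagging ten tokens as error (E) or no error (O):
a1 = ["E", "O", "O", "E", "O", "O", "E", "O", "O", "O"]
a2 = ["E", "O", "O", "E", "O", "E", "E", "O", "O", "O"]
print(round(cohens_kappa(a1, a2), 3))
```

Here the annotators agree on 9 of 10 tokens (90%), but because most tokens are error-free, chance agreement is high, and kappa comes out noticeably lower than the raw percentage.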
We have a database with approximately 70,000 words by now, with one cohort of learners that have progressed over four semesters. We have all of this data coded for target hypotheses, and also for the corrections that the annotators made. My German colleagues also ran all this data through automatic language taggers and parsers – we have lots of layers of annotations.
Just today, my colleague from Germany who’s doing the computational side – I’m not a computational linguist, that’s their expertise – he just sent me the finalized database that is going online. I’m going to proofread it and open it to the public. It will be completely free and available to any interested researchers. So you can use it as a comparison to your own data, or ask your own research questions because it’s so richly annotated on different levels. We will provide all of the metadata description as well. It is supported at the Humboldt University of Berlin, on their server. I’m going to provide links in KU Scholar Works and on my personal website so that other researchers can access this database.
Does this database have a specific name?
KanDeL – the Kansas Developmental Learner corpus.
Are specific questions emerging about the data you’ve collected?
I have a lot of questions – first, how do students develop their writing style from a more verb-centered, narrative style to a more noun-centered style, which we expect to happen at later stages of development. I want to start with just one specific syntactic construction: prepositional phrases. When you think about it, prepositional phrases can function in two distinct ways: You can say something like “I went to New York,” and that would be a verb phrase. The prepositional phrase “to New York” would depend on the verb. But if you say “my trip to New York,” that would be a noun phrase, because the word the prepositional phrase depends on – “trip” – is a noun. “My trip to New York” lends itself to more argumentative writing styles, to more public discourse. Students are expected to move from primary discourse – talking about themselves, what they’ve done – to a more abstract level of writing, which is centered on using more nouns in general.
Using an automatic tool, you can extract all the prepositional phrases. But the tool also allows you to extract the prepositional phrases that modify verbs separately from the ones that modify nouns. That way, you can see whether the frequency of verb-attached phrases decreases over time and the frequency of noun-attached phrases increases. That would show you whether students develop from a more verbal style to a more nominal style, which is what you want to see. But we expect that not all students will do that – I want to see how these developmental paths differ, though I hope to see some development toward the nominal style to some extent.
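Once the verb-attached and noun-attached prepositional phrases have been counted per semester, the shift toward nominal style can be summarized as a simple ratio. The counts below are invented for illustration; they are not results from the corpus.

```python
# Hypothetical per-semester counts of prepositional phrases attached to
# verbs vs. nouns (invented numbers). A rising nominal ratio over
# semesters would indicate a shift toward a more noun-centered style.
pp_counts = {
    1: {"verbal": 40, "nominal": 5},
    2: {"verbal": 38, "nominal": 9},
    3: {"verbal": 35, "nominal": 15},
    4: {"verbal": 30, "nominal": 22},
}

def nominal_ratio(counts):
    total = counts["verbal"] + counts["nominal"]
    return counts["nominal"] / total

trend = {sem: round(nominal_ratio(c), 2) for sem, c in sorted(pp_counts.items())}
print(trend)
```

With real per-student counts, the same ratio computed per learner would expose the differing developmental paths mentioned above.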
So there’s a pattern that emerges when you see this data, one that moves towards the more nominal style?
Yes, and this is where the Digital Humanities helps us – finding patterns. You can visualize your data and draw out patterns. If you just look at raw data, it’s very difficult to see which students are using more verbal or nominal prepositional phrases. But if I tag them, say, with all of the verb-attached phrases in blue and all the noun-attached ones in red, then I can get a very nice, colorful picture that visualizes the data.
What types of visualizations have you found useful for representing your data?
Different types of charts and line graphs. I have used word clouds – not with words, but with parts of speech. Within such a word cloud, you can easily see whether students use more verbs or nouns. You can also tag discourse moves. You can explore narrative structure, for example. Each narrative has an introduction, development of the story, a climax, and then a coda – you can tag these in your text and study patterns.
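A part-of-speech “cloud” boils down to tag frequencies: each tag is sized by how often it occurs. A minimal sketch of the counting step, with an invented tag sequence:

```python
from collections import Counter

# Invented part-of-speech tags for the tokens of one short essay. The
# relative frequencies are what would size the labels in a POS word
# cloud: a noun-heavy essay yields a cloud dominated by "NOUN".
pos_tags = ["PRON", "VERB", "ADP", "NOUN", "PRON", "VERB", "NOUN",
            "DET", "NOUN", "VERB"]

freqs = Counter(pos_tags)
print(freqs.most_common())
```

Any word-cloud renderer can then be fed these counts in place of word counts, which is all that the parts-of-speech variant requires.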
What might be some broader applications for this research?
One broader application would be applying this process to other languages; another would be educational. I am planning on designing courses for both graduate and undergraduate students: How can you use corpus analysis tools and annotated data in language acquisition classes?
These tools can also be applied to any humanities research, if you need to annotate data. We’re all looking for patterns, no matter what we study.