Today was the fourth day of ALTA’s Machine Learning for ELT Summer School in Crete. We focused on psychometric testing for the first part of the day and then on vocabulary acquisition later in the afternoon. As with other days, this summary is the information as I understood it. I welcome all corrections, clarifications and comments.
Day 4 began with a workshop from Mark Elliott of Cambridge English on the Psychometrics of Language Assessment. First, Mark introduced the underlying challenge and motivation for psychometric testing: how to effectively and reliably rank test takers on an ability cline.
Classical test theory tried to answer this challenge by saying that a person’s observed score in a test is equal to their true score (what they got right that they should have got right) plus the error (any changes in score due to luck, attention lapses etc.).
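In symbols, this decomposition is usually written as below. The letters X (observed score), T (true score) and E (error) are the conventional textbook notation rather than anything taken from Mark's slides, and the second line is the standard assumption that errors average out to zero over repeated testing.

```latex
X = T + E, \qquad \mathbb{E}[E] = 0
```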
This approach had a number of disadvantages:
- Ability is test dependent
- Item difficulty is test dependent
- Ability and difficulty are not ranked on the same scale
- Scores are not on an equal-interval scale (the same difference in raw score at different places on the overall cline will mean different things in terms of ability)
- The approach is test oriented rather than being item oriented
- Missing data causes significant issues
Enter Rasch and other Item Response Theory methods of test and item design, where the challenge level of individual items is calculated.
For each item in a test, sampling it in a pre-test lets us see which people get it right and wrong, and what the overall ability of those test takers is (they are referred to as persons rather than people!). From this we can create an Item Characteristic Curve (ICC) and work out the probability of people getting the item correct: basically how hard it is, and for whom.
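As a rough illustration of what an ICC encodes, here is a minimal sketch of the dichotomous Rasch model, in which the probability of a correct answer depends only on the gap between person ability and item difficulty (both in logits). The function names and the example difficulty of 0.5 are made up for illustration; in practice the difficulty would be estimated from pre-test data rather than typed in.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_characteristic_curve(difficulty: float, abilities):
    """Return (ability, probability) points tracing one item's ICC."""
    return [(theta, rasch_probability(theta, difficulty)) for theta in abilities]


if __name__ == "__main__":
    # Hypothetical item with a difficulty of 0.5 logits.
    for theta, p in item_characteristic_curve(0.5, [-3, -2, -1, 0, 1, 2, 3]):
        print(f"ability {theta:+d} logits -> P(correct) = {p:.2f}")
```

A person whose ability equals the item's difficulty has a 50% chance of getting it right, and the curve rises smoothly from near 0 to near 1 as ability increases.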
This data can then be used to make the item better before it gets used in the real test (e.g. changing a distractor to make it easier/harder or more focused on the target it’s testing). Or once the test is live, this data can be used to shift around the raw score requirements for certain proficiency bands – e.g. if one version of a 40-item test contained more challenging items than another version, the data could be used to calculate the difference in raw score required to pass in each version.
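One way to picture the raw-score adjustment is sketched below, assuming the item difficulties for both versions are already calibrated on the same Rasch scale. The expected raw score for a person at the pass ability is the sum of their success probabilities across the items, so a harder version yields a lower equivalent raw cut score. The difficulties, the 0.5-logit shift and the pass ability are invented numbers, and real equating procedures involve more than this.

```python
import math


def p_correct(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_raw_score(ability: float, difficulties) -> float:
    """Expected number of items answered correctly at a given ability."""
    return sum(p_correct(ability, b) for b in difficulties)


if __name__ == "__main__":
    # Two hypothetical 40-item versions; version B is 0.5 logits harder throughout.
    version_a = [-1.0 + 0.05 * i for i in range(40)]
    version_b = [b + 0.5 for b in version_a]
    pass_ability = 0.5  # assumed ability (in logits) needed to pass

    cut_a = expected_raw_score(pass_ability, version_a)
    cut_b = expected_raw_score(pass_ability, version_b)
    print(f"Equivalent raw cut score, version A: {cut_a:.1f} / 40")
    print(f"Equivalent raw cut score, version B: {cut_b:.1f} / 40")
```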
The fundamental principle of the Rasch approach is that the level of challenge of an item can be calculated independently of the ability level of the test taker. A famous Rasch quote is that the comparison between two stimuli should be independent of which particular individuals were instrumental for the comparison; and vice versa (Rasch, 1961).
Mark mentioned some fundamental differences between IRT and Rasch models and that the two ‘camps’ often disagree. Could anyone point to a link or describe this disagreement/difference, as I didn’t pick it up?
We then went on to look at three different types of Rasch model and various techniques for testing how good an item is by seeing how well it fits the model.
- Dichotomous Rasch Model – used where there is a binary in terms of score, you either get it right or wrong!
- Polytomous Rasch Model – used where there may be a number of different scores for an item (e.g. a single item could get you 0, 1 or 2 points, depending on how well you answered it)
- Multifaceted (or many-facet) Rasch Model – used when there are multiple variables to take into account, such as rater severity or collusion
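To show how the polytomous case extends the right/wrong one, here is a minimal sketch using the partial credit formulation, which is one common polytomous Rasch model and an assumption on my part about which variant was meant. Each step up in score has its own threshold, and the function returns a probability for every possible score. The threshold values are hypothetical.

```python
import math


def partial_credit_probabilities(ability: float, thresholds):
    """Return [P(score=0), ..., P(score=m)] for an item with m step thresholds."""
    # Log-numerator for score k is the sum of (ability - threshold_j) for j <= k;
    # score 0 has an empty sum, i.e. 0.
    log_numerators = [0.0]
    for delta in thresholds:
        log_numerators.append(log_numerators[-1] + (ability - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical 0/1/2-point item with step thresholds at -0.5 and 1.0 logits.
    for theta in (-2.0, 0.0, 2.0):
        probs = partial_credit_probabilities(theta, [-0.5, 1.0])
        print(theta, [round(p, 2) for p in probs])
```

With a single threshold this reduces to the dichotomous model, which is a handy sanity check.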
There are then various tools and techniques used to test how well an item fits a model. This information is then used to make changes to poorly performing items and to calibrate the results of live tests. Mark described these ‘model fit’ methods as being like tools in a toolbox. Not all of them will be used to test every item, but certain tools will be brought out when needed. My understanding is that these tools are used to create an ICC of a test item that can then be used to indicate its predicted level of challenge for people of different abilities. The measures tell us how far the item deviates from the model and in what way. As a test writer or developer you then set your thresholds for how much of what sort of deviation is acceptable for items in your test.
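As an example of one such tool, the sketch below computes the infit and outfit mean-square statistics, which are widely used Rasch fit measures, though I can't say they are the exact ones Mark had in mind. Both compare observed right/wrong responses with the probabilities the model predicts. Values near 1 mean the item behaves about as the model expects, and larger values mean more unexpected responses. The sketch assumes person abilities and the item difficulty have already been estimated, and the data is invented.

```python
import math


def p_correct(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_fit(responses, abilities, difficulty: float):
    """Return (infit, outfit) mean-squares for one item.

    responses: 0/1 score per person; abilities: ability estimate per person.
    """
    squared_residuals = []
    variances = []
    for x, theta in zip(responses, abilities):
        p = p_correct(theta, difficulty)
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    # Outfit: plain average of squared standardised residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(variances)
    # Infit: information-weighted version, less sensitive to outlying persons.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


if __name__ == "__main__":
    abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
    print(item_fit([0, 0, 1, 1, 1], abilities, 0.0))  # orderly responses
    print(item_fit([1, 1, 0, 0, 0], abilities, 0.0))  # weak persons succeed, strong fail
```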
One other important thing to test is Differential Item Functioning (DIF), which measures the extent to which an item behaves differently for different groups of people of comparable ability (e.g. men/women, old/young, Spanish L1 / Chinese L1). The test generally used for DIF is the Mantel-Haenszel method. Items that behave differently for different groups will have a higher DIF score and can then be removed or edited to bring the bias down.
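For the curious, here is a minimal sketch of the core Mantel-Haenszel calculation. Test takers are first matched into bands by overall score, then within each band the odds of answering the item correctly are compared between a reference group and a focal group, and the bands are pooled into one common odds ratio. A ratio of 1 means no DIF. The delta conversion shown is the scale commonly associated with ETS; the counts, group labels and function names are invented for illustration.

```python
import math


def mantel_haenszel_odds_ratio(strata) -> float:
    """Pooled odds ratio across matched ability bands.

    strata: iterable of (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    Raises ZeroDivisionError if no band has both a reference error and a focal success.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # empty band contributes nothing
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    return numerator / denominator


def mh_delta(odds_ratio: float) -> float:
    """Convert the odds ratio to the delta scale; negative favours the reference group."""
    return -2.35 * math.log(odds_ratio)


if __name__ == "__main__":
    # Hypothetical counts for three score bands (low, middle, high).
    bands = [(20, 30, 12, 38), (35, 15, 27, 23), (45, 5, 40, 10)]
    ratio = mantel_haenszel_odds_ratio(bands)
    print(f"MH odds ratio: {ratio:.2f}, delta: {mh_delta(ratio):.2f}")
```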
Mark said that for Cambridge English exams, the manuals for test item writers are literally hundreds of pages long, specifying what must and must not be done for items to display an acceptable fit and acceptably low DIF. He mentioned that controlling for L1 and age DIF is especially hard and that a lot of the testing done is around improving this metric.
Questions/thoughts from the session:
- It would be good to know more about how this information is used to calibrate adaptive testing and teaching
- What exactly is the difference between IRT and Rasch modelling?
Measures of Linguistic Aptitudes and Abilities
After lunch we had a session on personality testing from Emmanuel Yannakoudakis of Ariston Psychometric Tests, and we then did an example language test ourselves. I wasn’t quite sure how the test worked but I got sent a report on my linguistic proficiency level for English (medium-high!). There was some suggestion that psychometric testing could help teachers by providing data on students’ learning styles. This made me a bit uncomfortable as I was pretty sure this was a theory which had been debunked in ELT. Others here did say that maybe that position is being revised though. A debate to be had (again) there maybe?
Vocabulary Acquisition Tools
Lastly we had a session on vocabulary acquisition from Daniel Gorin of Alphary. Daniel took us through the story of Alphary, from a pure spaced repetition flashcard app right through to what it is today and the corpus and NLP power that goes into the suggestions it makes for learners. Alphary is the intelligence behind a number of vocabulary apps, including the Oxford English Vocabulary Trainer.
Around 5 years ago, through conversations with educators and friends, Daniel began looking in more detail at what it actually means to know a word, and what a vocabulary acquisition tool should really provide in order to help learners achieve that goal. Through this research, two main conclusions were reached:
- Words are all about their connections with other words. Again the Firth quote was used: “You shall know a word by the company it keeps” (as an aside, this for me is the quote of the week so far, and pretty much informs all of the NLP and ML work being discussed here)
- Learners need, and are lacking, constructive, formative feedback from their self-study tools. Lots of EdTech tools and apps give binary right/wrong summative feedback, making it hard for learners to improve as a result of the effort they put in.
Alphary went about creating an app that helps people to learn the vocabulary they need, and gives them the best possible ROI by optimising the words taught and the feedback given. I really liked this idea of breaking through the right/wrong paradigm and allowing learners to see which aspects of their answers were right, and which were wrong. For example, in a gap fill, a learner may choose a word that fits grammatically, and has the right meaning, but doesn’t collocate with the words around it. This information can be really helpful for a learner, rather than a simple ‘incorrect’ bit of feedback. The mechanism used for this feedback in Alphary is Feebu, the feedback butterfly. Feebu visualises the correctness of the learner’s choice in different areas and then suggests hints that will help them answer more correctly.
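To make the idea of non-binary feedback concrete, here is a purely hypothetical sketch of how an answer might be judged on several dimensions at once. It is not Alphary's implementation: the class, the three dimensions (taken from the gap-fill example above) and the hint wording are all invented for illustration, and the hard part, deciding whether each dimension is satisfied, is left out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnswerFeedback:
    """Hypothetical per-dimension judgement of one gap-fill answer."""
    fits_grammar: bool
    fits_meaning: bool
    fits_collocation: bool

    def is_fully_correct(self) -> bool:
        return self.fits_grammar and self.fits_meaning and self.fits_collocation

    def hints(self) -> List[str]:
        """Return one formative hint per dimension that went wrong."""
        messages = []
        if not self.fits_grammar:
            messages.append("Check the word form: it does not fit the grammar of the sentence.")
        if not self.fits_meaning:
            messages.append("The meaning is off: look again at what the sentence is saying.")
        if not self.fits_collocation:
            messages.append("Right idea, but this word does not collocate with its neighbours.")
        return messages


if __name__ == "__main__":
    # A word that is grammatical and means the right thing, but does not collocate.
    feedback = AnswerFeedback(fits_grammar=True, fits_meaning=True, fits_collocation=False)
    print(feedback.is_fully_correct())
    for hint in feedback.hints():
        print("-", hint)
```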
From a product perspective, it was great to see an app that uses a lot of the corpus analysis and NLP techniques that we have seen discussed here this week and how they can have a real positive impact on the learner, much like with Write and Improve. Daniel mentioned that the Alphary team was an example of the coming together of experts from multiple disciplines to create something that really delivered some impact to learners. It would be great to see more of that and I wonder if maybe some product teams can be formed with some of the expertise here at this summer school?
Daniel finished with a few words of caution:
- People working in this space should be careful not to oversell the potential of automated teaching. These are early days and overselling technology too early can have a negative backlash down the line
- There is still a lot that they can’t get right at Alphary. Poor dictionary definitions or unhelpful feedback still present themselves relatively regularly and more work is needed to move the degree of acceptability higher
- You can’t possibly spend the amount of time required for effective vocabulary acquisition in an app, especially at B1/B2 level. Deliberate learning may only make up 20% of the total time required, so it’s important that we promote incidental learning and get learners exposed to English away from the classroom and the learning products they use
- Knowledge and skill are different. Vocabulary acquisition tools help learners improve their knowledge, which may in turn have a positive impact on skill, but it’s important to be cognisant of the differences.
It will be interesting to see developments from Alphary over the coming months and years.
Today is Diane Nicholls’ birthday, so with big birthday wishes to her, I’m off to join them all for some sort of celebration by the sea!