All this week ELTjam are at the Machine Learning for ELT Conference in Crete.
Day 2 looks in more detail at automated error correction techniques, error correction related to content words, the importance of Learner Experience Design (LXD) with all this theory, and finally a look at the Write & Improve product from ALTA.
Grammatical Error Correction
Chris Bryant and Mariano Felice opened the day by looking at grammatical error correction, following on from Marek’s talk at the end of Day 1. They began by looking at a few different ways of analysing text for errors and their relative merits:
The first method is rule-based: you establish rules that a new text either adheres to or breaks. An example of a rule might be ‘a first person singular pronoun is followed by ‘am’’. If a piece of text breaks a rule, the system corrects it by substituting whatever wouldn’t break the rule. This is a simple method, but the sheer volume of rules to be updated and maintained quickly makes it complex, unruly and ultimately unmanageable.
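To make the rule-based idea concrete, here is a minimal sketch of my own (not from the talk). The `RULES` table and the `apply_rules` function are hypothetical names, and the single pattern only covers ‘I is’ and ‘I are’.

```python
import re

# Hypothetical rule table: (pattern that breaks the rule, replacement).
# Each entry encodes one hand-written rule.
RULES = [
    # Rule: a first person singular pronoun is followed by "am"
    # (only the "is"/"are" violations are covered here).
    (re.compile(r"\bI (is|are)\b"), "I am"),
]


def apply_rules(sentence: str) -> str:
    """Return the sentence with every rule violation replaced."""
    for pattern, replacement in RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence


print(apply_rules("I is happy and I are tired."))
```

Note that the rule as stated in the talk would also flag perfectly good sentences such as ‘I was late’, so a real system needs exceptions on top of rules, which is exactly how the rule set becomes unmanageable.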
The language model (LM) method follows on from what Andreas discussed on Day 1: a model is used to assess the probability that a certain sentence or combination of words is correct or acceptable. This is achieved by analysing N-grams (unigrams, bigrams, trigrams etc.) in training data, which makes it possible to predict, for any given context (or string of words), the most probable next word. The trade-off is that with short N-grams (1, 2 or 3 words) each combination occurs more frequently in the corpus, but some of the context of the sentence is lost. With longer N-grams (say 8, 9 or 10 words) the frequency of each combination is very low (often occurring only once in training, meaning that many combinations in new data will never have been seen before), but the context for the words that are seen is much better. A good compromise is apparently to go up to 5-grams!
In this method, backoff and smoothing techniques are used, basically to pretend we have seen data that in fact we haven’t.
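Here is a rough sketch of an N-gram model with smoothing, using bigrams and a toy corpus. It is my own illustration under simplifying assumptions: it uses add-one smoothing only (no backoff), and `CORPUS`, `bigram_prob` and `sentence_prob` are hypothetical names.

```python
from collections import Counter

# Toy "native" corpus; a real model is trained on vastly more text.
CORPUS = [
    "i am happy",
    "i am tired",
    "we are happy",
]

context_counts = Counter()  # how often each word appears as a context
bigram_counts = Counter()   # how often each (previous, next) pair appears
vocab = set()               # every word that can be predicted
for line in CORPUS:
    tokens = ["<s>"] + line.split() + ["</s>"]
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))
    vocab.update(tokens[1:])


def bigram_prob(prev: str, word: str) -> float:
    """P(word | prev) with add-one smoothing: unseen pairs get a small non-zero score."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))


def sentence_prob(sentence: str) -> float:
    """Multiply the smoothed bigram probabilities across the sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob


print(sentence_prob("i am happy") > sentence_prob("i are happy"))
```

The pair ‘i are’ never occurs in the toy corpus, but smoothing still gives it a small probability rather than zero, which is the ‘pretending we have seen it’ part.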
In this LM method, the following steps are taken to correct an error:
- Train model on native text (see above!)
- Create a ‘confusion set’ – for example, to test errors related to the use of prepositions, a confusion set would be a list of prepositions
- Find a learner sentence with a preposition error (e.g. a sentence where the learner’s use of a preposition results in a very low probability score)
- Substitute each of the other prepositions in the confusion set for the erroneous preposition
- Score all these alternative combinations for probability using the model
- Suggest one or a number of preposition options that score higher than the learner’s error
One advantage of this is that you only need native data, with no annotated error corrections, but it falls down in that probability does not always correlate with accuracy.
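The confusion-set steps can be sketched roughly as follows. This is an illustration rather than the presenters’ implementation: `rank_alternatives` is a hypothetical helper, and `toy_score` with its made-up numbers stands in for a trained language model.

```python
PREPOSITIONS = ["in", "on", "at", "to", "for"]  # toy confusion set


def rank_alternatives(tokens, index, score, confusion_set=PREPOSITIONS):
    """Return confusion-set words that outscore the learner's word at tokens[index], best first."""
    original_score = score(tokens)
    better = []
    for candidate in confusion_set:
        if candidate == tokens[index]:
            continue
        # Swap the candidate in for the learner's word and rescore the sentence.
        variant = tokens[:index] + [candidate] + tokens[index + 1:]
        candidate_score = score(variant)
        if candidate_score > original_score:
            better.append((candidate, candidate_score))
    return sorted(better, key=lambda pair: pair[1], reverse=True)


# Stand-in for a trained language model: invented scores keyed on the
# preposition in position 2 of "see you ___ monday".
TOY_SCORES = {"on": 0.30, "in": 0.05, "at": 0.02}


def toy_score(tokens):
    return TOY_SCORES.get(tokens[2], 0.001)


learner_sentence = "see you in monday".split()
print(rank_alternatives(learner_sentence, 2, toy_score))
```

Because only alternatives that beat the learner’s original are returned, an empty list means the model has nothing better to suggest.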
Instead of rules about what is right and wrong, a classifier system requires you to define features of right and wrong. The model is trained with native text and features are defined for different PoS tags (for example ‘were’ – is_auxiliary:yes, found_with_past_time_expressions:yes, follows_’we’:yes). The model, based on training data, then establishes which features are most reliable in defining a word’s correctness. When new data is fed in, the system classifies words in their contexts based on the features and decides whether each is correct or incorrect. If a word is incorrect, it can suggest options based on which alternatives are classified as accurate in that context.
This system is more flexible than rules and possible with only native data, but it can be complicated to define all the necessary features and doesn’t work well on more complicated error types.
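As a rough illustration of the classifier idea, here is a sketch that chooses between ‘was’ and ‘were’ from two context features. Everything in it is my assumption: the feature set, the toy `NATIVE` sentences, and the use of a naive Bayes classifier with add-one smoothing, which the talk did not specify.

```python
import math
from collections import Counter, defaultdict

TARGETS = {"was", "were"}  # the words the classifier chooses between

# Toy native training text (hypothetical examples).
NATIVE = [
    "we were late yesterday",
    "they were happy",
    "i was late yesterday",
    "she was happy",
    "we were tired",
    "he was tired",
]


def features(tokens, i):
    """Context features for the word at position i (a deliberately tiny feature set)."""
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return [f"prev={prev_word}", f"next={next_word}"]


label_counts = Counter()
feature_counts = defaultdict(Counter)
all_features = set()
for sentence in NATIVE:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        if word in TARGETS:
            label_counts[word] += 1
            for feat in features(tokens, i):
                feature_counts[word][feat] += 1
                all_features.add(feat)


def classify(tokens, i):
    """Pick the target word that best fits the context (naive Bayes, add-one smoothing)."""
    best, best_score = None, -math.inf
    for label in sorted(TARGETS):
        total = sum(feature_counts[label].values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for feat in features(tokens, i):
            score += math.log((feature_counts[label][feat] + 1) / (total + len(all_features)))
        if score > best_score:
            best, best_score = label, score
    return best


def check(tokens, i):
    """Return None if the learner's word matches the prediction, else the suggested word."""
    predicted = classify(tokens, i)
    return None if predicted == tokens[i] else predicted


print(check("we was tired".split(), 1))
```

Even in this toy version you can see where the effort goes: all the linguistic knowledge lives in `features`, and a real system needs many more of them per error type.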
Statistical Machine Translation
This method uses a translation model to determine whether a text has errors. We generally think of machine translation being used to translate between languages (e.g. Greek to English) but it can also be used to translate between ‘good’ English and ‘bad’ English!
A big requirement here is parallel data to train the model, so we need lots of examples of ‘bad’ English corrected into ‘good’ so that the model can build up an understanding of how corrections should be made. This is done by finding phrase pairs within sentence combinations (e.g. [‘I goed’:bad|’I went’:good]). If trained in this way, when seeing ‘I goed’ in new data, the model can suggest replacing it with ‘I went’.
The issue here is just how much annotated data is needed. This can be fudged to a certain degree by corrupting good data to make ‘bad’ English based on commonly made real errors, but this data isn’t as good as actual annotated data.
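To show why so much parallel data is needed, here is a heavily simplified sketch of learning and applying phrase pairs. It is not a real SMT system (there is no word alignment, language model or decoder), and `PARALLEL`, `learn_phrase_table` and `correct` are hypothetical names with invented examples.

```python
from collections import Counter, defaultdict

# Toy parallel data: (learner sentence, corrected sentence).
PARALLEL = [
    ("i goed home", "i went home"),
    ("yesterday i goed out", "yesterday i went out"),
    ("she goed away", "she went away"),
]


def learn_phrase_table(pairs):
    """Count two-word rewrites seen in equal-length sentence pairs."""
    table = defaultdict(Counter)
    for bad, good in pairs:
        bad_tokens, good_tokens = bad.split(), good.split()
        if len(bad_tokens) != len(good_tokens):
            continue  # real systems use word alignment; skipped in this sketch
        for i in range(1, len(bad_tokens)):
            if bad_tokens[i] != good_tokens[i]:
                source = (bad_tokens[i - 1], bad_tokens[i])
                target = (good_tokens[i - 1], good_tokens[i])
                table[source][target] += 1
    return table


def correct(sentence, table):
    """Replace each known 'bad' phrase with its most frequently seen correction."""
    tokens = sentence.split()
    for i in range(1, len(tokens)):
        key = (tokens[i - 1], tokens[i])
        if key in table:
            tokens[i - 1], tokens[i] = table[key].most_common(1)[0][0]
    return " ".join(tokens)


table = learn_phrase_table(PARALLEL)
print(correct("i goed to school", table))
print(correct("we goed home", table))
```

The second sentence comes back unchanged because the pair ‘we goed’ never appeared in training, which is the data-hunger problem in miniature.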
Tools: the FCE corpus is publicly available and contains annotated learner data that could be used for this.
Neural Machine Translation
This method is the most effective and most complicated new kid on the block! It’s this method that’s responsible for the big improvements in Google Translate over the last year.
- Words are given vectors based on the semantic space that they occupy
- An encoder builds up a representation of a sentence by combining the vectors of the individual words
- These representations can then be analysed for probability/accuracy
- A decoder interprets these representations or similar ones that are more likely (or the most likely translation into another language)
This has the advantage of being effective, but the disadvantage of being incredibly complicated and confusing.
After taking us through these different tools, we were given a task that you can try at home:
Go to this url and enter your email address to begin. You will see 20 questions, each of which begins with a text potentially containing learner error(s). Follow these steps:
- If you believe the original sentence is OK as is, tick that box and move to the next question, or…
- If you believe one of the machine translations below it is correct, tick the one you agree with and move to the next question, or…
- If you believe you could make a better suggestion, add it to the box at the bottom of the question and then move to the next question
Once you have done all 20, submit your answers and go to this url. Here you will see the results and can compare your error correction quality with that of other humans and also machine methods mentioned above.
In terms of interpreting the results, let’s look at how error corrections are evaluated. The process is as follows:
- Use an ML error correction method on a number of sentences
- Get human(s) to correct the same sentences
- Compare and compute
An important part of this comparison is aligning the edits made by the machine with those made by the human. In each case one of the following may be true:
1. Machine accurately corrects
2. Machine accurately doesn’t correct
3. Machine inaccurately corrects
4. Machine inaccurately doesn’t correct
In the case of 1 we see human-computer matches, and in the cases of 3 and 4, we can find mismatches. The number and ratio of these instances can then be analysed to find precision and recall.
Precision (P) relates to purity: how good the corrections that the machine makes are. It’s calculated as successful corrections (1) over total proposed corrections (1+3).
Recall (R) relates to coverage: how many of the errors the machine picked up. It’s calculated as successful corrections (1) over all existing errors (1+4).
It’s very hard to maximise both P and R. You can be very cautious and propose only the most likely corrections, but then recall is reduced, or you can increase recall by making lots of corrections, but precision goes down as a result.
The overall evaluation of an ML correction algorithm is achieved by combining the precision and recall scores. However, as it’s deemed to be far better to miss an error than to inaccurately correct something, P is given double the weighting when calculating the combination.
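Here is how those calculations might look in code. I’m assuming the combined score is the F-beta measure with beta set to 0.5, which weights precision twice as heavily as recall; the post above doesn’t name the formula, and the counts in the example are invented.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Successful corrections (case 1) over all proposed corrections (cases 1 + 3)."""
    proposed = true_pos + false_pos
    return true_pos / proposed if proposed else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Successful corrections (case 1) over all existing errors (cases 1 + 4)."""
    existing = true_pos + false_neg
    return true_pos / existing if existing else 0.0


def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    """Combine P and R; beta < 1 favours precision (assumed here to be 0.5)."""
    denominator = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denominator if denominator else 0.0


# Invented counts: 40 good corrections, 10 bad corrections, 60 missed errors.
p = precision(40, 10)
r = recall(40, 60)
print(p, r, f_beta(p, r))
```

With these numbers the cautious system scores better under beta = 0.5 than it would under an evenly weighted combination (beta = 1), because its high precision counts for more than its low recall.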
You can view Chris Bryant and Mariano’s slides here. The deck includes information about a new error tagging system that they are working on.
Error Detection in Content Words
After lunch Ekaterina Kochmar spoke about error detection in relation to content words. I will have to summarise here as bullet points as I was a little preoccupied with my own upcoming talk and dealing with a post-lunch slump!
1. Function words result in more learner errors, but content words account for more of the errors that impact understanding
2. There are many different types of content errors that learners can make, meaning they are harder to detect and correct than errors of function words
3. Learners may make errors due to:
- Similarity of meaning between words (‘big anger’ rather than ‘great anger’)
- Similarity of form (‘ancient Greek sightseeing’ rather than ‘ancient Greek sights’)
4. Meaning of a word can be approximated by its distribution (Firth: “You shall know a word by the company it keeps.”)
5. Semantic space construction is used to visualise words in a space based on the other words they co-occur with. This allows us to see which combinations of words are more or less likely to be accurate/accepted
6. A learner’s L1 will have an impact on the types of content error that they make
7. Correction follows a similar pattern to the other methods:
- Detect error
- Collect alternatives
- Compare and rank alternatives
- Suggest n-best alternatives
8. When comparing types of errors made by different L1 groups, it was found that speakers of Asian languages, despite their L1 being very different from English, appeared to make fewer content errors. One hypothesis for this is that because the languages are so different, they adopt ‘play-it-safe’ strategies for increasing accuracy.
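The distributional idea in points 4 and 5 can be sketched with simple co-occurrence counts. This is my own toy illustration, not the approach shown in the talk: the corpus, the window size of 2 and the function names are all assumptions.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical); real semantic spaces are built from large corpora.
CORPUS = [
    "great anger filled the room",
    "great joy filled the room",
    "big house near the river",
    "big car near the house",
]


def cooccurrence_vectors(sentences, window=2):
    """Represent each word by counts of the words found within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


vectors = cooccurrence_vectors(CORPUS)
print(cosine(vectors["anger"], vectors["joy"]) > cosine(vectors["anger"], vectors["house"]))
```

In this tiny space ‘anger’ sits closer to ‘joy’ than to ‘house’ because they keep the same company, which is the Firth quote in action.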
You can view the slides from the session here.
An introduction to Learner Experience Design
I (Jo Sayers) then gave a talk on learner experience design and how we might benefit from thinking about the learner while we are here engaging with all this deep scientific and technological input. Here are my main takeaways and questions from the talk:
- Different people wanted to visualise the combination of different aspects of good LX in different ways. Did all the others sit inside UX? Should teaching context be added?
- Different people were more or less comfortable defining themselves as Learner Experience Designers. Maybe we all input to LXD, but are we all leading it?
- Some people suggested that we shouldn’t ask learners and teachers what they want, as they never know. I agree with this: it’s more about asking questions to find out how they feel, what they like and don’t like, what motivates them etc.
- Is LXD making enough of the evidence-based nature of it? Does it feel too wishy-washy?
- Should Teacher Experience Design (TXD) be more of a focus? We acknowledge the importance of teacher welfare in the LXD for the Classroom sessions we run, but should the teacher be even more central?
Write and Improve
Ted Briscoe finished the day by discussing how some of the ML algorithms feed into the Write & Improve product, taking us through the updated UI and some of the new features.
For those of you who don’t know, W&I is a free tool that allows learners to submit texts that they write in response to prompts, then get automated feedback on the text, highlighting errors and in some cases suggesting corrections. The idea is that learners then go back and revise their original text before resubmitting. There is now a teacher dashboard that allows teachers to track their learners’ progress across different responses and submissions.
It was good to see in action a lot of the algorithms that we had discussed over the last couple of days. Based on the Cambridge Learner Corpus, W&I is now using new submission data to feed back into their Language Models to improve their ability to provide automated feedback.
Another day of heavy input. Practical day tomorrow! Again, I’m off to the sunshine for a bit! Please add any comments or updates into the comments.