
This week ELTjam are at the ALTA Machine Learning Summer School in Crete and you can read regular updates of what’s happening here on the blog.
Today, Day 3, Diane Nicholls’s morning workshop gave us an insight into the human element of the Write & Improve (W&I) product, both in terms of the annotation done to the text by human annotators and the insights that teachers can get into their learners’ progress. (You can read the summaries of the other days here: Day 1, Day 2, Day 4, Day 5.)
For those of you who don’t know, W&I is a free tool that allows learners to get instant, automated feedback on their writing, along with suggestions for how to improve the text when they resubmit. There is also a range of teacher tools to support learners.
Seeing W&I in action was a really interesting insight into the inner workings of the product. What follows is a summary of the day and a list of questions it would be great if we could collectively answer! As with the other days, this is just what I understood from the day, and I hope any corrections or clarifications will be made in the comments.
Human Annotation of Errors
Each text submitted to W&I is automatically graded by the system and is then passed to a human marker (called a humannotator) behind the scenes. The corrections that the human makes to the text are never seen by the learner, but are used to feed back into the automated marking system by facilitating the creation of rules. I’m not exactly sure, however, how these rules are then used and how they fit into the system as a whole (see below for questions from the day; in this case, question 1).

Each script submitted to W&I passes through a number of stages:
- A script is submitted and the learner gets an automated score.
- The same text is then marked by a human annotator in the system. This involves the addition of text, the removal of text, or the replacement of text with something different. The marker then adds a CEFR score and submits their corrections. The marking interface is clean and clear and effectively works like the track changes feature in Google Docs or Word.
- These corrections are then checked by Diane Nicholls, who makes changes to anything she sees as an inaccurate correction (see questions 3 and 4).
- Based on the Part of Speech (PoS) tagging and other analysis that the system has done of the learner’s text, and the corrections supplied by the humannotator, the system is then able to classify all the errors (see question 5).
- Rules are then created from an error. For example, if ‘I goed to the beach’ was consistently corrected to ‘I went to the beach’, a rule might be created to say that this change should always be suggested to the learner. To highlight some of the challenges in rule creation, Diane ran a short quiz in which, for every learner error and suggested correction, we had to decide firstly whether the error was always wrong and then, if it was, whether it could only be corrected by the option given in the question. In very few cases is something a) definitely wrong and b) fixable in only one way. (See question 6, and the rough sketch after this list.)
- These rules are then used to feed into the automated error correction and the suggestions made to learners (again, see questions 1 and 2).
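To make the last two steps a little more concrete, here is a minimal sketch of the general idea as I understood it: corrections are aggregated across many scripts, and a candidate rule is only proposed once the same correction has been seen consistently. This is purely illustrative Python of my own; the names, counts and threshold are my assumptions, not W&I’s actual implementation.

```python
# Illustrative sketch only -- not W&I's actual code. Assumes corrections
# arrive as (error phrase, suggested correction) pairs from humannotators.
from collections import Counter

correction_counts = Counter()  # (error, correction) -> times seen

def record_correction(error: str, correction: str) -> None:
    """Log one humannotator correction, e.g. 'I goed' -> 'I went'."""
    correction_counts[(error, correction)] += 1

def candidate_rules(min_count: int = 3):
    """Propose a rule only when the same correction recurs consistently."""
    return [pair for pair, count in correction_counts.items() if count >= min_count]

# Example: the same fix seen in three different scripts becomes a candidate.
for _ in range(3):
    record_correction("I goed", "I went")
print(candidate_rules())  # [('I goed', 'I went')]
```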
Teacher Features

For teachers, W&I allows you to create a class or ‘workbook’ that you can invite learners to via email or a code that they enter through the website. This allows teachers to track the progress of their learners in class groups.
Once in, a teacher can see a huge amount of information about their learners’ writing: all the essays they’ve submitted, the number of resubmissions, the differences between the different attempts, the scores they got for each, and comparative data about the learners in the class and how they are progressing. It’s a really impressive tool for tracking writing progress.
Some teachers have also expressed an interest in marking and annotating the work of their learners directly, and this feature is on its way. This has a dual benefit for W&I: it provides functionality that teachers want, and it also provides correction and annotation data that W&I can use to improve the quality of the automated correction. The idea would be that teachers can annotate the texts, add specific codes for the errors that they see, then assign a grade, add global feedback and present it all back to their learners.
As this feature is still being conceptualised, Diane asked the group what they would want to see in a feature like this. The key points to come from this were:
- In order for this feature to be effective, it should reflect how teachers generally behave and the way they currently mark their learners’ work
- It should be easy to use and intuitive
- There is an opportunity to allow for more global-level feedback at this stage, as all the other feedback to the learner is very grammar- and vocabulary-based. This could be a chance to give feedback on task achievement, cohesion and coherence at text level (question 7)
- It’s important to speak to teachers during the development process, to do focus groups and understand better what teachers need. Effort should be made to ensure this is done in a way that the questioning doesn’t influence the answers!
It was really great to see the W&I features in action, but it was still a bit hard to see how everything we had covered over the last few days actually comes together in practice to produce the feedback the system offers learners (question 8).
Questions
1. How exactly are the rules that come from the humannotation used in the automated correction?
2. Are we close to, or would it ever be desirable to get to, a point where we no longer use human corrections and annotation, and just rely on error correction methods that use native data?
3. If it’s desirable for error correction techniques that use annotated data to have as many annotations of the same data as possible, why does W&I effectively collate the corrections of the original marker and the reviewer into one, rather than treating them as separate annotations?
4. Is there a possibility that learners using W&I are effectively being taught to write in a specific way that the humannotator team and Diane find acceptable? Would that be a problem if so?
5. Is it the case that all error classification is done automatically? What happens with longer-range tagging, such as idioms? I saw no way for these to be added by the original markers.
6. If a truly accurate error correction rule is hard to define, what are the precision and recall rules/thresholds around how these W&I rules are used? In the Day 2 input we saw that rule-based error correction systems are among the least effective and most time-consuming to update. Are the rules created from the human annotation being used in this way, or are there other benefits/uses?
7. Are there any plans to try to automate feedback at wider text or paragraph level, rather than just grammar and vocabulary checking? Is it the case that now in W&I a learner can write any response to any prompt and the score/feedback they get would be the same?
8. It seems that W&I uses a range of different NLP and error correction techniques. How is the level of impact that each technique has on the overall experience for the learner decided? How are those algorithms calculated and tweaked? It would be good to get an understanding of how this works.
9. We saw presentations about very cutting-edge techniques and data around error correction and ML. How closely does the W&I backend follow these developments? Is there a lag?
10. A couple of times, when asked about whether a particular technique or technology was being used in W&I, an expert would say something like “the learners wouldn’t notice a difference from this advancement”. Can this really be true? If it is true, what are the other incentives for the sorts of research being discussed?
After the session on Wednesday, there was a coach trip around the local Chania area. I didn’t attend but hear it was good fun. There are some photos here.
Thanks for the write-up, Jo! A couple of clarifications:
JS: “Each text submitted to W&I is automatically graded by the system and is then passed to a human marker (called a humannotator) behind the scenes.”
DN: The process, crucially, has more stages than this:
1. Student submits first iteration to Write & Improve
2. Student receives automatic system feedback
3. Student acts on feedback and resubmits any number of times, implementing system feedback and changes based on their own studies or teacher feedback
4. Student submits final iteration
5. Final iteration (with all system-feedback-based changes and other learner edits) is passed through for human *error* annotation by an EFL teacher.
6. The final iteration, annotated by a teacher, is passed through to the NLP pipeline, where candidate rules are automatically identified and passed on for validation before being fed into the system to supplement other rules created previously based on the training data (CLC).
What the humans annotate are any errors that the system has not been able to detect and correct using different combinations and incarnations of the algorithms and methods discussed in the preceding sessions (see your excellent Day 1 and Day 2 posts). The aim of human annotation is to help the system detect and offer corrections for errors it is not currently able to detect with the required degree of certainty for Write & Improve (90% or more).
This annotation by teachers happens after the learner’s last submission and can be done anything from a few days to many months later.
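Purely as an illustration of stages 5 and 6 above, and not the actual NLP pipeline: in principle, candidate error/correction pairs could be found by aligning the learner’s final iteration with the teacher-annotated version and treating each edited span as a candidate rule. The sketch below uses Python’s difflib as an assumed stand-in, with invented function names, just to show the shape of the idea.

```python
# Rough illustration only, using difflib as a stand-in for the real pipeline:
# align the learner's final iteration with the teacher-annotated version and
# treat each edited span as a candidate (error -> correction) rule.
from difflib import SequenceMatcher

def candidate_edits(learner_text: str, annotated_text: str):
    src, tgt = learner_text.split(), annotated_text.split()
    edits = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=src, b=tgt).get_opcodes():
        if op == "equal":
            continue  # unchanged spans carry no new correction information
        edits.append((op, " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits

print(candidate_edits("It is a way of live", "It is a way of life"))
# [('replace', 'live', 'life')]
```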
JS: “These corrections are then checked by Diane Nicholls, who makes changes to anything she sees as an inaccurate correction (see questions 3 and 4).”
I don’t make any changes to the annotations made by our EFL teacher annotators. I spot-check each teacher’s annotation and provide them with feedback if I think they are, for example, editing rather than error correcting, or have started to miss common errors through familiarity or understandable lapses of concentration (see my slide on the natural pitfalls of being a human annotator).
I hope this makes the nature of the humannotation and where it fits into the pipeline clearer.
I think maybe your question 3 is based on a misunderstanding that I hope the above clears up (just one version, no collating).
Re: your question 4, isn’t any teacher who highlights and corrects a learner’s errors arguably (perhaps inevitably) teaching their learner to use English the way they think is acceptable? What else can they do? How are our teacher annotators different, apart from the fact that they are trained to avoid the temptation to edit, embellish or try to improve the English, and to restrict themselves to errors only, as far as possible? And remember that our teachers’ annotations are then passed through a number of consistent, automated scientific checks and balances before being cleared to be presented to a learner.
Interesting write-up and questions, Jo.
On Q10, I think the point made in the session was that as the engine is updated on the basis of extracting rules from annotated input, it becomes able to correct errors that are a bit less frequent or harder to identify (although still relatively common, as it is trained on errors that appear more than once in the input). Any one *individual* learner would only notice that improvement if they made a particular error that was not corrected before but now is, and also remembered that they’d made it before – which is not that likely. However, on aggregate, the engine is still performing better, as it becomes more able to cope with knottier errors.
The session was extremely interesting, with a clear explanation of the human annotation part of the process. Hats off to Diane!
Re question 4: “Is there a possibility that learners using W&I are effectively being taught to write in a specific way that the humannotator team and Diane find acceptable? Would that be a problem if so?”
ELT is by nature prescriptive – we language teachers base instruction on assumptions of what is right and wrong for our learners. The approach Diane and her team follow is very learner-centred – it focuses entirely on what learners actually write rather than basing itself on a syllabus of important items from a specific dictionary, grammar reference or other canonical source. This is new and very different. Teachers may feel uncomfortable about it, as W&I cannot simply list its sources, so they may feel left out of pedagogical decision-making and tend towards pushback, as Jo mentioned in his session. Hopefully, this summer school, posts like this and the W&I team’s proactivity in communicating the underlying research will promote understanding of the data-driven nature of the techniques. Managing expectations, understanding and fear will always be part of the process of operationalizing the research and making such an innovative service possible.
Hi, I’m Paul Butcher, co-founder and CTO of English Language iTutoring, the company behind Write & Improve.
Diane has already responded to most of your questions. I wanted to answer question 7: “Is it the case that now in W&I a learner can write any response to any prompt and the score/feedback they get would be the same?”
Right now, the answer to that question is yes. But we will be rolling out “prompt relevance” technology next week, which will ensure that this is no longer the case. A user who writes an off-prompt response will see this reflected both in their overall score and in a separate “prompt relevance” score.
In general, we have an extensive roadmap of additional functionality that we will be rolling out over the coming months and years. Watch this space!
Thanks Jo for these reports, very interesting; I’ll have a go at question 2 – NO : )
ta
mura
For anybody wanting to follow up on what the presenters from ALTA talked about, there are lists of all of their research papers to 2016 on the institute’s page http://www.wiki.cl.cam.ac.uk/rowiki/NaturalLanguage/ALTA
This is regularly updated.
This talk by Ted Briscoe at the elex conference touches on many of the questions raised here by Jo. The questions at the end of the talk seem particularly relevant to the creation of rules vs the perceptron model and how it all ties together. https://youtu.be/AI_jhIDSMtM
Does this help?
JS, question 6: “If a truly accurate error correction rule is hard to define, what are the precision and recall rules/thresholds around how these W&I rules are used?”
I should point out that the examples in my quiz were deliberately chosen for their fiendishness, to demonstrate how difficult it is *for humans* to judge these things and to stimulate discussion. Rule creation *by the system* is much simpler, as it has access to so much information: from the training data, from the W&I data, and from native corpora. A brief description of the process:
Results from analysis of ‘the see’ will show that the training corpus contains 126 positive instances [pro] (i.e., supporting a correction) and 2 negative instances [con] (i.e., not marked up as being incorrect). [val] is the quotient of the two numbers (63 = 126/2). As I mentioned in the workshop, the current threshold for a rule to be applied is 10. So, this rule can safely be passed through for validation.
Here’s an example from the quiz as the system sees it:
[_id] => way of live
[pro] => 19
[con] => 2
[val] => 9.5
[cat] => R:OTHER
[beg] => 2
[end] => 3
[cor] => life
Here you can see that this does not quite qualify to be made a rule.
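Purely for illustration, the pro/con/val check could be expressed as follows. This is a toy reconstruction, not the actual system code, but the numbers are the ones from the examples above.

```python
# Toy reconstruction of the pro/con/val check -- not the actual system code.
RULE_THRESHOLD = 10  # the current threshold mentioned in the workshop

def passes_threshold(pro: int, con: int, threshold: float = RULE_THRESHOLD) -> bool:
    """val is the quotient of supporting (pro) and contradicting (con) counts."""
    val = pro / con if con else float("inf")
    return val >= threshold

print(passes_threshold(126, 2))  # 'the see':     val = 63.0 -> True (passed for validation)
print(passes_threshold(19, 2))   # 'way of live': val = 9.5  -> False (does not qualify)
```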
I hope this helps.