Deconstructing the Duolingo English Test (DET)

Duolingo is a great language learning tool. It can introduce you to the basics of a number of different languages through a fun, game-like app in which grammar and vocabulary are built up and reinforced through translation practice. I can thank Duolingo for my basic ability in Spanish and my deeper understanding of Polish grammar. The Duolingo English Test (DET), on the other hand, is absolutely terrible. Last year, it generated some buzz on the internet as a kind of TOEFL/IELTS killer, a serious competitor to the big tests – one which was affordable (at $30) and accurate. I had a chance to take the test last week to see if it could serve our institute and students. I went into it very excited and came away with a very bad taste in my mouth.

[Image: Duolingo Test score]
I’m an expert in English. Thanks for telling me, Duolingo!

Safety first

One thing I did like – probably the only thing – was that the test was pretty secure. There were a number of safeguards to ensure test takers do not cheat. You have to show ID to the camera, cannot wear headphones, must be the only person in a well-lit room, and must stay visible on camera the entire time. It would be pretty difficult to have someone assisting you or helping you cheat. So, as far as security goes, it’s pretty solid.

QUESTION TYPES

For the test that I took, I encountered 5 question types.

1. Gap-fill Reading Passage

You get a reading passage from what seems like an English literature source. There are gaps throughout the passage and each gap has a dropdown box with the same word choices.

The reading passages were actually quite difficult, and, without further context, some just didn’t make sense. The fact that they seemed to be from English literature from the late 19th or early 20th century is alarming, as they contained a good bit of verbose prose that most ELLs will probably not have encountered. This definitely throws off their ability to choose the correct grammatical form of the word choices.

One passage read like the sentence below. The underlined words are completed gaps. Can you fill in the others? (Note: I have changed the words to avoid any copyright issues but tried to keep the syntax of the original):

After I came in with the vodka, they __________ already arranged on both sides of the General’s dinner-table — Big Bear next to the window and sitting backwards so as to have one eye on his companion and one, as I __________, on his exit.

Choices: had, have, came, come, comes, seated, sitting, seats, think, thought, was, were, did, do

It’s not overly difficult, and it’s not the passage that stumped me (there was one that did). However, it struck me as both out of place and even a bit inappropriate with the mention of alcohol. The prose is so strange that even if a student can get most of the words correct, will they have any idea of the meaning? I doubt it, and Duolingo doesn’t even check. That’s right. There is no reading comprehension other than the ability to follow simple instructions.

To be fair, the DET is an adaptive test, meaning the questions increase in difficulty the more a test-taker gets correct, so it is possible that those who miss more questions will get passages that make some sense and are a bit easier to complete. I took the test again and purposely got most questions wrong. The first gap-fill came after English Word Selection questions, Listen and Write, and Listen and Speak. The text that I got was well above the level of someone who got everything else wrong. I purposely chose wrong answers for everything and did not see another gap-fill during the test. I took it once more, and this time I tried to get most answers correct and a few wrong. I saw two gap-fill prompts, both equally difficult. They included words such as rattled, stirred, and save (the preposition form), among other words students likely had not encountered before. The second one seemed longer, so maybe it is adaptive in terms of length, but certainly not difficulty.

2. English Word Selection

You will see a number of real English words (such as resentment, vanquish) and words that look like English but are not (such as commemoral, executrive).

On the surface, this seems like a decent way to check one’s vocabulary. That is, until you realize that the real English words such as resentment, vanquish, pub, or floppy are probably not on any high-frequency word lists (I checked the GSL and AWL) and are probably not encountered much in English study, day-to-day use, or even ESP situations (aside from pub!).

So, what is Duolingo testing? Judging by my test results labeling me an “Expert in English,” I suppose it’s a check to see whether or not I am an L1 user of a standard variety of English, or the extent to which I approximate that. I got 99%, so am I 99% a native speaker? How does that correspond to a TOEFL score or CEFR level? Those scores and levels are not without their issues; however, at least they offer language ability descriptions. The DET does offer some explanation of my score:

[Image: Duolingo score explanation]

Like the Gap-fill questions, no comprehension of the words was actually required. How this test knows I can understand something without testing this is beyond me. What’s this about relevant information, or even scanning? This did not come up once during the test. “Finer shades of meaning”? Um, OK…

3. Listen and Write

For this section, you hear a sentence (up to three times) and must write it verbatim. The sentences seemed to get longer as the test went on. I did not think the vocabulary or grammar of these sentences was very difficult. I feel they did an OK job testing listening skills, though for the longer sentences, they also seemed to test whether the test taker had a decent phonological loop / working memory, as it took a bit of mental rehearsal to remember and write the dictation. Something to note here is more evidence of a major pattern flaw: no comprehension of the sentences is required.

4. Read and Speak

Here, you had to read a sentence aloud. Like the Listen and Write section, the sentences were pretty mundane, although one of the sentences did contain the word “crap,” which surprised me. It seemed out of place for a test that wants to be taken seriously. This section seems to be assessing your pronunciation and ability to read aloud as, again, no comprehension is required.

5. Oral-interview Type Questions

Finally, there were several questions that required an oral response either based on a spoken question or a picture description prompt. I was asked about a person I thought was adventurous, and I was asked to describe a picture of a woman waiting for a subway train. While answering the first question, I was surprised when the test interrupted me during a short pause to ask me a follow-up question. Duolingo is clearly trying to emulate a more authentic language use situation, but it comes off as robotic and jarring – Duolingo had interrupted me even before I had finished my thought, the very reason I paused. The questions were very simplistic, and while an experienced language teacher could make a decent, holistic assessment about a student from these prompts, it’s not enough to base a whole test on, especially when you are trying to compete with the big boys.

“Scientifically Proven”

Throughout the Duolingo English Test FAQ, you can find many references to this test being “scientifically proven”:

  • “The Duolingo English Test provides scientifically-proven language certification.”
  • “The Duolingo English Test is scientifically designed to provide a precise and accurate assessment of real world language ability.”
  • “The Duolingo English Test provides scientifically-proven language certification.” (at the bottom of my certificate)

The TOEFL, IELTS, and other major language tests have gone through years of development, testing, and research and still make no bold claims about being “scientifically proven”. So, what does Duolingo really mean? Label anything with “science” and it seems more believable, but if you read carefully, the claim is that the certification is “scientifically-proven,” meaning that the certificate comes from a scientifically designed test, and by science, I think they mean through their impressive ability to design an adaptive test via computer science.

Or, perhaps they mean that there has been a quantitative study (not peer-reviewed). From the FAQ:

Duolingo English Test scores are significantly correlated with the TOEFL iBT (a standardized English test). Read the validity study here for a comparison of Duolingo English Test scores with scales from other common language tests.

In this thirteen-page (!) scientific article, we see that DET scores are pretty well correlated with TOEFL iBT test scores (though the correlation is weaker for individual subskills). On page 11, the authors state: “Scores from the Duolingo English test were found to be substantially correlated with the TOEFL iBT total scores, and moderately correlated with the individual TOEFL iBT section scores, which present strong criterion-related evidence for validity.”

I am by no means a psychometrics expert, but wouldn’t it make more sense to be looking at construct or content validity as opposed to criterion validity? Criterion validity is predictive in nature, concerned with how well a certain measure or test is related to a predicted outcome. In this paper, they looked at how well DET scores predicted TOEFL scores. I think the fact that there is strong correlation is interesting. How can a test with no measure of language comprehension, written expression, or academic language use be as valid as the TOEFL iBT, which contains all three of these constructs? Has Duolingo found the perfect questions that dig deep within a language user to pull out their capacity to, I don’t know, summarize and compare a lecture and reading, and do this only based on their ability to select real English words among some lookalikes? That’s good science!
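For readers who have not worked with correlation studies, here is a minimal Python sketch of what a criterion-style comparison boils down to. The score lists are invented purely for illustration and are not taken from the Duolingo study, and `pearson_r` is just a hypothetical helper name. I am assuming a plain Pearson correlation here; the study may well have used a different procedure.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six test takers (invented, NOT real study data)
det_scores = [35, 48, 55, 62, 70, 88]
toefl_scores = [42, 61, 70, 79, 85, 108]

r = pearson_r(det_scores, toefl_scores)
print(f"r = {r:.2f}")
```

A value of r close to 1 only says that the two tests rank the same people in roughly the same order. It says nothing about what skills either test actually measures, which is why a correlation on its own cannot settle the construct validity question.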

Another non-peer-reviewed, Duolingo-commissioned study compared Duolingo test scores to faculty assessments of incoming freshman international students. Another high correlation was found, and the authors recommend the DET as a placement test. That was a red flag to me. I took this test with the idea in mind that it could serve as a placement test for my program. After taking the test, I would in no way recommend it.

I think that doing some sort of construct validity test to check whether their questions measure what they say they measure is more warranted than a correlation study. There is very little published about the Duolingo test other than Duolingo-issued research. However, I did find a critical review published in a refereed journal that essentially found all the same issues I did and sums up the test as follows:

In summary, at the time of writing this critique, the DET seems woefully inadequate as a measure of a test taker’s academic English proficiency or for high-stakes university admissions purposes….The test seems to be a case of “the tail wagging the dog,” in that the DET’s reliance on short, computer-scored test tasks has resulted in a test that does not assess the test takers’ communicative competence. Indeed, the test tasks that are used hearken back to the 1950s, when audiolingualism was the dominant theory in language learning. (Wagner & Kunnan, 2015)

Ouch. Sorry Duolingo. You might make a fun, somewhat effective language learning tool, but when it comes to language testing, your owl needs to take off its graduation cap and put its tracksuit back on. TOEFL and IELTS are nearing the finish line, but you are still just warming up.

References

Wagner, E. & Kunnan, A. J. (2015). The Duolingo English test. Language Assessment Quarterly, 12(3), 320-331. DOI: 10.1080/15434303.2015.1061530.

Anthony Schmidt is an instructor at the University of Tennessee, Knoxville’s English Language Institute. He is editor of www.eltresearchbites.com, a website that shares ELT research. He also blogs at www.anthonyteacher.com. He has experience teaching in Korea, Japan, and the US. Currently, his teaching and research interests are related to academic literacy and writing.

This post first appeared on Anthony’s blog anthonyteacher.com

9 thoughts on “Deconstructing the Duolingo English Test (DET)”

  1. The Gap-filling Reading Passages remind me very strongly of unfiltered corpus text. I’m afraid some rather big cheese dictionary websites have gone for a similar approach and the results are equally unsuitable, sprinkled with unsuitable vocab or inappropriate situations, or just plain opaque/jargon-filled language.
    This is not what computer-assisted learning is supposed to be about. The underpinnings (unedited/unfiltered/not-chosen-by-human) should not be so obvious!

  2. I know the field and don’t know how and why in this day and age anyone would trust a test that is wholly without human interaction? That’s really the rub and case in point. This test is really just another Oxford placement test printout done online. And they charge 30 bucks.

    We’ve developed at EnglishCentral, a wholly free level test for all students. Secure. Teacher gets access to the report and recordings. Done with a qualified teacher, synchronously online. Check it out. Just register and take it. Would appreciate your feedback. We continue to improve the test based on the volume of testing that takes place. It’s a process but I believe that the human to human communication and assessment that takes place with that as the focus is key in this day and age where English is used as a means of listening and speaking communication.

  3. The outdated and possibly offensive content in the gap-filling reading passage raises questions about Duolingo’s editorial process for these tests. Any editor worth their salt would immediately recognize that a reference to vodka would disqualify this passage for use in several major ELT markets. I’m guessing that the folks at Duolingo either don’t know this, don’t care, or are only vaguely aware of the material in their own tests. Duolingo are obviously trying to muscle in on the lucrative test market, but it seems highly unlikely that any large organizations or educational establishments would see this as an adequate replacement for the likes of TOEFL. That said, the tests will probably gain some traction within the Duolingo community of learners (assuming such a thing exists).

  4. I came here after reading a reference about Duolingo Placement Test in a myTESOL Lounge Digest. This article is a must-read for anyone who is looking for a good review of the same. Thanks for taking the test in many ways and reporting back to us your thoughts.

  5. Totally disagree with this professor. The test is incredibly accurate. He may not like that it is so good without relying on the kinds of traditional methods used by TOEFL and IELTS; however, the DET is akin to a test for heart problems that doesn’t require open-heart surgery. We have used it for two years to gauge whether our students are ready to take TOEFL and it’s been spot on. Plus, college admissions officers advised us they prefer this test because they can watch the students’ interview portions and get a real sense of the student.

  6. I have a student from China who has taken the TOEFL twice and scored a 67 each time, then took the DET and also scored a 67! But the DET 67 correlates to a TOEFL 96! This student’s English fluency is weak, so I am perplexed as to how his score could be that high? It does not give me much confidence in the DET.

  7. One thing is clear: the Duolingo test is changing the other tests. For example, the TOEFL decreased its total test time. I think the “major” tests will change more, knowing that Duolingo is available and costs quite a bit less than the majors$$$. This is a big issue when it comes to making education more and more accessible. About the Duolingo test happening without a human…. the humans will be there in the universities! (I’m a student, about to take the Duolingo test and preparing for the TOEFL, and sometimes using Google Translate)

  8. I haven’t taken the test yet, but plan to do it soon. It would be interesting if you retake the test again, maybe they have updated the algorithms and the test is better.
