England's Exam Fiasco Shows How Not To Apply Algorithms To Complex Problems With Massive Social Impact
from the let-that-be-a-lesson-to-you-all dept
The disruption caused by COVID-19 has touched most aspects of daily life. Education is obviously no exception, as the heated debates about whether students should return to school demonstrate. But another tricky issue is how school exams should be conducted. Back in May, Techdirt wrote about one approach: online testing, which brings with it its own challenges. Where online testing is not an option, other ways of evaluating students at key points in their educational career need to be found. In the UK, the key test is the GCE Advanced level, or A-level for short, taken in the year when students turn 18. Its grades are crucially important because they form the basis on which most university places are awarded in the UK.
Since it was not possible to hold the exams as usual, and online testing was not an option either, the body responsible for running exams in the UK, Ofqual, turned to technology. It came up with an algorithm that could be used to predict a student’s grades. The results of this high-tech approach have just been announced in England (other parts of the UK run their exams independently). It has not gone well. Large numbers of students have had their expected grades, as predicted by their teachers, downgraded, sometimes substantially. An analysis from one of the main UK educational associations has found that the downgrading is systematic: “the grades awarded to students this year were lower in all 41 subjects than they were for the average of the previous three years.”
Even worse, the downgrading turns out to have affected students in poorly performing schools, typically in socially deprived areas, the most, while schools that have historically done well, often in affluent areas, or privately funded, saw their students’ grades improve over teachers’ predictions. In other words, the algorithm perpetuates inequality, making it harder for brilliant students in poor schools or from deprived backgrounds to go to top universities. A detailed mathematical analysis by Tom SF Haines explains how this fiasco came about:
Let’s start with the model used by Ofqual to predict grades (p85 onwards of their 319 page report). Each school submits a list of their students from worst student to best student (it included teacher suggested grades, but they threw those away for larger cohorts). Ofqual then takes the distribution of grades from the previous year, applies a little magic to update them for 2020, and just assigns the students to the grades in rank order. If Ofqual predicts that 40% of the school is getting an A [the top grade] then that’s exactly what happens, irrespective of what the teachers thought they were going to get. If Ofqual predicts that 3 students are going to get a U [the bottom grade] then you better hope you’re not one of the three lowest rated students.
As this makes clear, the inflexibility of the approach guarantees that there will be many cases of injustice, where bright and hard-working students will be given poor grades simply because they were lower down in the class ranking, or because the school did badly the previous year. Twitter and UK newspapers are currently full of stories of young people whose hopes have been dashed by this effect, as they have now lost the places they had been offered at university, because of these poorer-than-expected grades. The problem is so serious, and the anger expressed by parents of all political affiliations so palpable, that the UK government has been forced to scrap Ofqual’s algorithmic approach completely, and will now use the teachers’ predicted grades in England. Exactly the same happened in Scotland, which also applied a flawed algorithm, and caused similarly huge anguish to thousands of students, before dropping the idea.
The idea of writing algorithms to solve this complex problem is not necessarily wrong. Other solutions — like using grades predicted by teachers — have their own issues, including bias and grade inflation. The problems in England arose because people did not think through the real-life consequences for individual students of the algorithm’s abstract rules — even though they were warned of the model’s flaws. Haines offers some useful, practical advice on how it should have been done:
The problem is with management: they should have asked for help. Faced with a problem this complex and this important they needed to bring in external checkers. They needed to publish the approach months ago, so it could be widely read and mistakes found. While the fact they published the algorithm at all is to be commended (if possibly a legal requirement due to the GDPR right to an explanation), they didn’t go anywhere near far enough. Publishing their implementations of the models used would have allowed even greater scrutiny, including bug hunting.
As Haines points out, last year the UK’s Alan Turing Institute published an excellent guide to implementing and using AI ethically and safely (pdf). At its heart lie the FAST Track Principles: fairness, accountability, sustainability and transparency. The fact that Ofqual evidently didn’t think to apply them to its exam algorithm means its only gets a U grade for its work on this problem. Must try harder.