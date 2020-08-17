England's Exam Fiasco Shows How Not To Apply Algorithms To Complex Problems With Massive Social Impact
from the let-that-be-a-lesson-to-you-all dept
The disruption caused by COVID-19 has touched most aspects of daily life. Education is obviously no exception, as the heated debates about whether students should return to school demonstrate. But another tricky issue is how school exams should be conducted. Back in May, Techdirt wrote about one approach: online testing, which brings with it its own challenges. Where online testing is not an option, other ways of evaluating students at key points in their educational career need to be found. In the UK, the key test is the GCE Advanced level, or A-level for short, taken in the year when students turn 18. Its grades are crucially important because they form the basis on which most university places are awarded in the UK.
Since it was not possible to hold the exams as usual, and online testing was not an option either, the body responsible for running exams in the UK, Ofqual, turned to technology. It came up with an algorithm that could be used to predict a student's grades. The results of this high-tech approach have just been announced in England (other parts of the UK run their exams independently). It has not gone well. Large numbers of students have had their expected grades, as predicted by their teachers, downgraded, sometimes substantially. An analysis from one of the main UK educational associations has found that the downgrading is systematic: "the grades awarded to students this year were lower in all 41 subjects than they were for the average of the previous three years."
Even worse, the downgrading turns out to have hit students in poorly performing schools, typically in socially deprived areas, the hardest, while schools that have historically done well, often in affluent areas or privately funded, saw their students' grades improve over teachers' predictions. In other words, the algorithm perpetuates inequality, making it harder for brilliant students in poor schools or from deprived backgrounds to go to top universities. A detailed mathematical analysis by Tom SF Haines explains how this fiasco came about:
Let's start with the model used by Ofqual to predict grades (p85 onwards of their 319 page report). Each school submits a list of their students from worst student to best student (it included teacher suggested grades, but they threw those away for larger cohorts). Ofqual then takes the distribution of grades from the previous year, applies a little magic to update them for 2020, and just assigns the students to the grades in rank order. If Ofqual predicts that 40% of the school is getting an A [the top grade] then that's exactly what happens, irrespective of what the teachers thought they were going to get. If Ofqual predicts that 3 students are going to get a U [the bottom grade] then you better hope you're not one of the three lowest rated students.
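To make the mechanism concrete, here is a minimal sketch of the rank-order allocation Haines describes. It is not Ofqual's implementation: the function name, the grade list and the example shares are all hypothetical, the "little magic" used to update the historical distribution is not modelled, and neither is the separate treatment of small cohorts. It assumes the ranking is supplied best student first.

```python
from typing import Dict, List

# Hypothetical grade scale, best to worst (an assumption for this sketch).
GRADES = ["A*", "A", "B", "C", "D", "E", "U"]


def assign_grades_by_rank(ranked_students: List[str],
                          grade_shares: Dict[str, float]) -> Dict[str, str]:
    """Hand out grades purely by rank position.

    ranked_students: the school's ranking, best student first.
    grade_shares: predicted fraction of the cohort for each grade
                  (missing grades count as 0; shares should sum to 1).
    Teacher-predicted grades are deliberately not an input.
    """
    n = len(ranked_students)
    result: Dict[str, str] = {}
    cumulative = 0.0
    start = 0
    for grade in GRADES:
        cumulative += grade_shares.get(grade, 0.0)
        # Position in the ranking where this grade band ends.
        end = round(cumulative * n)
        for student in ranked_students[start:end]:
            result[student] = grade
        start = max(start, end)
    # Anyone left over because of rounding falls into the lowest grade.
    for student in ranked_students[start:]:
        result[student] = GRADES[-1]
    return result


# Hypothetical cohort of ten students and a hypothetical predicted distribution.
cohort = [f"student_{i}" for i in range(1, 11)]
shares = {"A": 0.4, "B": 0.3, "U": 0.3}
allocation = assign_grades_by_rank(cohort, shares)
for name in cohort:
    print(name, allocation[name])
```

With these made-up numbers the top four students receive an A, the next three a B, and the bottom three a U. Nothing an individual student did during the year enters the calculation; only their rank and the school's predicted distribution matter.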
As this makes clear, the inflexibility of the approach guarantees that there will be many cases of injustice, where bright and hard-working students will be given poor grades simply because they were lower down in the class ranking, or because the school did badly the previous year. Twitter and UK newspapers are currently full of stories of young people whose hopes have been dashed by this effect, as they have now lost the places they had been offered at university, because of these poorer-than-expected grades. The problem is so serious, and the anger expressed by parents of all political affiliations so palpable, that the UK government has been forced to scrap Ofqual's algorithmic approach completely, and will now use the teachers' predicted grades in England. Exactly the same happened in Scotland, which also applied a flawed algorithm, and caused similarly huge anguish to thousands of students, before dropping the idea.
The idea of writing algorithms to solve this complex problem is not necessarily wrong. Other solutions -- like using grades predicted by teachers -- have their own issues, including bias and grade inflation. The problems in England arose because people did not think through the real-life consequences for individual students of the algorithm's abstract rules -- even though they were warned of the model's flaws. Haines offers some useful, practical advice on how it should have been done:
The problem is with management: they should have asked for help. Faced with a problem this complex and this important they needed to bring in external checkers. They needed to publish the approach months ago, so it could be widely read and mistakes found. While the fact they published the algorithm at all is to be commended (if possibly a legal requirement due to the GDPR right to an explanation), they didn't go anywhere near far enough. Publishing their implementations of the models used would have allowed even greater scrutiny, including bug hunting.
As Haines points out, last year the UK's Alan Turing Institute published an excellent guide to implementing and using AI ethically and safely (pdf). At its heart lie the FAST Track Principles: fairness, accountability, sustainability and transparency. The fact that Ofqual evidently didn't think to apply them to its exam algorithm means it only gets a U grade for its work on this problem. Must try harder.
Follow me @glynmoody on Twitter, Diaspora, or Mastodon.
Filed Under: algorithms, education, exams, grades, predictions, predictive algorithms, protests, testing
Reader Comments
The bell curve claims another victim
One of the strangest assumptions in statistics is that results are distributed in a bell curve. It’s doomed many a racist trying to prove the stupidity of blacks, and today it dooms those trying to justify classist policies.
Re: The bell curve claims another victim
always click preview....
It’s one of the strangest assumptions in academic statistics
Exam regulator rejected expert help
The Royal Statistical Society offered assistance and nominated two professors of statistics to help design a decent algorithm. But the professors withdrew when asked to sign a non-disclosure agreement which would have constrained them for five years.
"We get the point of non-disclosure agreements: you don't want someone offering a running commentary while decisions are being made," said Sharon Witherspoon of the Royal Statistical Society. "But constraining independent academic experts from saying, 'Well, looking at the data, I saw it was clear this would have this effect,' didn't fit our principles of transparency."
https://news.sky.com/story/a-levels-exam-regulator-ignored-expert-help-after-statisticians-wouldnt-sign-non-disclosure-agreements-12049289
Re: Exam regulator rejected expert help
"We get the point of non-disclosure agreements: you don't want someone offering a running commentary while decisions are being made,"
Well, sometimes you do. It seems this project failed at the design phase, so you probably should be consulting with people who are able to spot fundamental design issues before you go live. Especially with something this important. A-Levels are stressful at the best of times, I can't imagine what it must be like for students who've worked their asses off to improve grades enough to go to their preferred university during a pandemic, only for an algorithm to tell them they worked for nothing.
"But constraining independent academic experts from saying, 'Well, looking at the data, I saw it was clear this would have this effect,' didn't fit our principles of transparency."
In other words, an open source methodology was required but they decided to create a proprietary solution.
Re: Re: Exam regulator rejected expert help
"It seems this project failed at the design phase, so you probably should be consulting with people who are able to spot fundamental design issues before you go live."
There's an extra helping of irony in that the people unable to seek expert knowledge from outside are also the people who in this case are tasked with teaching those who will be tomorrow's scientists.
"In other words, an open source methodology was required but they decided to create a proprietary solution."
I can somehow envision a table of stodgy deans straight out of Oxford going "Let the rabble - probably not even upper seconds - scrutinize and comment on our teaching methods? The Nerve!"
Re: Exam regulator rejected expert help
Which is why so many management led disasters occur, as that greatly reduces the chance of problems being spotted, or mistaken assumptions being corrected by someone who understands what the managers are trying to do.
Well, nothing like a pandemic to bring old news to life.
...because apparently, after half a century's worth of universal complaints and tons of data assembled and calculated about the state of education, it appears we have come to the same conclusion again: "Get more teachers!"
It bothers me to no end that, no matter the country, the education that is supposed to provide the foundation of the future always gets the first taste of the budget axe.
Or, as can be observed in the OP, the evaluation of the state of a youth's education gets handed to a complex template deployed by a computer algorithm. And to really guarantee that fail, they use a model known to be flawed and keep the project in a closed workshop with no real expert audit.
What could possibly go wrong?
A view from the inside
Full disclosure: I work for Ofqual. I had no involvement in the development of the algorithm, but obviously know those who did. These are my personal views, not any sort of official response.
A few thoughts:
I can say with absolute certainty that every decision we took was a genuine effort to maximise those twin aspects of fairness. And that there simply was no solution that was perfectly fair to all past, current and future students.
That's not an excuse. I agree that some of the outcomes here weren't right. But I think it's important to understand what we were trying to achieve.
I don't think it's accurate to suggest that students in affluent areas saw their grades systematically improve over what their teachers predicted. Very few students were "upgraded" by the algorithm.
And that lack of consistency in judgement means relying solely on teacher predictions is inevitably unfair on those students whose teachers happened to be at the less-optimistic end of the scale.
It also means it's not entirely meaningful/fair to measure the success/failure of the algorithm in terms of how students' outcomes differed from their teacher's predictions. In at least some cases, those differences will reflect error in teacher predictions more than any error on the part of the algorithm.
Should we have released the model earlier to allow for more external scrutiny? With the benefit of hindsight, certainly. But one of the things we had to worry about was the possibility that releasing it could affect how some (less scrupulous) schools went about making their predictions.
Another difficulty with a fully open-source model here is the nature of the data: highly sensitive personal information about every young person in the country. Not something we can legally make available to the public.
So, yes, it's predictable that an algorithm designed to replicate the results of those qualifications in an extreme situation ended up reflecting those inequalities back. But could it ever have done anything else?
Re: A view from the inside
Thanks for providing that interesting context.
Re: A view from the inside
Hi, thanks for your insightful post.
A view from the Netherlands: We've had our problems with the final exams too. Luckily we start the examination process early and the "school exams" that usually determine 50% of the exam results were (almost) done when schools closed. After some deliberation it was decided to cancel the "central written final exam" and grant students their diploma based on the results of the "school exams".
On the other hand, I understand the demand for consistent exam results over time, that's also one of the Dutch concerns. However, the quality of education has improved over the decades and especially students from less privileged backgrounds do better than they used to do, which caused the average level in exam candidates to rise. AFAIK grading in Dutch exams allows for this gradual rise in level. Does Ofqual acknowledge such a rise and how does it integrate that in its standards?
[ reply to this | link to this | view in chronology ]
It was a fiasco by design. For twenty years now kids getting better exam results each year has been a major bugbear of the right wing press in the UK, more kids getting into better universities on merit (Oxbridge excluded) meaning those from wealthier backgrounds actually have to work and compete for places. Making university more expensive hasn't worked as well as they'd hoped, so when the pandemic rolled around, they saw a chance to kneecap poorer students and have a convenient tech scapegoat for doing so... But didn't expect the pushback to be nearly as universal.
[ reply to this | link to this | view in chronology ]
Well, teachers' predictions from every school should be made public along with last year's exam results.
It is impossible for all students in a school to get top grades; every school has lazy students, good students, great students and average students, i.e. spread along the curve.
Also, poor students may not have laptops or broadband or a quiet room to study in. Not many poor students get to Harvard.
It's like open source security: any exam grading system must be open to public scrutiny.
Asking for nondisclosure agreements about a public exam grading system is pointless and unfair.
Like the law, it must be fair and seen to be fair.
But the UK government is prone to secrecy even if it's not needed.