
Can algorithms ever make the grade?

The failure of the A-level algorithm highlights the need for a more transparent, accountable and inclusive process in the deployment of algorithms.

Elliot Jones, Cansu Safak

18 August 2020



Given that Ofqual is in the purported business of prediction, it is natural to assume it could have anticipated the level of public outrage that would ensue from the decision to use an algorithm to moderate unstandardised A-level, BTEC and GCSE teacher assessments.

And, in principle, the use of a standardisation algorithm could have been an effective way to solve a difficult problem at a massive scale. With months of teaching and exams cancelled because of a global pandemic, and within a system where exams are deemed the ultimate assessment of ability (for better or worse), there were no good options.

Awarding students’ grades based on teacher assessment was originally rejected by Ofqual on the grounds of unfairness between schools, incomparability across generations and devaluing of results because of grade inflation.1 The fairer option, Ofqual surmised, was to combine previous attainment data and teacher assessment to assign grades, using a particular statistical model – an ‘algorithm’.

The algorithmic system deployed by Ofqual was not particularly advanced or novel. And the education regulator did not deploy it in the dark – a public consultation was held in April, and the decision to use standardisation models was shared in June.

Yet the choices that were made delivered results that were seen as unfair, unjust and untrustworthy, and have resulted in protests by hundreds of students, lobbying by parents, backlash across the media, the threat of legal action and ultimately a monumental backpedal by Ofqual.

Being ‘transparent’, by telling students their grades were determined by an algorithm, did nothing to resolve those issues.

Hindsight delivers 20:20 vision, as we know. But Ofqual should have been aware that it was deploying an algorithm against a backdrop of existing public scepticism towards algorithmic systems, and an environment of tenuous trust in Government data use.

It needed not only to meet, but to exceed existing standards for transparency and accountability, to avoid doing indelible harm to public confidence in data-driven decision making.

Building trust and consensus in the goals of the algorithm

Following the decision to use an algorithm, another decision is needed: what it is optimising for – the ‘goal’. The goal Ofqual seemingly chose was ‘maintaining standards’ across schools, by fitting current students to distributions generated from each school’s historical results, using teacher rankings.2 This model prioritises avoiding grade inflation, getting the ‘right’ school-level results and maintaining the distribution shape over the fairness and accuracy of individual results.
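
To make the mechanics concrete, below is a minimal sketch of this kind of distribution-matching standardisation. It is not Ofqual’s actual model, which was more elaborate (incorporating prior attainment data, among other things); the grade scale, the toy historical distribution and the student names are all hypothetical.

```python
# A minimal, illustrative sketch of distribution-matching standardisation.
# NOT Ofqual's actual model: the grade scale, historical distribution and
# cohort below are hypothetical.

GRADES = ["A*", "A", "B", "C", "D", "E", "U"]  # best to worst

def standardise(ranked_students, historical_distribution):
    """Assign grades to a teacher-ranked cohort so that its grade
    distribution matches the school's historical distribution.

    ranked_students: student names, best first (the teacher's rank order).
    historical_distribution: dict mapping grade -> share of past students.
    """
    n = len(ranked_students)
    # Cumulative grade boundaries derived from the historical shares.
    boundaries, cumulative = [], 0.0
    for grade in GRADES:
        cumulative += historical_distribution.get(grade, 0.0)
        boundaries.append((cumulative, grade))

    results = {}
    for i, student in enumerate(ranked_students):
        position = (i + 0.5) / n  # percentile position within the cohort
        for cutoff, grade in boundaries:
            if position <= cutoff:
                results[student] = grade
                break
        else:
            # Guard against float rounding at the top of the range.
            results[student] = boundaries[-1][1]
    return results

# Hypothetical school: historically 10% A, 30% B, 40% C, 20% D.
history = {"A": 0.10, "B": 0.30, "C": 0.40, "D": 0.20}
cohort = ["Asha", "Ben", "Chloe", "Dan", "Ella"]  # teacher's rank order
print(standardise(cohort, history))
# {'Asha': 'A', 'Ben': 'B', 'Chloe': 'C', 'Dan': 'C', 'Ella': 'D'}
```

Note that individual attainment plays no role here: however strong this year’s cohort, its grades are forced into the shape of previous years’ results, which is precisely the trade-off described above.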

The model also disproportionately downgraded high-performing cohorts from historically low-performing schools: from A* students hoping to be the first from their school to reach Oxbridge, to pupils managing Bs and Cs in classes that had always achieved straight Ds. From the public’s response, it is now clear that this is not what people wanted from such a system.

It would be easy to blame Ofqual here, but the choice of goals – in algorithms as in policy more broadly – is a political question, not just a technical one. Professor Jo-Anne Baird, a member of Ofqual’s Standing Advisory Group, has publicly stated that Ofqual was specifically directed to deliver exam results that controlled grade inflation, and within those parameters, this algorithm is the best you can get.3

So why did the public have to wait until results day in August to find out that this was the goal?

As our recent report Confidence in a crisis? demonstrates, being transparent about approaches and data practices upfront is crucial to building trust, especially in pressurised circumstances, both among the public and among those providing the data (in this case, the teachers).4 Indeed, Ofqual was urged by the Education Select Committee in July to be fully transparent about its standardisation model.5

Instead, Ofqual only began to offer more information about the model once the deadline had passed for centres to submit their ‘centre assessment grades’ and student ranking information, presumably to prevent teachers from tailoring their assessments to the model. Further, Ofqual deliberately did not release the precise model until results day itself, because ‘It is an important principle that everyone finds out their results at the same time on results day’ – a principle that apparently outweighed public scrutiny.6

In this case, transparency would just be the start of a trustworthy process. Once the goals were identified, and in common with any other policy, they would need to go through democratic scrutiny, debate and accountability before being implemented.

Ensuring the efficacy and accuracy of algorithmic systems

One of the key parts of building trust in algorithmic systems is ensuring they provide accurate results. Ofqual’s model seemingly failed here, though perhaps unavoidably.

Testing the model against 2019 data, Ofqual found it had about 60% predictive accuracy on average across A-level subjects, meaning it expected that 40% of grades would have been different had exams been sat.7 This seems unacceptably low.

But evaluation shows this level of accuracy is broadly comparable to the probability that an examiner marking a student’s exam paper awards the same grade as a senior examiner would.8 So for most students the model is no more variable than traditional examination marking, and interrogating the level of accuracy exposes the underlying uncertainty in assessing students at a single point in time.
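
To make the 60% figure concrete, the sketch below shows the shape of such a back-test: compare the grades the model would have assigned to the 2019 cohort against the grades those students actually achieved in their exams. The grade lists here are invented for illustration.

```python
# Illustrative back-test of exact-grade predictive accuracy.
# The grade lists are made up; Ofqual's real evaluation used the
# full 2019 cohort and its actual exam results.

predicted = ["A", "B", "B", "C", "A", "D", "C", "B", "C", "E"]
actual    = ["A", "B", "C", "C", "B", "D", "C", "A", "C", "E"]

matches = sum(p == a for p, a in zip(predicted, actual))
accuracy = matches / len(actual)
print(f"Exact-grade accuracy: {accuracy:.0%}")  # 70% in this toy example

# At 60% accuracy, roughly 4 in 10 students would have received a
# different grade had they sat the exam - the figure Ofqual reported.
```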

Ofqual also acknowledged that it would be ‘indefensible to statistically standardise when the number of students is very small’ due to insufficient previous data, and instead awarded those students their teacher-assessed grades. This sounds defensible, but in practice it meant better grades for private-school students, who tend to have smaller classes.9
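
A minimal sketch of that carve-out logic, assuming a hypothetical size threshold (Ofqual’s actual rules, including how mid-sized cohorts were treated, are not reproduced here):

```python
# Illustrative small-cohort carve-out. The threshold is hypothetical,
# not Ofqual's actual cutoff.

SMALL_COHORT_THRESHOLD = 5

def award_grade(cohort_size: int, teacher_grade: str,
                standardised_grade: str) -> str:
    """Fall back to the teacher-assessed grade when the cohort is too
    small to standardise statistically; otherwise use the model's grade."""
    if cohort_size <= SMALL_COHORT_THRESHOLD:
        return teacher_grade
    return standardised_grade

# A class of 4 keeps its (often more generous) teacher-assessed grade,
# while a class of 30 receives the moderated one - the mechanism that
# in practice favoured smaller, typically private-school, classes.
print(award_grade(4, "A", "B"))   # -> A
print(award_grade(30, "A", "B"))  # -> B
```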

The fundamental issue affecting accuracy is that, without trusting teacher assessments, there is very little recent attainment data from which to determine A-level results meaningfully or fairly for individuals. Had Ofqual chosen to prioritise accuracy for pupils, it may well be that there wasn’t enough good data to create an acceptable model; acknowledging the limitations of existing data is an important part of any algorithmic assessment process.

Ensuring independent scrutiny

Another troubling issue raised throughout the development of the grading model has been the absence of independent, external scrutiny. The Royal Statistical Society identified concerns over the composition of the ‘technical advisory group’, which consisted mainly of ‘government employees or current or former employees of the qualification regulators’,10 and offered to provide independent Fellows; that offer was met with the condition of a strict, five-year non-disclosure agreement.

As the RSS indicated in its written submission to the Education Select Committee’s inquiry into the impact of COVID-19 on education and children’s services, ‘without a stronger procedural basis to ensure statistical rigour, and greater transparency about the issues that Ofqual is examining, it cannot be clear that the statistical methodology will be beyond question’.11

Any system that seeks to operate on the trust and confidence of the public must create conditions that ensure systems are reviewable by independent experts.

Showing evidence of appropriate risk and impact mitigation measures

While Ofqual has asserted that ‘fairness and equality’ are at the heart of its decision making, and that impact assessments were carried out for all consultations, consultation responses criticised the literature review underpinning the equality impact assessment as outdated.

One ex-director of education has commented that not only does the research fail to address the under-predictions for Black students (by teachers), but also ‘does not take into consideration the impact of multiple disadvantages for some protected groups [under the] 2010 Equalities Act, who will be double/triple disadvantaged by low teacher expectations, [and] racial discrimination that is endemic in some schools.’12

While it is encouraging to see these issues acknowledged in the algorithm’s stated intentions, what is missing is a demonstration of how the model attempted to mitigate the various equalities concerns raised in the consultation process. It is important that these steps are taken, and that full impact assessments are made public before a new system is deployed.

Enforcing the legal compliance of algorithmic systems

Under the personal information charter, Ofqual notes that the data subject may ‘object to automatic decision-making and profiling’, adding that ‘Ofqual does not undertake this form of processing’.13 Ofqual’s claim is that it is profiling centres, not individuals, and that the process is not automated because it is subject to human review.

The ICO accepted Ofqual’s interpretation of the GDPR (which has been contested by experts)14 and advised that anyone concerned with ‘how their data has been handled should raise those concerns with the exam boards first, and report to us if they are not satisfied.’15 The ICO has also indicated that it will engage with Ofqual, though no correspondence has yet been made public.

In a similar instance, the Norwegian DPA intervened in the crisis surrounding the International Baccalaureate’s awarding model, notifying the International Baccalaureate Organization of its intention to order the rectification of grades.16 The IB algorithm, like the Ofqual model, factored what it called ‘school context’ and ‘historical data’ into each student’s result to adjust predicted grades. In the advance notification of the order, the Norwegian DPA maintained that ‘the awarding of grades is a form of profiling in the sense of Article 4(4) GDPR’ and concluded that the algorithm violated GDPR Articles 5(1)(a) and (d).17

Given this divergence in regulatory response and interpretation-driven application of the GDPR, the current crisis illustrates the need to define the role and level of intervention expected of regulators where existing guidance fails to deliver definitive instruction.

Setting up systems of redress

Another part of algorithmic accountability is a system of recourse and redress for decisions made by the algorithm. To understand whether this responsibility was discharged, it is worth interrogating in detail the process for contesting the algorithm’s decisions.

Schools, on behalf of students, were given powers to appeal the results assigned to them. However, days after algorithmically generated results had been assigned, the process for seeking redress from the decision of the algorithm was still contested, despite decisions such as university admission already being made on the basis of those results.

The day before results were released, the Education Secretary changed the appeals process, allowing students to appeal and be awarded their mock results. Two days after grades were released, Ofqual published criteria for a ‘valid’ mock exam. By that evening, those criteria had already been withdrawn.18 At the time of writing, no further guidance has been issued on appeals, and it is still unclear, now that students will receive the higher of their teacher-assessed or moderated grades, whether mock appeals will still be allowed.

Students were already being accepted or rejected by universities and other schemes on the basis of the algorithmically generated results, and universities were filling their places for the year. Following a successful appeal, students might have had their A-level grades corrected to reflect their ability, but they would already have suffered the material consequences of the lower grades they were assigned.

Students were offered another form of redress: the chance to take exams in autumn 2020 or summer 2021. So those with grades unfairly moderated down faced forgoing university or full-time employment for a year, and the additional cost of continued study to maintain the standard needed to achieve their expected grade.

Even if the appeals process had been announced well in advance of grades being released to pupils, and had remained stable, it would still have been inadequate. Algorithmic redress can only be meaningful and useful if that redress can reverse the consequences of an incorrect original judgement.

The systems of redress were inadequate: schools should have been able to appeal in the weeks before grades were released, allowing the algorithmically assigned grades to be adjusted before they had a material impact. The process also exacerbated inequalities, placing an additional burden on those unwilling to accept an unfair decision and favouring those with the means to support an appeal.

Conclusion

The use of algorithms in the public sector has great potential to improve the lives of citizens when used well. However, public trust is crucial to ensuring these systems are legitimate and effective. High-profile failures like this A-level algorithm leave the public justifiably wary about the use of algorithms in future, however necessary they might be.

A more transparent, accountable and inclusive process in the deployment of algorithms, addressing the issues highlighted above, could help to earn back that trust.

The Ada Lovelace Institute is already scrutinising and evaluating the use of public sector algorithms in our Algorithm accountability programme.

 

  1. Roger Taylor (2020) The fairest possible way to recognise students’ achievements this year, GOV.UK. Available at: https://www.gov.uk/government/news/the-fairest-possible-way-to-recognise-students-achievements-this-year-by-roger-taylor-chair (Accessed: 18 August 2020).
  2. Cath Jadhav (2020) Reflections on the Summer Symposium, GOV.UK. Available at: https://ofqual.blog.gov.uk/2020/07/24/reflections-on-the-summer-symposium/ (Accessed: 18 August 2020).
  3. Turner, C. (2020) ‘Exam regulator fears it could be scrapped as it warns ministers not to make it a “scapegoat”’, The Telegraph, 17 August. Available at: https://www.telegraph.co.uk/news/2020/08/17/exam-regulator-fears-will-become-next-public-health-england/ (Accessed: 18 August 2020).
  4. Ada Lovelace Institute and Traverse (2020) Confidence in a crisis? Building public trust in a contact tracing app. Available at: https://www.adalovelaceinstitute.org/our-work/covid-19/confidence-in-a-crisis/ (Accessed: 18 August 2020).
  5. House of Commons Education Select Committee (2020) Young people risk missing out on deserved results in this year’s system for awarding grades, MPs warn. Available at: https://committees.parliament.uk/committee/203/education-committee/news/147332/young-people-risk-missing-out-on-deserved-results-in-this-years-system-for-awarding-grades-mps-warn/ (Accessed: 18 August 2020).
  6. Ofqual (2020) Ofqual Summer Symposium 2020, GOV.UK. Available at: https://www.gov.uk/government/news/ofqual-summer-symposium-2020 (Accessed: 18 August 2020).
  7. Ofqual (2020) Awarding GCSE, AS & A levels in summer 2020: interim report, GOV.UK. Available at: https://www.gov.uk/government/publications/awarding-gcse-as-a-levels-in-summer-2020-interim-report (Accessed: 18 August 2020). pp. 76-79.
  8. Ofqual (2020) Awarding GCSE, AS & A levels in summer 2020: interim report, GOV.UK. Available at: https://www.gov.uk/government/publications/awarding-gcse-as-a-levels-in-summer-2020-interim-report (Accessed: 18 August 2020). pp. 80-81.
  9. Ofqual (2020) Awarding GCSE, AS & A levels in summer 2020: interim report, GOV.UK. Available at: https://www.gov.uk/government/publications/awarding-gcse-as-a-levels-in-summer-2020-interim-report (Accessed: 18 August 2020). p. 126.
  10. Ashby, D. and Witherspoon, S. (2020) ‘Letter to Ed Humpherson, Director General for Regulation, Office for Statistics Regulation’. Available at: https://rss.org.uk/RSS/media/News-and-publications/News/2020/14-08-2020-Letter-Deborah-Ashby-Sharon-Witherspoon-to-OSR.pdf (Accessed: 18 August 2020).
  11. Royal Statistical Society (2020) ‘Royal Statistical Society response to the House of Commons Education Select Committee call for evidence: The impact of COVID-19 on education and children’s services inquiry’. Available at: https://rss.org.uk/RSS/media/File-library/Policy/RSS-response-to-Education-Select-Committee-exam-grades-24062020.pdf (Accessed: 18 August 2020).
  12. Ofqual (2020) Analysis of Consultation Responses: Exceptional arrangements for exam grading and assessment in 2020. Available at: https://www.gov.uk/government/consultations/exceptional-arrangements-for-exam-grading-and-assessment-in-2020#history (Accessed: 18 August 2020).
  13. Ofqual (2020) ‘Ofqual privacy impact statement: 2020 grading’. Available at: https://www.gov.uk/government/organisations/ofqual/about/personal-information-charter#privacy-impact-statement-for-2020-grading (Accessed: 18 August 2020).
  14. Binns, R. (2020) ‘This is an absurd interpretation of art 22. – The fact that teachers determine rank order is not sufficient, because rank order is an *input* to the algorithm, but the algo makes the final decision. – Reviewing *centres* is not same as reviewing individual student grades. https://t.co/6auQmAa8h6’, Twitter. Available at: https://twitter.com/RDBinns/status/1295050474027089921 (Accessed: 18 August 2020)
  15. Information Commissioner’s Office (2020) Statement in response to exam results. ICO. Available at: https://ico.org.uk/about-the-ico/news-and-events/news-and-blogs/2020/08/statement-in-response-to-exam-results/ (Accessed: 18 August 2020).
  16. Note: “The purpose of an advance notification is to allow for contradiction by 21 August 2020 (extended). In other words, this is not a final decision, but rather a draft decision. Before taking a final decision, we will take into account the views of the IBO.” Datatilsynet (2020) The Norwegian DPA intends to order rectification of IB grades, Datatilsynet. Available at: https://www.datatilsynet.no/en/news/2020/the-norwegian-dpa-intends-to-order-rectification-of-ib-grades/ (Accessed: 18 August 2020).
  17. Datatilsynet (2020) ‘Advance notification of order to rectify unfairly processed and incorrect personal data – International Baccalaureate Organization’. Available at: https://www.datatilsynet.no/contentassets/04df776f85f64562945f1d261b4add1b/advance-notification-of-order-to-rectify-unfairly-processed-and-incorrect-personal-data.pdf (Accessed: 18 August 2020).
  18. BBC News (2020) ‘“Huge mess” as exams appeal guidance withdrawn’, 16 August. Available at: https://www.bbc.com/news/education-53795831 (Accessed: 18 August 2020).
