The Hidden Balancing Act: How A Level Grades Are Really Awarded

Introduction: Grading as Art and Science

Every summer, A Level results spark a flurry of headlines, debates, and judgments about the state of England’s education system. Politicians argue over whether rising grades prove improvement or “grade inflation.” Newspapers ask if exams are tougher or easier than in the past. Parents and students anxiously wonder what those letters really mean.

From the outside, grading can appear to be a purely technical process. Examination boards publish formal protocols, set grade boundaries, and issue statistical analyses. The procedures resemble an algorithm: scripts are marked, thresholds are plotted, and distribution curves are monitored. Yet behind the mathematics lies something more human and more complex.

As Paul E. Newton (2022) emphasises, A Level grading is not a mechanical formula but a balancing act. It combines examiner judgment, statistical modelling, and wider cultural pressures. It is at once scientific and interpretive, shaped both by data and by professional expertise. To understand how A Levels are really awarded, we must move past myths of fixed quotas or rigid statistical caps and recognise the subtle interplay between consistency, fairness, and expectation.

Attainment-Referencing Explained

The central mechanism for awarding A Levels is known as attainment-referencing. This approach anchors grades not to fixed quotas (norm-referencing) or universal descriptors (pure criterion-referencing), but to comparisons with previous years.

Examiner judgment at the centre

Senior examiners play the pivotal role. They review a range of current scripts alongside “archive scripts” from earlier years, especially at the crucial grade boundaries. For example, if a script from a past cohort was judged to be a borderline B, examiners ask whether a similar script from this year demonstrates equivalent attainment.

This is not mechanical number-crunching. It requires professional judgment, subject expertise, and interpretive skill. Examiners must decide whether performance has been maintained, declined, or improved in a way that warrants boundary adjustments.

Statistical checks as safeguards

Alongside judgment, examination boards deploy statistical models to safeguard consistency. Prediction matrices, built from students’ prior GCSE results and other data, estimate what proportion of students might be expected to achieve each grade in a given year. These models do not dictate the outcomes but act as “sense checks.”
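To make the idea concrete, the sketch below shows in Python how a simplified prediction matrix might combine a cohort’s prior-attainment mix with historical outcome rates. The bands, percentages, and function names are invented for illustration; they are not the matrices or methods used by Ofqual or the examination boards.

```python
# Illustrative sketch only: the bands, proportions, and structure below are
# invented for this example, not the actual matrices used by awarding bodies.

# Historical share of each A Level grade achieved by students in each
# prior-attainment (mean GCSE) band, taken from a reference year.
PREDICTION_MATRIX = {
    "high":   {"A*": 0.30, "A": 0.40, "B": 0.20, "C or below": 0.10},
    "middle": {"A*": 0.05, "A": 0.20, "B": 0.40, "C or below": 0.35},
    "low":    {"A*": 0.01, "A": 0.05, "B": 0.24, "C or below": 0.70},
}

def predict_grade_distribution(cohort_counts):
    """Estimate this year's grade proportions from the cohort's band mix.

    cohort_counts maps each prior-attainment band to the number of entrants,
    e.g. {"high": 120, "middle": 300, "low": 80}.
    """
    total = sum(cohort_counts.values())
    predicted = {}
    for band, entrants in cohort_counts.items():
        for grade, share in PREDICTION_MATRIX[band].items():
            predicted[grade] = predicted.get(grade, 0.0) + share * entrants / total
    return predicted

if __name__ == "__main__":
    cohort = {"high": 120, "middle": 300, "low": 80}
    for grade, proportion in predict_grade_distribution(cohort).items():
        print(f"{grade}: {proportion:.1%}")
```

On this toy cohort the prediction is simply a weighted average of each band’s historical outcomes: prior attainment in, expected grade distribution out.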

If examiners’ proposed grade boundaries would result in unusually high or low proportions of top grades compared with the predictions, the discrepancy is investigated. This balance ensures that grading is neither arbitrary nor rigidly fixed: judgment keeps the process sensitive to actual student work, while statistics prevent dramatic or unjustified fluctuations.
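Continuing the toy example, a “sense check” of this kind could be expressed as a simple comparison between proposed and predicted proportions, flagging any grade that drifts beyond a reporting tolerance. The three-percentage-point threshold and the function name are arbitrary choices for the sketch, not regulatory rules.

```python
# Illustrative only: the tolerance is an arbitrary value chosen for the sketch,
# not a threshold used by any regulator or examination board.

def flag_for_review(proposed, predicted, tolerance=0.03):
    """Return the grades whose proposed proportion departs from the
    prediction by more than the tolerance, for examiners to revisit."""
    return {
        grade: {"proposed": proposed.get(grade, 0.0), "predicted": predicted[grade]}
        for grade in predicted
        if abs(proposed.get(grade, 0.0) - predicted[grade]) > tolerance
    }

# Example: outcomes at the examiners' recommended boundaries vs. the prediction
proposed = {"A*": 0.14, "A": 0.26, "B": 0.33, "C or below": 0.27}
predicted = {"A*": 0.09, "A": 0.24, "B": 0.34, "C or below": 0.33}
print(flag_for_review(proposed, predicted))
# Flags "A*" and "C or below" for investigation, not for automatic change.
```

As the section stresses, a flag triggers investigation rather than an automatic adjustment: the final decision remains with examiner judgment.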

This dual system — judgment plus statistics — has defined A Level grading for over seventy years. It underpins the credibility of results and protects students from unfair outcomes while allowing standards to evolve gradually.

Criterion-Referencing Dreams vs. Reality

In the 1980s and 1990s, policymakers championed criterion-referencing as an alternative model. The idea was simple: students should be assessed against fixed performance criteria, not relative comparisons with others.

The promise of fixed standards

This approach promised clarity. Grade descriptions would be published, making it clear what counted as an A, B, or C. Teachers and students could prepare with a clear target, and universities could interpret results consistently across years. Criterion-referencing appeared to offer transparency and fairness.

Why it failed in practice

In reality, criterion-referencing proved unworkable at scale. Written descriptors could not capture the diversity of actual student responses. Examiners still had to interpret borderline cases, and interpretations inevitably varied.

The problem was particularly acute across subjects. A sophisticated literary essay in English cannot be judged by the same type of fixed descriptors as a physics solution or a mathematics proof. Attempts to design universal criteria quickly broke down.

As a result, criterion-referencing never fully displaced attainment-referencing. Examiners continued to rely on comparisons with previous cohorts, using descriptors as supporting guidance rather than absolute rules. The aspiration to “fix” standards permanently remained more rhetorical than real.

The Climate of Expectation

Even within attainment-referencing, grades are not immune to broader cultural pressures. From the 1980s onwards, A Level results showed a steady rise in pass rates and top grades. Critics spoke of “grade inflation” and falling standards. Newton suggests a different explanation: the influence of a climate of expectation.

The push for improvement

Governments set targets for schools, league tables measured performance, and media attention intensified public scrutiny. Parents wanted evidence of continual improvement, and teachers were under pressure to demonstrate rising success rates. Within this environment, examiners marking borderline scripts faced subtle but powerful expectations.

The examiner’s dilemma

When a candidate’s work teetered on the edge between two grades, the prevailing cultural belief that outcomes should show progress could tip the decision. Examiners were not consciously inflating marks but responding to a context that favoured generosity. Over time, these small, marginal decisions accumulated into a systemic upward drift.

International comparisons

This phenomenon is not unique to England. The United States has repeatedly revised the SAT to counter perceptions of grade inflation. The International Baccalaureate has also wrestled with tensions between maintaining global standards and responding to local expectations.

England’s system, however, stands out for its ongoing reliance on examiner judgment. Unlike systems that lean more heavily on statistical scaling, A Levels deliberately preserve a central role for professional discretion. This protects the human dimension of assessment but also makes the system more permeable to cultural and political pressures.

Limitations and Built-In Uncertainty

No grading system can measure learning with perfect precision. Newton highlights several limitations and sources of uncertainty that are intrinsic to the process.

Change over time

Syllabuses evolve, teaching methods shift, and the composition of student cohorts changes. Comparing a student in 2023 with one in 1973 is fraught with difficulty, as the knowledge assessed and the social context are no longer the same. Attainment-referencing preserves continuity year by year but cannot provide absolute comparability across decades.

Reliance on examiner judgment

Even with rigorous training, examiners are human. Different markers may interpret borderline responses differently, particularly in essay-based subjects where evaluation involves qualitative judgment. Statistical monitoring helps identify inconsistencies but cannot eliminate variation.

Pressures of reform and crisis

Moments of systemic change intensify uncertainty. The introduction of new A Level specifications in the 2010s required careful application of the “comparable outcomes” approach to protect the first cohorts sitting the reformed qualifications. During the COVID-19 pandemic, when exams were cancelled, the attempt to replace them with statistical models provoked public outcry and revealed how fragile trust can be when grades are divorced from examiner judgment.

In short, uncertainty is not a flaw to be eradicated but a condition to be managed. Recognising this helps schools, policymakers, and the public interpret results with realism rather than alarm.

Conclusion: The Balancing Act of Fairness

The story of A Level grading is not one of fixed quotas, statistical ceilings, or unchanging standards. It is the story of a balancing act. Examiners weigh current scripts against historical benchmarks, regulators balance consistency with fairness, and society balances its demand for improvement with its need for credibility.

Newton’s analysis underscores that fairness in grading cannot be engineered by algorithm or decree. It depends on the professional judgment of examiners, informed by data but not reduced to it. It requires transparency about limitations and humility about uncertainty.

International comparisons confirm that all systems face similar challenges. England’s distinctive reliance on examiner discretion brings risks but also strengths: it resists the reduction of learning to numbers alone and maintains a human element at the heart of assessment.

Ultimately, the credibility of A Levels rests on trust in this balancing act. The grades students receive are not statistical artefacts or arbitrary judgments, but the outcome of a complex, carefully moderated process. For educators, policymakers, and academics, the lesson is clear: meaningful assessment is always a hybrid of art and science, sustained not by myths or fixes but by continuous, professional negotiation of fairness.