What 150 AQA GCSE English scripts tell us - and what it means for your classroom
Marking.ai recently pitted a range of large-language-model (LLM) markers against real teacher grades on 150 authentic AQA GCSE English answers. We wanted to know one thing teachers always ask:
Can AI really mark as accurately as I do?
Below is a teacher-friendly summary of what we found, why it matters, and how you can use these insights to save time without sacrificing fairness or professional judgement.
NOTE: This study was completed before GPT-5 was released; however, we have since run a new study on that model and will release the findings shortly.
The Study at a Glance
| What we tested | Why it matters to teachers |
| --- | --- |
| 150 student responses (Paper 1 & Paper 2) across all question types: MCQs, 4–12 mark analysis, 16-mark comparison essays, 40-mark creative pieces | Reflects the full spread of marking you face each term |
| Human marks = the gold standard | Provides a real-world benchmark |
| 11 different LLMs, including Marking.ai’s first model, OpenAI GPT and Google Gemini variants, and Anthropic Claude | Shows how today’s best AI stacks up |
| Three accuracy metrics: Mean Absolute Error (MAE), correlation, exact-match % | Captures both “how close” and “how consistent” |
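For readers who like to see the maths, here is a minimal Python sketch of how these three metrics can be computed from paired (AI, teacher) marks. The marks below are invented for illustration; this is not our evaluation code.

```python
# Minimal sketch of the three accuracy metrics, using made-up marks.
# Requires Python 3.10+ for statistics.correlation.
import statistics

teacher_marks = [4, 12, 16, 33, 7]  # hypothetical teacher-awarded marks
ai_marks = [4, 11, 17, 30, 7]       # hypothetical AI marks for the same answers

# Mean Absolute Error: on average, how far the AI lands from the teacher.
mae = statistics.mean(abs(a - t) for a, t in zip(ai_marks, teacher_marks))

# Pearson correlation: do AI and teacher scale scripts consistently?
corr = statistics.correlation(ai_marks, teacher_marks)

# Exact-match %: how often the AI awards exactly the teacher's mark.
exact = 100 * sum(a == t for a, t in zip(ai_marks, teacher_marks)) / len(teacher_marks)

print(f"MAE: {mae:.2f}, correlation: {corr:.2f}, exact-match: {exact:.0f}%")
```

A low MAE means the AI lands close to the teacher; a high correlation means it ranks scripts the same way; exact-match shows how often it agrees to the mark.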
The Headlines
- Top AI models are now, on average, < ±1 mark away from teachers.
  - GPT-o3 averaged 1.07 marks off across all questions, with a 0.96 correlation to teacher scores. Given that Questions 4 and 5 were worth 16 and 40 marks respectively, that is remarkably accurate. And it’s only getting better.
- Almost half of all AI marks were identical to the teacher’s.
  - GPT-o3 matched teachers 46% of the time; Gemini 2.5 Pro hit 45.6%.
- AI is strongest on objective or tightly scaffolded questions.
  - Multiple-choice and 4-mark “retrieve the facts” tasks were marked 100% correctly by most top-tier models.
- Extended creative writing is still the toughest nut to crack.
  - Even the best AI was ±3 marks on a 40-mark task and hit the exact score only ~10% of the time.
- Our first Marking.ai engine was good - but upgradeable.
  - Our first “Control” model sat mid-table (MAE 1.61; 38% exact). Swapping in GPT-o3 cuts the average error by a third and raises exact matches to nearly one in two scripts. A huge upgrade for our users.
What This Means for Your Marking Load
| Task type | Teacher pain-point | GPT-o3 (Marking.ai’s current model) | Average AI model tested | Practical takeaway |
| --- | --- | --- | --- | --- |
| MCQs & 4-mark retrieval | Tedious tick-boxing that steals planning time | Perfect - hits every mark on our dataset | Near-perfect accuracy across most models | Let AI auto-mark; just spot-check. |
| Short-answer (4–8 marks) | Dozens of scripts, repetitive criteria | Off by 0.2 marks in 4 out of 5 answers | AI off by < 1 mark roughly two-thirds of the time | AI first pass + quick skim saves hours. |
| 8–12 mark analysis | Applying level descriptors consistently | Usually within 1 mark; strong rubric alignment | Typically within 2 marks; borderline calls need a look | Use AI for draft scores & comments; adjust borderlines. |
| 16-mark comparison essay | High cognitive load, fine grade boundaries | Within ~2 marks on average, good banding | AI within 2 marks on average; still needs oversight | AI bands essays; teacher reviews top & bottom scripts. |
| 40-mark creative writing | Most subjective; biggest time sink | Within 2.8 marks (±7% of total marks) | Within 4 marks (±10% of total) | AI provides rubric-aligned draft mark & feedback; teacher fine-tunes final grade. |
Why Some Models Out-Scored Others
- Reasoning optimisation: GPT-o3 is tuned specifically for multi-step logic, mirroring a teacher’s rubric-checking process.
- Bias calibration: Claude tended to under-mark by ~1.5 marks; GPT-4o-mini over-marked by ~1.1 marks. Even a small consistent bias quickly adds up to big grade noise (a simple calibration sketch follows this list).
- Context window: Long creative pieces plus detailed mark schemes blow past smaller models’ attention limits, reducing accuracy.
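The bias point is easy to act on. Below is a hedged Python sketch of one simple way to correct a consistent bias: estimate the model’s average signed error on a small human-moderated sample, then subtract it from later marks. The sample numbers and the `estimate_bias`/`calibrate` helpers are hypothetical illustrations, not Marking.ai’s actual method.

```python
# Hypothetical illustration of bias calibration (not Marking.ai's actual method).
# Idea: a model with a *consistent* signed error can be nudged back on target
# by estimating that error on a small human-moderated sample.
import statistics

def estimate_bias(ai_marks, teacher_marks):
    """Average signed error: positive means the model over-marks."""
    return statistics.mean(a - t for a, t in zip(ai_marks, teacher_marks))

def calibrate(ai_mark, bias, max_mark):
    """Subtract the estimated bias and clamp to the valid mark range."""
    return min(max(round(ai_mark - bias), 0), max_mark)

# Made-up moderated sample for a model that under-marks by roughly 1.5:
teacher_sample = [10, 12, 8, 14, 11]
ai_sample = [8, 11, 6, 13, 9]

bias = estimate_bias(ai_sample, teacher_sample)  # -1.6: the model under-marks
print(calibrate(9, bias, max_mark=16))           # a raw 9/16 becomes 11/16
```

In practice this mirrors the “mark a handful, moderate them, then upload the rest” workflow described below.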
Using AI Marking Safely and Effectively
- Keep the teacher in the loop.
AI’s first pass can cut your workload by up to 50%, but the DfE reminds us that teachers must review final grades.
- Mark a handful, moderate them, then upload the rest.
Uploading a few submissions and moderating them helps you identify where the marking guide could be tweaked to assist the AI. Once you’ve updated the guide in-app and are happy with it, upload the remaining submissions; only spot checks remain.
- Lean hardest on AI where the rubric is crystal clear.
Let it shoulder the retrieval-style tasks, short analyses, and essays so you can focus on anything more nuanced.
- Make AI show its working.
Marking.ai’s question-by-question mark breakdown means every awarded mark links back to a rubric point - exactly what moderators want.
- Use the justification as feedback gold.
Students love immediate, specific pointers. Marking.ai drafts these in seconds; you polish the nuance and can share it directly with your student through a link (no student login required).
What’s Next for Marking.ai
- Model upgrade: We’ve already integrated GPT-o3-level reasoning into production and continue to test the latest models (hint: by the time you read this, we are likely using GPT-5 😉) for opportunities to improve accuracy.
- Continuous fine-tuning: We continually and manually collect new human-marked scripts, which feed back to sharpen the AI.
- Teacher controls: ‘Remark’ and ‘enhance feedback’ options have just hit production to put you firmly in charge.
Bottom Line for Teachers
AI marking isn’t replacing your expertise - it’s replacing your drudge work.
- Objective tasks? Let the AI handle them.
- Subjective nuance? Stay in the driver’s seat, but with a GPS that’s already plotted the route.
- Feedback for pupils? Faster, richer, instantly tied to the rubric.
The latest data shows AI can now mark almost as accurately as a trained examiner for much of the GCSE English paper - and it’s only getting better. Harness it thoughtfully, and you gain back evenings, reduce marking fatigue, and deliver more timely feedback than ever before.
Ready to see it in action? Book a 15-minute demo or try a free pilot set of scripts by signing up.