How Close Is AI to Human Marking Accuracy?


What 150 AQA GCSE English scripts tell us - and what it means for your classroom

Marking.ai recently pitted a range of large-language-model (LLM) markers against real teacher grades on 150 authentic AQA GCSE English answers. We wanted to know one thing teachers always ask:

Can AI really mark as accurately as I do?

Below is a teacher-friendly summary of what we found, why it matters, and how you can use these insights to save time without sacrificing fairness or professional judgement.

 

NOTE: This study was concluded before GPT-5 was released; however, we have since conducted a new study on that model and will release the findings shortly.

 


The Study at a Glance

| What we tested | Why it matters to teachers |
| --- | --- |
| 150 student responses (Paper 1 & Paper 2) across all question types: MCQs, 4–12 mark analysis, 16-mark comparison essays, 40-mark creative pieces | Reflects the full spread of marking you face each term |
| Human marks = the gold standard | Provides a real-world benchmark |
| 11 different LLMs, including Marking.ai’s first model, OpenAI’s GPT and Google Gemini variants, Anthropic Claude, etc. | Shows how today’s best AI stacks up |
| Three accuracy metrics: Mean Absolute Error (MAE), Correlation, Exact-Match % | Captures both “how close” and “how consistent” |
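For readers who want to see exactly how these three metrics work, here is a minimal sketch in Python. The function names and sample marks are illustrative only, not the study’s actual data or code.

```python
# Sketch of the three accuracy metrics, computed over paired
# (teacher mark, AI mark) scores. Sample data is illustrative.

def mae(teacher, ai):
    """Mean Absolute Error: average distance from the teacher's mark."""
    return sum(abs(t - a) for t, a in zip(teacher, ai)) / len(teacher)

def pearson(teacher, ai):
    """Correlation: do AI marks rise and fall with teacher marks?"""
    n = len(teacher)
    mt, ma = sum(teacher) / n, sum(ai) / n
    cov = sum((t - mt) * (a - ma) for t, a in zip(teacher, ai))
    sd_t = sum((t - mt) ** 2 for t in teacher) ** 0.5
    sd_a = sum((a - ma) ** 2 for a in ai) ** 0.5
    return cov / (sd_t * sd_a)

def exact_match(teacher, ai):
    """Exact-Match %: share of AI marks identical to the teacher's."""
    return 100 * sum(t == a for t, a in zip(teacher, ai)) / len(teacher)

# Illustrative marks for five scripts (not real study data)
teacher = [4, 12, 9, 30, 16]
ai      = [4, 11, 9, 27, 15]

print(mae(teacher, ai))          # average error, in marks
print(pearson(teacher, ai))      # close to 1 = consistent ranking
print(exact_match(teacher, ai))  # percentage of identical marks
```

MAE captures “how close”, correlation captures “how consistent”, and exact-match is the strictest test of all: the AI awarded precisely the same mark as the teacher.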


The Headlines

  1. Top AI models now average within about 1 mark of teachers.

    • GPT-o3 averaged 1.07 marks off across all questions, with a 0.96 correlation to teacher scores. Considering that questions 4 and 5 were worth 16 and 40 marks respectively, that is remarkably accurate - and it’s only getting better.

  2. Almost half of all AI marks were identical to the teacher’s.

    • GPT-o3 matched teachers 46% of the time; Gemini 2.5 Pro hit 45.6%.

  3. AI is strongest on objective or tightly-scaffolded questions.

    • Multiple-choice and 4-mark “retrieve the facts” tasks were marked 100% correctly by most top-tier models.

  4. Extended creative writing is still the toughest nut to crack.

    • Even the best AI was off by ~3 marks on a 40-mark task and hit the exact score only ~10% of the time.

  5. Our first Marking.ai engine was good - but upgradeable.

    • Our first “Control” model sat mid-table (MAE 1.61; 38% exact). Swapping in GPT-o3 cut the average error by a third and raised exact matches to nearly 1 in 2 scripts. A huge upgrade for our users.


What This Means for Your Marking Load



| Task type | Teacher pain-point | GPT-o3 (Marking.ai’s current model) | Average AI model tested | Practical takeaway |
| --- | --- | --- | --- | --- |
| MCQs & 4-mark retrieval | Tedious tick-boxing that steals planning time | Perfect - hits every mark on our dataset | Near-perfect accuracy across most models | Let AI auto-mark; just spot-check. |
| Short-answer (4–8 marks) | Dozens of scripts, repetitive criteria | Off by 0.2 marks in 4 out of 5 answers | AI off by < 1 mark roughly two-thirds of the time | AI first pass + quick skim saves hours. |
| 8–12 mark analysis | Applying level descriptors consistently | Usually within 1 mark; strong rubric alignment | Typically within 2 marks; borderline calls need a look | Use AI for draft scores & comments; adjust borderlines. |
| 16-mark comparison essay | High cognitive load, fine grade boundaries | Within ~2 marks on average, good banding | AI within 2 marks on average; still needs oversight | AI bands essays; teacher reviews top & bottom scripts. |
| 40-mark creative writing | Most subjective; biggest time sink | Within 2.8 marks (±7% of total marks) | Within 4 marks (±10% of total) | AI provides rubric-aligned draft mark & feedback; teacher fine-tunes final grade. |

 


Why Some Models Out-Scored Others

  • Reasoning optimisation: GPT-o3 is tuned specifically for multi-step logic, mirroring a teacher’s rubric-checking process.

  • Bias calibration: Claude tended to under-mark by ~1.5 marks; GPT-4o-mini over-marked by ~1.1. Even a small but consistent bias quickly adds up to noticeable grade drift across a class.

  • Context window: Long creative pieces plus detailed mark schemes blow past smaller models’ attention limits, reducing accuracy.
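The bias point above is easy to check for yourself: a mean *signed* error (as opposed to the absolute error used for MAE) reveals whether a model consistently over- or under-marks. A minimal sketch, with illustrative numbers rather than the study’s data:

```python
# Mean signed error: positive = model over-marks on average,
# negative = model under-marks. Sample marks are illustrative.

def signed_bias(teacher, ai):
    """Average of (AI mark - teacher mark) across all scripts."""
    return sum(a - t for t, a in zip(teacher, ai)) / len(teacher)

teacher = [10, 14, 8, 22, 30]
under   = [9, 12, 7, 20, 29]    # a model that tends to under-mark
over    = [11, 15, 10, 23, 31]  # a model that tends to over-mark

print(signed_bias(teacher, under))  # negative: systematic under-marking
print(signed_bias(teacher, over))   # positive: systematic over-marking

# A consistent bias, once measured, can be corrected with a flat offset:
calibrated = [a - signed_bias(teacher, over) for a in over]
print(signed_bias(teacher, calibrated))  # ~0 after calibration
```

This is why a model with a small but consistent bias can often be calibrated back towards teacher marks, whereas an inconsistent model cannot.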


Using AI Marking Safely and Effectively

  1. Keep the teacher in the loop.
    AI’s first pass can cut your workload by up to 50%, but the DfE reminds us that teachers must review final grades.


  2. Mark a handful, moderate them, then upload the rest.
    Upload a few submissions and moderate them to identify where the marking guide could be tweaked to assist the AI. Once you’ve updated the guide in-app and are happy with the results, upload the remaining submissions - only spot checks remain.


  3. Lean hardest on AI where the rubric is crystal clear.
    Let it shoulder the retrieval-style tasks, short-analysis questions and first-pass essay marking, so you can focus on the more nuanced judgements.


  4. Make AI show its working.
    Marking.ai’s question-by-question mark breakdown means every awarded mark links back to a rubric point - exactly what moderators want.


  5. Use the justification as feedback gold.
    Students love immediate, specific pointers. Marking.ai drafts these in seconds; you polish the nuance and can share it directly with your student through a link (no student login required).



What’s Next for Marking.ai

  • Model upgrade: We’ve already integrated GPT-o3-level reasoning into production, and we continue to test the latest models for opportunities to improve accuracy (hint: by the time you read this, we are likely using GPT-5 😉).

  • Continuous fine-tuning: We continually collect new human-marked scripts by hand, which feed back to sharpen the AI.

  • Teacher controls: ‘Remark’ and ‘enhance feedback’ options have just hit production to put you firmly in charge.


Bottom Line for Teachers

AI marking isn’t replacing your expertise - it’s replacing your drudge work.

  • Objective tasks? Let the AI handle them.

  • Subjective nuance? Stay in the driver’s seat, but with a GPS that’s already plotted the route.

  • Feedback for pupils? Faster, richer, instantly tied to the rubric.

The latest data shows AI can now mark almost as accurately as a trained examiner for much of the GCSE English paper - and it’s only getting better. Harness it thoughtfully, and you gain back evenings, reduce marking fatigue, and deliver more timely feedback than ever before.

 

Ready to see it in action? Book a 15-minute demo or try a free pilot set of scripts by signing up.

 

ABOUT AUTHOR

Marking.ai

Our blog posts are created by our skilled team of Marking.ai content creators. We aim to provide you with informative, insightful and industry related content. If you have any queries, please reach out to us at connect@marking.ai.
