How Close Is AI to Human Marking Accuracy?


What 150 AQA GCSE English scripts tell us - and what it means for your classroom

Marking.ai recently pitted a range of large-language-model (LLM) markers against real teacher grades on 150 authentic AQA GCSE English answers. We wanted to know one thing teachers always ask:

Can AI really mark as accurately as I do?

Below is a teacher-friendly summary of what we found, why it matters, and how you can use these insights to save time without sacrificing fairness or professional judgement.

 

NOTE: This study was concluded before GPT-5 was released; however, we have since conducted a new study on that model and will release the findings shortly.

 


The Study at a Glance

| What we tested | Why it matters to teachers |
| --- | --- |
| 150 student responses (Paper 1 & Paper 2) across all question types: MCQs, 4–12 mark analysis, 16-mark comparison essays, 40-mark creative pieces | Reflects the full spread of marking you face each term |
| Human marks = the gold standard | Provides a real-world benchmark |
| 11 different LLMs, including Marking.ai’s first model, OpenAI’s GPT and Google Gemini variants, Anthropic Claude, etc. | Shows how today’s best AI stacks up |
| Three accuracy metrics: Mean Absolute Error (MAE), Correlation, Exact-Match % | Captures both “how close” and “how consistent” |
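For readers who want to see exactly how these three metrics work, here is a minimal sketch in Python. The function names and sample marks are illustrative only, not the study’s actual data or code.

```python
# Sketch of the three accuracy metrics, computed over paired
# (teacher mark, AI mark) scores. Sample data is illustrative.

def mae(teacher, ai):
    """Mean Absolute Error: average distance from the teacher's mark."""
    return sum(abs(t - a) for t, a in zip(teacher, ai)) / len(teacher)

def pearson(teacher, ai):
    """Correlation: do AI marks rise and fall with teacher marks?"""
    n = len(teacher)
    mt, ma = sum(teacher) / n, sum(ai) / n
    cov = sum((t - mt) * (a - ma) for t, a in zip(teacher, ai))
    sd_t = sum((t - mt) ** 2 for t in teacher) ** 0.5
    sd_a = sum((a - ma) ** 2 for a in ai) ** 0.5
    return cov / (sd_t * sd_a)

def exact_match(teacher, ai):
    """Exact-Match %: share of AI marks identical to the teacher's."""
    return 100 * sum(t == a for t, a in zip(teacher, ai)) / len(teacher)

# Illustrative marks for five scripts (not real study data)
teacher = [4, 12, 9, 30, 16]
ai      = [4, 11, 9, 27, 15]

print(mae(teacher, ai))          # average error, in marks
print(pearson(teacher, ai))      # close to 1 = consistent ranking
print(exact_match(teacher, ai))  # percentage of identical marks
```

MAE captures “how close”, correlation captures “how consistent”, and exact-match is the strictest test of all: the AI awarded precisely the same mark as the teacher.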


The Headlines

  1. Top AI models now average within about 1 mark of teachers.

    • GPT-o3 averaged 1.07 marks off across all questions, with a 0.96 correlation to teacher scores. Considering that questions 4 and 5 were worth 16 and 40 marks respectively, that is remarkably accurate - and it’s only getting better.

  2. Almost half of all AI marks were identical to the teacher’s.

    • GPT-o3 matched teachers 46% of the time; Gemini 2.5 Pro hit 45.6%.

  3. AI is strongest on objective or tightly-scaffolded questions.

    • Multiple-choice and 4-mark “retrieve the facts” tasks were marked 100% correctly by most top-tier models.

  4. Extended creative writing is still the toughest nut to crack.

    • Even the best AI was off by ~3 marks on a 40-mark task and hit the exact score only ~10% of the time.

  5. Our first Marking.ai engine was good - but upgradeable.

    • Our first “Control” model sat mid-table (MAE 1.61; 38% exact). Swapping in GPT-o3 cut the average error by a third and raised exact matches to nearly 1 in 2 scripts. A huge upgrade for our users.


What This Means for Your Marking Load



| Task type | Teacher pain-point | GPT-o3 (Marking.ai’s current model) | Average AI model tested | Practical takeaway |
| --- | --- | --- | --- | --- |
| MCQs & 4-mark retrieval | Tedious tick-boxing that steals planning time | Perfect - hits every mark on our dataset | Near-perfect accuracy across most models | Let AI auto-mark; just spot-check. |
| Short-answer (4–8 marks) | Dozens of scripts, repetitive criteria | Off by 0.2 marks in 4 out of 5 answers | AI off by < 1 mark roughly two-thirds of the time | AI first pass + quick skim saves hours. |
| 8–12 mark analysis | Applying level descriptors consistently | Usually within 1 mark; strong rubric alignment | Typically within 2 marks; borderline calls need a look | Use AI for draft scores & comments; adjust borderlines. |
| 16-mark comparison essay | High cognitive load, fine grade boundaries | Within ~2 marks on average, good banding | AI within 2 marks on average; still needs oversight | AI bands essays; teacher reviews top & bottom scripts. |
| 40-mark creative writing | Most subjective; biggest time sink | Within 2.8 marks (±7% of total marks) | Within 4 marks (±10% of total) | AI provides rubric-aligned draft mark & feedback; teacher fine-tunes final grade. |

 


Why Some Models Out-Scored Others

  • Reasoning optimisation: GPT-o3 is tuned specifically for multi-step logic, mirroring a teacher’s rubric-checking process.

  • Bias calibration: Claude tended to under-mark by ~1.5 marks; GPT-4o-mini over-marked by ~1.1. Even a small but consistent bias quickly adds up to noticeable grade drift across a class.

  • Context window: Long creative pieces plus detailed mark schemes blow past smaller models’ attention limits, reducing accuracy.
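The bias point above is easy to check for yourself: a mean *signed* error (as opposed to the absolute error used for MAE) reveals whether a model consistently over- or under-marks. A minimal sketch, with illustrative numbers rather than the study’s data:

```python
# Mean signed error: positive = model over-marks on average,
# negative = model under-marks. Sample marks are illustrative.

def signed_bias(teacher, ai):
    """Average of (AI mark - teacher mark) across all scripts."""
    return sum(a - t for t, a in zip(teacher, ai)) / len(teacher)

teacher = [10, 14, 8, 22, 30]
under   = [9, 12, 7, 20, 29]    # a model that tends to under-mark
over    = [11, 15, 10, 23, 31]  # a model that tends to over-mark

print(signed_bias(teacher, under))  # negative: systematic under-marking
print(signed_bias(teacher, over))   # positive: systematic over-marking

# A consistent bias, once measured, can be corrected with a flat offset:
calibrated = [a - signed_bias(teacher, over) for a in over]
print(signed_bias(teacher, calibrated))  # ~0 after calibration
```

This is why a model with a small but consistent bias can often be calibrated back towards teacher marks, whereas an inconsistent model cannot.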


Using AI Marking Safely and Effectively

  1. Keep the teacher in the loop.
    AI’s first pass can cut your workload by up to 50%, but the DfE reminds us that teachers must review final grades.


  2. Mark a handful, moderate them, then upload the rest.
    Upload a few submissions and moderate them to identify where the marking guide could be tweaked to assist the AI. Once you’ve updated the guide in-app and are happy with the results, upload the remaining submissions - only spot checks remain.


  3. Lean hardest on AI where the rubric is crystal clear.
    Let it shoulder the retrieval-style tasks, short-analysis questions and first-pass essay marking, so you can focus on the more nuanced judgements.


  4. Make AI show its working.
    Marking.ai’s question-by-question mark breakdown means every awarded mark links back to a rubric point - exactly what moderators want.


  5. Use the justification as feedback gold.
    Students love immediate, specific pointers. Marking.ai drafts these in seconds; you polish the nuance and can share it directly with your student through a link (no student login required).



What’s Next for Marking.ai

  • Model upgrade: We’ve already integrated GPT-o3-level reasoning into production, and we continue to test the latest models for opportunities to improve accuracy (hint: by the time you read this, we are likely using GPT-5 😉).

  • Continuous fine-tuning: We continually collect new human-marked scripts by hand, which feed back to sharpen the AI.

  • Teacher controls: ‘Remark’ and ‘enhance feedback’ options have just hit production to put you firmly in charge.


Bottom Line for Teachers

AI marking isn’t replacing your expertise - it’s replacing your drudge work.

  • Objective tasks? Let the AI handle them.

  • Subjective nuance? Stay in the driver’s seat, but with a GPS that’s already plotted the route.

  • Feedback for pupils? Faster, richer, instantly tied to the rubric.

The latest data shows AI can now mark almost as accurately as a trained examiner for much of the GCSE English paper - and it’s only getting better. Harness it thoughtfully, and you gain back evenings, reduce marking fatigue, and deliver more timely feedback than ever before.

 

Ready to see it in action? Book a 15-minute demo or try a free pilot set of scripts by signing up.

 

ABOUT AUTHOR

Marking.ai

Our blog posts are created by our skilled team of Marking.ai content creators. We aim to provide you with informative, insightful and industry related content. If you have any queries, please reach out to us at connect@marking.ai.
