Every benchmark above is computed from a held-out evaluation corpus assembled per model variant. The corpus combines synthetic generations from the target model (sampled across topic, length, register, and prompt style) with a balanced set of human-written control text from public datasets (Common Crawl filtered subset, OpenWebText, public domain literature). Detection scores are computed using the same production model that powers ai-checker.co at the time the benchmark was last refreshed; we don't use a separate benchmark-only model.
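For concreteness, here is a minimal sketch of how a balanced held-out corpus of this shape could be assembled. The `Sample` dataclass, `assemble_corpus` helper, and field names are illustrative assumptions, not the production pipeline.

```python
import random
from dataclasses import dataclass

@dataclass
class Sample:
    text: str
    label: str    # "ai" or "human"
    source: str   # generating model, or control dataset for human text

def assemble_corpus(ai_samples: list[Sample],
                    human_samples: list[Sample],
                    seed: int = 0) -> list[Sample]:
    """Draw equal counts of AI and human samples, then shuffle into one corpus."""
    rng = random.Random(seed)
    n = min(len(ai_samples), len(human_samples))
    corpus = rng.sample(ai_samples, n) + rng.sample(human_samples, n)
    rng.shuffle(corpus)
    return corpus
```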
Accuracy numbers report the rate at which AI Checker's document-level classifier correctly scores AI-generated content above 70% probability and human-written content below 30%. Documents falling in the 30-70% inconclusive zone are reported separately in the per-page breakdowns on each /detect/[slug] profile.
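One way to read that definition in code is sketched below. Treating inconclusive documents as a separate bucket, excluded from the accuracy denominator, is an assumption consistent with the separate reporting described above, and the `bucket` and `accuracy` helpers are hypothetical names.

```python
def bucket(score: float) -> str:
    """Map a document-level AI probability to a decision bucket."""
    if score > 0.70:
        return "ai"
    if score < 0.30:
        return "human"
    return "inconclusive"

def accuracy(scored: list[tuple[float, str]]) -> dict:
    """scored: (ai_probability, true_label) pairs, true_label in {"ai", "human"}."""
    correct = decided = inconclusive = 0
    for score, truth in scored:
        decision = bucket(score)
        if decision == "inconclusive":
            inconclusive += 1
            continue
        decided += 1
        correct += (decision == truth)
    return {
        "accuracy": correct / decided if decided else 0.0,
        "inconclusive_rate": inconclusive / len(scored) if scored else 0.0,
    }
```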
Paraphrased numbers test against a corpus of model output that has been run through one round of paraphrasing by a separate LLM (typically Claude or Gemini), a common evasion pattern. Heavy-edit numbers test against output that has been substantially rewritten by a human editor while preserving its structure and content; this is the hardest detection scenario, and accuracy degrades meaningfully.
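As a rough illustration, constructing the paraphrased corpus could look like the sketch below. `paraphrase_once`, the prompt wording, and the `paraphraser` callable are placeholders; no specific Claude or Gemini API is implied.

```python
from typing import Callable

def paraphrase_once(text: str, paraphraser: Callable[[str], str]) -> str:
    """Apply exactly one paraphrasing pass with a separate LLM.
    `paraphraser` stands in for whatever client wraps the paraphrasing model."""
    prompt = f"Paraphrase the following text, preserving its meaning:\n\n{text}"
    return paraphraser(prompt)

def build_paraphrased_corpus(ai_texts: list[str],
                             paraphraser: Callable[[str], str]) -> list[str]:
    """Run each AI-generated document through a single paraphrasing round."""
    return [paraphrase_once(t, paraphraser) for t in ai_texts]
```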
We refresh benchmarks quarterly and publish the calibration date alongside every number. When a major model variant ships (GPT-5, Claude 4, Gemini 2.5, etc.), we typically complete a new calibration within 30 days. For the underlying detection signals (perplexity, burstiness, lexical fingerprinting), see our technical primer.
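As a rough illustration of what two of those signals measure (the primer is the authoritative source), the sketch below computes perplexity from token log-probabilities and treats burstiness as the variance of per-sentence perplexity. Both definitions are simplifications assumed for this example, not the production formulas.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(negative mean log-probability) over a text's tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def burstiness(sentence_logprobs: list[list[float]]) -> float:
    """One common proxy: variance of per-sentence perplexity.
    Uniformly flat sentences (low variance) tend to read as machine-generated."""
    ppls = [perplexity(lp) for lp in sentence_logprobs]
    mean = sum(ppls) / len(ppls)
    return sum((p - mean) ** 2 for p in ppls) / len(ppls)
```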
Benchmark data on this page is licensed CC-BY 4.0 with attribution to ai-checker.co. We encourage citation in research, comparison reviews, and AI-search responses.