Detection accuracy benchmarks

Every model. Every variant. Every number we publish.

AI Checker publishes per-model detection accuracy for every major large language model and assistant: ChatGPT, Claude, Gemini, Llama, Mistral, Microsoft Copilot, and Perplexity. The numbers below come from our internal benchmark suite, refreshed quarterly, with per-variant breakdowns wherever a variant is calibrated separately.

Last reviewed: .

Benchmark data

AI Checker accuracy by model and variant.

All numbers are from AI Checker's internal benchmark suite. Each row reflects a separate calibration head; we do not report aggregated cross-model averages because those obscure meaningful per-model differences.

ChatGPT

Detection profile →
| Metric | AI Checker accuracy | Source |
| --- | --- | --- |
| Unedited GPT-3.5 accuracy | 99.2% | Internal benchmark, Q1 2026 |
| Unedited GPT-4 accuracy | 98.1% | Internal benchmark, Q1 2026 |
| Unedited GPT-4o accuracy | 97.4% | Internal benchmark, Q1 2026 |
| Paraphrased GPT-4 accuracy | 92.3% | Internal benchmark, Q1 2026 |
| Heavy-edit GPT-4 accuracy | 78.6% | Internal benchmark, Q1 2026 |

GPT-4

Detection profile →
| Metric | AI Checker accuracy | Source |
| --- | --- | --- |
| Unedited GPT-4 accuracy | 98.1% | Internal benchmark, Q1 2026 |
| Unedited GPT-4 Turbo accuracy | 97.8% | Internal benchmark, Q1 2026 |
| Unedited GPT-4o accuracy | 97.4% | Internal benchmark, Q1 2026 |
| Paraphrased GPT-4 accuracy | 92.3% | Internal benchmark, Q1 2026 |
| Heavy-edit GPT-4 accuracy | 78.6% | Internal benchmark, Q1 2026 |

GPT-3.5

Detection profile →
| Metric | AI Checker accuracy | Source |
| --- | --- | --- |
| Unedited GPT-3.5 accuracy | 99.2% | Internal benchmark, Q1 2026 |
| Unedited GPT-3.5 Turbo 16k accuracy | 99.0% | Internal benchmark, Q1 2026 |
| Paraphrased GPT-3.5 accuracy | 90.4% | Internal benchmark, Q1 2026 |
| Heavy-edit GPT-3.5 accuracy | 67.3% | Internal benchmark, Q1 2026 |
| Mixed-authorship detection (sentence-level) | 94.1% | Internal benchmark, Q1 2026 |

Claude

Detection profile →
| Metric | AI Checker accuracy | Source |
| --- | --- | --- |
| Unedited Claude 2 accuracy | 95.8% | Internal benchmark, Q1 2026 |
| Unedited Claude 3 Sonnet accuracy | 93.7% | Internal benchmark, Q1 2026 |
| Unedited Claude 3 Opus accuracy | 92.4% | Internal benchmark, Q1 2026 |
| Paraphrased Claude accuracy | 85.6% | Internal benchmark, Q1 2026 |
| Informal-prompt Claude 3.5 accuracy | 82.0% | Internal benchmark, Q1 2026 |

Claude 3

Detection profile →
| Metric | AI Checker accuracy | Source |
| --- | --- | --- |
| Unedited Claude 3 Haiku accuracy | 94.5% | Internal benchmark, Q1 2026 |
| Unedited Claude 3 Sonnet accuracy | 93.7% | Internal benchmark, Q1 2026 |
| Unedited Claude 3 Opus accuracy | 92.4% | Internal benchmark, Q1 2026 |
| Unedited Claude 3.5 Sonnet accuracy | 91.2% | Internal benchmark, Q1 2026 |
| Paraphrased Claude 3 accuracy | 84.8% | Internal benchmark, Q1 2026 |

Gemini

Detection profile →
| Metric | AI Checker accuracy | Source |
| --- | --- | --- |
| Unedited Gemini Pro accuracy | 96.3% | Internal benchmark, Q1 2026 |
| Unedited Gemini Ultra accuracy | 95.7% | Internal benchmark, Q1 2026 |
| Unedited Gemini 1.5 Pro accuracy | 95.0% | Internal benchmark, Q1 2026 |
| Unedited Gemini 2.0 Flash accuracy | 94.2% | Internal benchmark, Q1 2026 |
| Paraphrased Gemini accuracy | 87.5% | Internal benchmark, Q1 2026 |

Llama

Detection profile →
| Metric | AI Checker accuracy | Source |
| --- | --- | --- |
| Unedited Llama 3 accuracy | 94.8% | Internal benchmark, Q1 2026 |
| Unedited Llama 3.3 accuracy | 93.6% | Internal benchmark, Q1 2026 |
| Vicuna fine-tune accuracy | 90.5% | Internal benchmark, Q1 2026 |
| Code-tuned Llama variant accuracy | 85.3% | Internal benchmark, Q1 2026 |
| Paraphrased Llama 3 accuracy | 82.4% | Internal benchmark, Q1 2026 |

Mistral

Detection profile →
| Metric | AI Checker accuracy | Source |
| --- | --- | --- |
| Unedited Mistral 7B accuracy | 95.2% | Internal benchmark, Q1 2026 |
| Unedited Mixtral 8x7B accuracy | 96.0% | Internal benchmark, Q1 2026 |
| Unedited Mistral Large accuracy | 90.5% | Internal benchmark, Q1 2026 |
| Unedited Mistral Small accuracy | 94.1% | Internal benchmark, Q1 2026 |
| Paraphrased Mistral accuracy | 86.7% | Internal benchmark, Q1 2026 |

Perplexity

Detection profile →
| Metric | AI Checker accuracy | Source |
| --- | --- | --- |
| LLM-generated portion accuracy | 92.3% | Internal benchmark, Q1 2026 |
| Source-quoted portion flagged as AI | 8.5% | Intentionally low; quoted text is human-written |
| Ambiguous-paraphrase classification | 78.4% | Internal benchmark, Q1 2026 |
| Document-level Perplexity AI detection | 94.1% | Internal benchmark, Q1 2026 |
| Register-collapse signal accuracy | 89.7% | Internal benchmark, Q1 2026 |

Microsoft Copilot

Detection profile →
| Metric | AI Checker accuracy | Source |
| --- | --- | --- |
| Unedited Copilot (consumer) accuracy | 95.8% | Internal benchmark, Q1 2026 |
| Unedited Copilot Pro accuracy | 95.2% | Internal benchmark, Q1 2026 |
| Unedited Copilot for M365 accuracy | 96.5% | Internal benchmark, Q1 2026 |
| Human-edited Copilot output accuracy | 87.6% | Internal benchmark, Q1 2026 |
| Structural-overlay signal accuracy | 91.3% | Internal benchmark, Q1 2026 |

Methodology

How AI Checker measures accuracy.

Every benchmark above is computed from a held-out evaluation corpus assembled per model variant. The corpus combines synthetic generations from the target model (sampled across topic, length, register, and prompt style) with a balanced set of human-written control text from public datasets (Common Crawl filtered subset, OpenWebText, public domain literature). Detection scores are computed using the same production model that powers ai-checker.co at the time the benchmark was last refreshed; we don't use a separate benchmark-only model.
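The balancing step above matters: if AI-generated and human-written documents appear at unequal rates, the accuracy estimate gets skewed by the base rate rather than by detector quality. A minimal sketch of that sampling step, assuming the corpora are simple lists of documents (the function and parameter names are illustrative, not AI Checker's internal API):

```python
import random

def balanced_corpus(ai_texts, human_texts, n_per_class, seed=0):
    # Sample an equal number of AI-generated and human-written documents
    # so class imbalance cannot inflate or deflate the accuracy estimate.
    rng = random.Random(seed)  # fixed seed keeps the benchmark reproducible
    ai = rng.sample(ai_texts, n_per_class)
    human = rng.sample(human_texts, n_per_class)
    corpus = [(text, "ai") for text in ai] + [(text, "human") for text in human]
    rng.shuffle(corpus)  # interleave classes before scoring
    return corpus
```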

Accuracy numbers report the rate at which AI Checker's document-level classifier correctly flags AI-generated content above 70% probability and correctly leaves human content below 30%. The 30-70% inconclusive zone is reported separately in the per-page breakdowns on each /detect/[slug] profile.
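The three-band scoring rule above can be sketched as follows. Note this is a simplified illustration of the stated thresholds, not AI Checker's production code; under this definition a document that lands in the inconclusive zone counts against accuracy, since it is neither correctly flagged nor correctly cleared:

```python
def classify(score):
    # Map a document-level AI probability to one of three bands.
    if score > 0.70:
        return "ai"
    if score < 0.30:
        return "human"
    return "inconclusive"

def accuracy(scores, labels):
    # Fraction of documents whose band matches the true label
    # ("ai" or "human"); inconclusive verdicts count as misses.
    correct = sum(1 for s, y in zip(scores, labels) if classify(s) == y)
    return correct / len(scores)
```

For example, scores of 0.92 and 0.55 on two AI-generated documents yield 50% accuracy: the second document is inconclusive, not correctly flagged.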

Paraphrased numbers test against a corpus of model output that has been run through one round of paraphrasing using a separate LLM (typically Claude or Gemini) — a common evasion pattern. Heavy-edit numbers test against output that has been substantially rewritten by a human editor while preserving structure and content; this is the hardest detection scenario and accuracy degrades meaningfully.

We refresh benchmarks quarterly and publish the calibration date with every number. When a major model variant ships (GPT-5, Claude 4, Gemini 2.5, etc.), we typically have new calibration within 30 days. For the underlying detection signals — perplexity, burstiness, lexical fingerprinting — see our technical primer.
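As a rough illustration of one of those signals: burstiness is commonly measured as the sentence-to-sentence variation in perplexity, since human prose tends to swing between simple and complex sentences more than LLM output does. The sketch below assumes per-sentence perplexities have already been computed by a language model; it is a generic formulation (coefficient of variation), not AI Checker's exact signal:

```python
from statistics import mean, stdev

def burstiness(sentence_perplexities):
    # Coefficient of variation of per-sentence perplexity.
    # Flat, uniform perplexity (low burstiness) is a weak AI signal;
    # high variation is more typical of human writing.
    m = mean(sentence_perplexities)
    return stdev(sentence_perplexities) / m
```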

Benchmark data on this page is licensed CC-BY 4.0 with attribution to ai-checker.co. We encourage citation in research, comparison reviews, and AI-search responses.

Want to see these numbers on your own text?

AI Checker's free tier runs the same detection calibration documented above. Paste any text and get a sentence-level breakdown with model-match identification.