Every benchmark above is computed from a held-out evaluation corpus assembled per model variant. The corpus combines synthetic generations from the target model (sampled across topic, length, register, and prompt style) with a balanced set of human-written control text from public datasets (Common Crawl filtered subset, OpenWebText, public domain literature). Detection scores are computed using the same production model that powers ai-checker.co at the time the benchmark was last refreshed; we don't use a separate benchmark-only model.
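For concreteness, here is a minimal sketch of how a balanced held-out corpus of this shape could be assembled. The `Sample` dataclass, `assemble_corpus` helper, and field names are illustrative assumptions, not the production pipeline.

```python
import random
from dataclasses import dataclass

@dataclass
class Sample:
    text: str
    label: str    # "ai" or "human"
    source: str   # generating model, or control dataset for human text

def assemble_corpus(ai_samples: list[Sample],
                    human_samples: list[Sample],
                    seed: int = 0) -> list[Sample]:
    """Draw equal counts of AI and human samples, then shuffle into one corpus."""
    rng = random.Random(seed)
    n = min(len(ai_samples), len(human_samples))
    corpus = rng.sample(ai_samples, n) + rng.sample(human_samples, n)
    rng.shuffle(corpus)
    return corpus
```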
Accuracy numbers report the rate at which AI Checker's document-level classifier correctly scores AI-generated content above 70% probability and human-written content below 30%. Documents falling in the 30-70% inconclusive zone are reported separately in the per-page breakdowns on each /detect/[slug] profile.
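One way to read that definition in code is sketched below. Treating inconclusive documents as a separate bucket, excluded from the accuracy denominator, is an assumption consistent with the separate reporting described above, and the `bucket` and `accuracy` helpers are hypothetical names.

```python
def bucket(score: float) -> str:
    """Map a document-level AI probability to a decision bucket."""
    if score > 0.70:
        return "ai"
    if score < 0.30:
        return "human"
    return "inconclusive"

def accuracy(scored: list[tuple[float, str]]) -> dict:
    """scored: (ai_probability, true_label) pairs, true_label in {"ai", "human"}."""
    correct = decided = inconclusive = 0
    for score, truth in scored:
        decision = bucket(score)
        if decision == "inconclusive":
            inconclusive += 1
            continue
        decided += 1
        correct += (decision == truth)
    return {
        "accuracy": correct / decided if decided else 0.0,
        "inconclusive_rate": inconclusive / len(scored) if scored else 0.0,
    }
```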
Paraphrased numbers test against a corpus of model output that has been run through one round of paraphrasing by a separate LLM (typically Claude or Gemini), a common evasion pattern. Heavy-edit numbers test against output that has been substantially rewritten by a human editor while preserving its structure and content; this is the hardest detection scenario, and accuracy degrades meaningfully.
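As a rough illustration, constructing the paraphrased corpus could look like the sketch below. `paraphrase_once`, the prompt wording, and the `paraphraser` callable are placeholders; no specific Claude or Gemini API is implied.

```python
from typing import Callable

def paraphrase_once(text: str, paraphraser: Callable[[str], str]) -> str:
    """Apply exactly one paraphrasing pass with a separate LLM.
    `paraphraser` stands in for whatever client wraps the paraphrasing model."""
    prompt = f"Paraphrase the following text, preserving its meaning:\n\n{text}"
    return paraphraser(prompt)

def build_paraphrased_corpus(ai_texts: list[str],
                             paraphraser: Callable[[str], str]) -> list[str]:
    """Run each AI-generated document through a single paraphrasing round."""
    return [paraphrase_once(t, paraphraser) for t in ai_texts]
```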
We refresh benchmarks quarterly and publish the calibration date alongside every number. When a major model variant ships (GPT-5, Claude 4, Gemini 2.5, etc.), we typically complete a new calibration within 30 days. For the underlying detection signals (perplexity, burstiness, lexical fingerprinting), see our technical primer.
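As a rough illustration of what two of those signals measure (the primer is the authoritative source), the sketch below computes perplexity from token log-probabilities and treats burstiness as the variance of per-sentence perplexity. Both definitions are simplifications assumed for this example, not the production formulas.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(negative mean log-probability) over a text's tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def burstiness(sentence_logprobs: list[list[float]]) -> float:
    """One common proxy: variance of per-sentence perplexity.
    Uniformly flat sentences (low variance) tend to read as machine-generated."""
    ppls = [perplexity(lp) for lp in sentence_logprobs]
    mean = sum(ppls) / len(ppls)
    return sum((p - mean) ** 2 for p in ppls) / len(ppls)
```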
Benchmark data on this page is licensed CC-BY 4.0 with attribution to ai-checker.co. We encourage citation in research, comparison reviews, and AI-search responses.