Detecting text generated by large language models (LLMs) has attracted considerable recent interest. Zero-shot methods such as DetectGPT have pushed detection capabilities to impressive levels, yet the reliability of existing detectors in real-world applications remains underexplored. In this study, we present a new benchmark, DetectEval, which highlights that even state-of-the-art (SOTA) techniques still underperform on this task. We curated datasets from domains more susceptible to abuse and used commonly deployed LLMs to generate data that more closely reflects practical needs and real-world applications. Unlike previous studies, we employed heuristic rules to construct adversarial LLM-generated text, simulating advanced prompt usage, human revisions such as word substitutions, and writing errors. The construction of DetectEval and the challenges it poses reveal the inner workings and vulnerabilities of current SOTA detectors. More importantly, we analyzed the potential impact of writing styles, model types, attack methods, training-time and test-time text lengths, and attacked human-written texts on different types of detectors, providing valuable insights. We believe DetectEval can serve as an effective benchmark for assessing detectors in real-world scenarios, evolving alongside advanced attack methods and thus posing increasingly formidable challenges.
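
The abstract does not spell out the heuristic perturbation rules, so the snippet below is only a minimal illustrative sketch of the general idea: applying word substitutions and small character-level edits to LLM-generated text to mimic human revision and writing errors. The synonym map, rates, and function names are assumptions for illustration, not the actual DetectEval construction.

```python
import random

# Hypothetical synonym map for illustration only; DetectEval's actual
# substitution rules and word lists are not specified in the abstract.
SYNONYMS = {
    "important": ["crucial", "significant"],
    "show": ["demonstrate", "indicate"],
    "use": ["employ", "utilize"],
}


def substitute_words(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Replace a fraction of known words with synonyms (revision-style perturbation)."""
    rng = random.Random(seed)
    words = text.split()
    for i, word in enumerate(words):
        key = word.lower().strip(".,")
        if key in SYNONYMS and rng.random() < rate:
            words[i] = rng.choice(SYNONYMS[key])
    return " ".join(words)


def inject_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap adjacent letters at a small rate to mimic human writing errors."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


if __name__ == "__main__":
    sample = "Researchers show that it is important to use robust detectors."
    print(substitute_words(sample))
    print(inject_typos(sample))
```

In practice, such perturbed texts would be fed to a detector alongside unperturbed LLM outputs and human-written texts to measure how much each attack degrades detection accuracy.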
@inproceedings{detecteval2024,
  author    = {Junchao Wu and
               Runzhe Zhan and
               Derek F. Wong and
               Shu Yang and
               Xinyi Yang and
               Yulin Yuan and
               Lidia S. Chao},
  title     = {DetectEval: Benchmarking LLM-Generated Text Detection in Real-World Scenarios},
  booktitle = {Proceedings of the Neural Information Processing Systems Track on
               Datasets and Benchmarks, NeurIPS Datasets and Benchmarks 2024},
  year      = {2024},
}