DetectEval: Benchmarking LLM-Generated Text Detection in Real-World Scenarios

NLP2CT Lab, Department of Computer and Information Science
University of Macau
nlp2ct.{junchao,runzhe,shuyang,xinyi}@gmail.com, {derekfw,yuanyl,lidiasc}@um.edu.mo

*Corresponding Author

Introduction

Detecting text generated by large language models (LLMs) has attracted great recent interest. With zero-shot methods like DetectGPT, detection capabilities have reached impressive levels. However, the reliability of existing detectors in real-world applications remains underexplored. In this study, we present a new benchmark, DetectEval, to highlight that even state-of-the-art (SOTA) techniques still underperform on this task. We curated datasets from domains more susceptible to abuse, using commonly used LLMs to create data that more closely aligns with practical needs and real-world applications. Unlike previous studies, we employed heuristic rules to generate adversarial LLM-generated text, simulating advanced prompt usage, human revisions such as word substitutions, and writing errors. Our construction of DetectEval and the challenges it poses reveal the inner workings and vulnerabilities of current SOTA detectors. More importantly, we analyzed the potential impact of writing styles, model types, attack methods, training-time and test-time text lengths, and attacked human-written texts on different types of detectors, providing valuable insights. We believe DetectEval can serve as an effective benchmark for assessing detectors in real-world scenarios, evolving alongside advanced attack methods and thus posing increasingly formidable challenges.

The Framework of DetectEval

This figure illustrates the overall framework of DetectEval. Human-written samples are sourced from high-risk, abuse-prone domains, and commonly used, powerful LLMs are employed to create LLM-generated samples. All samples undergo sophisticated attacks designed to mimic real-world scenarios and are diversified using an n-sentences data-splitting technique to enhance the benchmark's variety. DetectEval comprises four distinct tasks that evaluate the comprehensive detection abilities and robustness of detectors: (1) Robustness Assessment, (2) Generalization Assessment, (3) Varying Text Length Assessment, and (4) Real-World Human Writing Assessment.
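
As a concrete illustration of the n-sentences data-splitting step, the snippet below shows one way such a splitter could be implemented. This is a minimal sketch under our own assumptions (the function name and the use of NLTK's sentence tokenizer), not the exact DetectEval pipeline.

    # Minimal sketch of an n-sentences data-splitting step (illustrative only;
    # the actual DetectEval pipeline may differ).
    import nltk
    from nltk.tokenize import sent_tokenize

    nltk.download("punkt", quiet=True)  # sentence-tokenizer model, downloaded once

    def split_into_n_sentence_chunks(text: str, n: int = 3) -> list[str]:
        """Split a document into consecutive chunks of n sentences each."""
        sentences = sent_tokenize(text)
        return [" ".join(sentences[i:i + n]) for i in range(0, len(sentences), n)]

    # One long sample becomes several shorter, more varied samples.
    doc = "First sentence. Second sentence. Third sentence. Fourth sentence."
    print(split_into_n_sentence_chunks(doc, n=2))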

Task 1: Robustness Assessment (Multi-Domain, Multi-LLM, and Multi-Attack). This task evaluates the foundational performance of detectors across different domains, generators, and attack strategies, specifically assessing their robustness in various real-world scenarios. We use the average performance score as the metric for robustness.
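
As a rough sketch of this aggregation, assuming one AUROC value per (domain, generator, attack) combination and illustrative field names:

    # Sketch of the Task 1 aggregation: a detector's robustness score is the
    # average AUROC over every (domain, generator, attack) evaluation cell.
    from statistics import mean

    def robustness_score(cell_results: list[dict]) -> float:
        # cell_results: one entry per combination, e.g.
        # {"domain": "News", "llm": "GPT-3.5-turbo", "attack": "Paraphrase", "auroc": 0.87}
        return mean(r["auroc"] for r in cell_results)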

Task 2: Generalization Assessment. This task assesses the adaptability of detectors from three perspectives: Domain, LLM, and Attack. Unlike Task 1, this task emphasizes the detector's ability to handle out-of-distribution samples within each category. For example, we evaluate the performance of detectors trained on texts from one domain when applied to texts from different domains. The same approach is used to assess adaptability across different LLMs and attack types.
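
The loop below sketches this protocol for domains; the same structure applies to LLMs and attacks. Note that train_detector and evaluate_auroc are hypothetical stand-ins for whichever detector implementation is being assessed, not functions provided by DetectEval.

    # Sketch of the Task 2 protocol: train on one category value, then test on
    # every other value of that category (shown here for domains).
    def cross_domain_generalization(data_by_domain, train_detector, evaluate_auroc):
        scores = {}
        for train_domain, train_data in data_by_domain.items():
            detector = train_detector(train_data)
            for test_domain, test_data in data_by_domain.items():
                if test_domain == train_domain:
                    continue  # keep only out-of-distribution train/test pairs
                scores[(train_domain, test_domain)] = evaluate_auroc(detector, test_data)
        return scores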

Task 3: Varying Text Length Assessment. This task evaluates how text length affects the performance of detectors, considering both training-time and test-time samples. In the training-time setting, detectors are trained on specified length intervals and then tested on the pivotal interval. In the test-time setting, detectors trained on the pivotal interval are tested with samples of varying lengths. This approach offers a comprehensive understanding of how text length influences detection capabilities.
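
A simple way to realize such length intervals is to bucket samples by word count, as sketched below. The interval edges here are illustrative assumptions; the paper defines the exact splits and the pivotal interval.

    # Sketch of bucketing samples into word-length intervals for Task 3.
    def length_bucket(text: str, edges=(0, 20, 40, 60, 80, 120, 180, 10_000)) -> str:
        """Return the word-length interval a sample falls into."""
        n_words = len(text.split())
        for lo, hi in zip(edges, edges[1:]):
            if lo <= n_words < hi:
                return f"{lo}-{hi} words"
        return f">={edges[-1]} words"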

Task 4: Real-World Human Writing Assessment. This task evaluates the impact of real-world human-written texts on the performance of detectors. As part of this innovative assessment, we simulate attacks on human-written texts to replicate the challenges these texts might face in real-world scenarios, such as spelling errors.
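
The snippet below sketches one way to inject spelling errors into human-written text for this task. The 5% error rate and the swap/drop operations are illustrative choices, not the benchmark's exact perturbation rules.

    # Sketch of a character-level perturbation that injects spelling errors,
    # mimicking the Task 4 setup (illustrative parameters and operations).
    import random

    def add_spelling_errors(text: str, error_rate: float = 0.05, seed: int = 0) -> str:
        rng = random.Random(seed)
        chars = list(text)
        out, i = [], 0
        while i < len(chars):
            if chars[i].isalpha() and rng.random() < error_rate:
                op = rng.choice(["swap", "drop"])
                if op == "swap" and i + 1 < len(chars) and chars[i + 1].isalpha():
                    out.extend([chars[i + 1], chars[i]])  # transpose adjacent letters
                    i += 2
                    continue
                if op == "drop":
                    i += 1  # delete this character
                    continue
            out.append(chars[i])
            i += 1
        return "".join(out)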

Comparison with Existing Benchmarks

✔: the benchmark evaluates this scenario. △: the scenario has been studied but is not included in the evaluation. ⚪: a similar scenario exists, but it is not fully based on real-world usage.

Previous and currently popular detection benchmarks, such as TuringBench (Uchendu et al., 2021), MGTBench (He et al., 2023), MULTITuDE (Macko et al., 2023), MAGE (Li et al., 2024), and M4 (Wang et al., 2024), have primarily focused on evaluating detectors' performance across various domains, generative models, and languages by constructing idealized test data. However, they have overlooked the assessment of detectors' capabilities in common scenarios encountered in practical applications (Wu et al., 2023).

Leaderboard

A higher average score indicates greater practical utility of the detectors. The leaderboard results reveal that supervised detectors consistently outperform zero-shot detectors, proving to be more effective and robust. Among the zero-shot detection methods, Log-Rank achieves the best performance, followed by LRR, Log-Likelihood, and Fast-DetectGPT. Additionally, our analysis highlights the unreliability of advanced detectors like DetectGPT and NPR when applied in real-world contexts.
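
For reference, Log-Likelihood and Log-Rank score a passage with a proxy language model: the average token log-probability and the (negative) average log of each token's rank under that model. The sketch below uses GPT-2 via Hugging Face transformers purely for illustration; DetectEval's scoring models and normalization may differ.

    # Minimal sketch of zero-shot Log-Likelihood and Log-Rank scoring with a
    # proxy LM (GPT-2 chosen here only for illustration).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    @torch.no_grad()
    def loglik_and_logrank(text: str) -> tuple[float, float]:
        ids = tok(text, return_tensors="pt").input_ids
        logits = model(ids).logits[:, :-1]        # predict token t+1 from its prefix
        targets = ids[:, 1:]
        logprobs = torch.log_softmax(logits, dim=-1)
        tok_logprob = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        # rank of each observed token among the vocabulary (1 = most likely)
        ranks = (logprobs > tok_logprob.unsqueeze(-1)).sum(-1) + 1
        return tok_logprob.mean().item(), -torch.log(ranks.float()).mean().item()

    # Higher scores (more probable tokens, lower ranks) tend to indicate LLM-generated text.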

Evaluation

Robustness Performance

Persistent Challenges in SOTA Detectors. These results highlight the challenges posed by our benchmark and explain why current SOTA detectors for LLM-generated text have not seen widespread adoption. Specifically, zero-shot detectors struggle against powerful LLMs, achieving an average AUROC of only 72.80% on texts generated with Direct Prompting, with no method exceeding a 90% AUROC. The performance of these detectors degrades substantially under meticulously designed attacks that mimic real-world scenarios, showing average decreases of 2.59% in Prompt Attacks, 17.50% in Paraphrase Attacks, 38.48% in Perturbation Attacks, and 20.29% in Data Mixing scenarios. In contrast, supervised methods demonstrate impressive effectiveness, achieving an average AUROC of 99.42% on data generated with Direct Prompting and maintaining robustness against all real-world scenarios.

Effectiveness of zero-shot detectors varies with the stylistic nature of domain data. Our results indicate that texts with a more formal style present greater challenges for detection. Detectors generally perform better on informal data, such as that from Social Media, but their effectiveness decreases markedly in more formal settings like News Writing. Interestingly, this decrease in performance is even more pronounced in advanced detectors such as Fast-DetectGPT (Bao et al., 2023). Despite this variability, supervised classifiers demonstrate consistent reliability across various domains. This finding aligns with insights provided by Li et al. (2023a), emphasizing the robustness of supervised classifiers in diverse textual environments.

Differences in the statistical patterns of LLMs pose significant challenges to detector performance. Our experiments with various generative models revealed an intriguing phenomenon: nearly all zero-shot LLM-generated text detectors exhibit a marked decline in performance when processing texts generated by Claude-instant. This suggests that detector performance is influenced by the generative model employed and can degrade when faced with differing statistical patterns. The distinct decision thresholds for Claude-instant, compared to other models, further support this observation. We hypothesize that these differences arise from variations in the models' data, architectures, and training methods, though verifying this is difficult due to the opaque nature of black-box models. Moreover, supervised detectors are more susceptible to the impact of the generative model type than to the domain type, particularly at larger model sizes. For example, Rob-Large achieved an AUROC of only 86.72% and an F1 score of 76.17% on texts generated by Llama-2-70b, whereas X-Rob-Large achieved an AUROC of 91.67% and an F1 score of 82.24% on texts generated by Claude-instant.

Adversarial perturbation attacks represent a significant threat to zero-shot detectors. Our findings indicate that these attacks severely diminish the efficacy of zero-shot detectors, reducing their performance to an average AUROC of 34.32%, less than half of their performance under paraphrase attacks. Additionally, we observe that data mixing introduces a new challenging scenario, yielding performance comparable to that under paraphrase attacks, with detectors achieving an average AUROC of 52.51%. Furthermore, while prompt attacks, such as few-shot prompts, can generate higher-quality text more aligned with human preferences, their impact on zero-shot detectors is minimal. However, refining LLM-generated texts via human-written prompts still challenges detectors, decreasing their effectiveness by an average of 9.15%. This finding suggests that prompt-based methods continue to hold potential for effectively compromising detector performance. In contrast, supervised detectors consistently maintain robust performance across various attack types, demonstrating their potential for practical applications.

Detailed Performance of Various Attacks

Generalization Performance

In the real world, there is a clear need for detectors that can effectively adapt to different types of text. In this paper, we further investigate this need, focusing specifically on the relationship between the distributions of detectors' training and test data. We evaluated the generalization of representative detectors: LRR (Su et al., 2023), Fast-DetectGPT (Bao et al., 2023), and the RoB-Base classifier (Park et al., 2021). Experimental results indicate that detectors trained on data with less formal styles, such as Creative Writing and Social Media, exhibit stronger generalization: their overall performance is 10% better than that of detectors trained on data with more formal styles, such as Academic Writing and News Writing. The variations in the statistical patterns of generative models significantly impact the adaptability of detectors. Detectors trained on texts generated by models with similar statistical patterns, such as GPT-3.5-turbo, PaLM-2-bison, and Llama-2-70b, transfer well to one another, with the exception of texts generated by Claude-instant.


Impact of Text Length

Shorter training samples for stronger detectors. We evaluated the performance of detectors trained on datasets with varying text lengths on the test set within the pivotal length interval. The experimental results revealed a golden length interval of 60-80 words, in which training texts consistently yielded strong detection performance across all detectors. However, as the length of the training texts increased, the performance of all zero-shot detectors gradually declined. This suggests that zero-shot detectors trained on shorter texts may be more effective than those trained on longer texts. In contrast, supervised detectors maintained consistent performance both within the golden length interval and at longer text lengths.

Longer samples for better zero-shot detection. Similarly, we trained a detector using data from the pivotal length interval and evaluated its performance on test sets with varying text lengths. The experimental results indicate that as the word length of the test texts increased, the performance of the zero-shot detectors gradually improved. This suggests a positive correlation between the performance of zero-shot detectors and the length of the texts being detected. In contrast, supervised methods exhibited a rapid increase in performance up to the pivotal length interval, followed by a slight decline.

Impact of Real-World Human Writing Scenarios

We investigated a critical question in real-world text detection: How do human-driven revisions or writing errors impact the performance of detectors? To simulate these real-world scenarios, we introduced paraphrasing attacks to mimic text revisions and incorporated spelling errors through adversarial perturbation attacks. Additionally, we mixed LLM-generated sentences with human-written content to simulate AI-assisted writing scenarios. Experimental results indicate that attacks on human-written texts yield markedly different outcomes compared to those on LLM-generated texts. Specifically, paraphrasing attacks on human-written texts effectively confused zero-shot detectors, reducing the AUROC score by an average of 4.77%. In contrast, text mixing had a minimal impact on the performance of zero-shot detectors, with only a slight decline of 4.48% in AUROC. This is in stark contrast to the significant decline of 20.29% in AUROC when human-written texts were mixed with LLM-generated texts. The resilience of human-written texts to such mixing may be attributed to their inherent complexity, which likely makes it challenging for zero-shot detectors to discern the inclusion of LLM-generated content. Interestingly, perturbation attacks on human-written texts appeared to enhance the discernment capabilities of zero-shot detectors, resulting in an average increase of 11.05% in AUROC. Similar trends were observed with supervised detectors. This suggests that human-written texts may inherently contain more adversarial features, which detectors pick up and use for identification; adversarial perturbations can further emphasize these distinctions, leading to improved performance.
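
The mixing setup can be pictured as splicing LLM-generated sentences into a human-written passage (or vice versa), as sketched below. The mixing ratio and random insertion positions are illustrative choices, not the benchmark's exact procedure.

    # Sketch of sentence-level data mixing to simulate AI-assisted writing.
    import random
    import nltk
    from nltk.tokenize import sent_tokenize

    nltk.download("punkt", quiet=True)

    def mix_texts(base_text: str, donor_text: str, ratio: float = 0.3, seed: int = 0) -> str:
        """Insert a fraction of donor sentences at random positions in the base text."""
        rng = random.Random(seed)
        base = sent_tokenize(base_text)
        donor = sent_tokenize(donor_text)
        n_insert = max(1, int(len(base) * ratio))
        for sent in rng.sample(donor, min(n_insert, len(donor))):
            base.insert(rng.randrange(len(base) + 1), sent)
        return " ".join(base)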

Statistics of DetectEval

Benchmark Statistics

Dataset Statistics

BibTeX


        @inproceedings{detecteval2024,
          author       = {Junchao Wu and
                          Runzhe Zhan and
                          Derek F. Wong and
                          Shu Yang and
                          Xinyi Yang and
                          Yulin Yuan and
                          Lidia S. Chao},
          title        = {DetectEval: Benchmarking LLM-Generated Text Detection in Real-World Scenarios},
          booktitle    = {Proceedings of the Neural Information Processing Systems Track on
                          Datasets and Benchmarks, NeurIPS Datasets and Benchmarks 2024},
          year         = {2024},
        }