Revisiting (and Rethinking) the Efficacy of Large Language Models for Security Assessments
Galli, D., Artioli, A., Andreolini, M., Stabili, D., Apruzzese, G., Marchetti, M., European Symposium on Research In Computer Security, 2026 Conference
Oneliner: We propose an evaluation framework for measuring the performance of LLMs in security assessments.
Abstract. Large Language Models (LLMs) are increasingly used in cybersecurity. According to prior research, LLMs can ably solve tasks such as exploit generation and vulnerability identification. These results suggest that LLMs may be able to automate the process of security assessment in the real world. In this work, we challenge this hypothesis. We assert that the capabilities of LLMs for security assessments are still unclear, and we seek to shed new light on this aspect.
First, through a systematic literature review, we factually support our fundamental assertion. Specifically, we observe that prior works define ambiguous (or undisclosed) prompts, conduct unbounded experiments, or rely on coarse-grained evaluation metrics. Altogether, these findings indicate the lack of a consistent evaluation protocol to measure the capabilities of LLMs in security assessments. Then, to fill this gap, we propose a standardized evaluation framework, aimed at researchers, that enables them to better gauge the pros and cons of LLMs in security assessments. Next, we demonstrate the applicability of our framework by carrying out an extensive and fully reproducible experimental campaign, encompassing 21 LLMs tested across 87 capture-the-flag challenges spanning 4 educational systems. Finally, we distill our empirical findings into three guidelines that can shape future work on LLMs for security assessments.
