A Hierarchical Two-Stage Adversarial Training Framework for Defending LLMs Against Jailbreak Attacks
Publication Date
Fall 2025
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Saptarshi Sengupta
Second Advisor
Sayma Akther
Third Advisor
Gopal Nath
Keywords
Large Language Models, Adversarial Training, Jailbreak Attacks, Prompt-level Attacks, Curriculum Learning, LoRA
Abstract
Large Language Models (LLMs) underpin many AI tools people rely on today, yet they can be tricked into ignoring their safety rules through jailbreak prompts. These jailbreaks fall into two main classes: token-level attacks, which perturb the exact wording or token sequence to push the model into unsafe behavior, and semantic-level attacks, which use crafted phrasing and indirect instructions to slip past the LLM's content filters. Existing defenses typically target only one of these classes, leaving the model vulnerable to the other. We propose a hierarchical two-stage adversarial training framework that systematically addresses both vulnerability classes. The first stage builds token- and word-level robustness through adversarial training with curriculum learning, applying gradient-based perturbations to harden models against attacks such as Greedy Coordinate Gradient (GCG). The second stage extends protection to semantic-level threats via Adaptive Adversarial Prompt Refinement (AAPR), which synthesizes diverse semantic adversarial training samples. Attack Success Rate (ASR) drops for all evaluated models once the two-stage defense is applied, relative to their undefended baselines. Comparative evaluation against established baselines (Self-Reminder, Unlearning, SmoothLLM) demonstrates that our hierarchical approach surpasses all alternatives across attack categories. The framework achieves effective defense while requiring LoRA fine-tuning on only 0.1% of the model parameters.
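To make the Stage 1 setup concrete, the sketch below shows one way the abstract's ingredients could fit together: LoRA adapters so that only a small fraction of parameters train, plus a gradient-based perturbation of the input embeddings as a stand-in for token-level attacks like GCG. This is a minimal illustration under stated assumptions, not the author's implementation; the base model name, the FGSM-style perturbation, and the epsilon curriculum schedule are illustrative placeholders.

    # Minimal sketch (not the thesis code): Stage 1 token-level adversarial
    # fine-tuning with LoRA, assuming a HuggingFace causal LM and the `peft`
    # library. The perturbation here is a simple FGSM-style embedding attack,
    # used as a generic example of "gradient-based perturbations".
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # LoRA keeps the trainable-parameter count to a small fraction of the
    # full model (on the order of the 0.1% mentioned in the abstract).
    lora_cfg = LoraConfig(r=8, lora_alpha=16,
                          target_modules=["q_proj", "v_proj"],
                          lora_dropout=0.05, task_type="CAUSAL_LM")
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()

    def adversarial_loss(batch, epsilon):
        """Perturb input embeddings along the loss gradient, then compute
        the training loss on the perturbed inputs (hypothetical helper)."""
        embeds = model.get_input_embeddings()(batch["input_ids"]).detach()
        embeds.requires_grad_(True)
        out = model(inputs_embeds=embeds,
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"])
        out.loss.backward()
        adv_embeds = (embeds + epsilon * embeds.grad.sign()).detach()
        model.zero_grad()  # discard gradients from the attack pass
        adv_out = model(inputs_embeds=adv_embeds,
                        attention_mask=batch["attention_mask"],
                        labels=batch["labels"])
        return adv_out.loss

    # Curriculum learning: ramp the perturbation strength up over training
    # so the model faces progressively harder adversarial examples.
    epsilons = torch.linspace(0.01, 0.1, steps=10)

A training loop would iterate over safety-tuning batches, look up the current epsilon from the curriculum schedule, and backpropagate adversarial_loss through the LoRA adapters only; the frozen base weights stay untouched, which is what keeps the defense cheap to train and easy to merge or remove.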
Recommended Citation
Reddy Shada, Revanth, "A Hierarchical Two-Stage Adversarial Training Framework for Defending LLMs Against Jailbreak Attacks" (2025). Master's Projects. 1601.
DOI: https://doi.org/10.31979/etd.65bj-sgrr
https://scholarworks.sjsu.edu/etd_projects/1601