A Hierarchical Two-Stage Adversarial Training Framework for Defending LLMs Against Jailbreak Attacks
Publication Date
Fall 2025
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Saptarshi Sengupta
Second Advisor
Sayma Akther
Third Advisor
Gopal Nath
Keywords
Large Language Models, Adversarial Training, Jailbreak Attacks, Prompt-level Attacks, Curriculum Learning, LoRA
Abstract
Large Language Models (LLMs) underpin many AI tools people rely on today, yet they can be tricked into ignoring their safety rules through jailbreak prompts. These jailbreaks fall into two main classes: token-level attacks, which perturb the exact wording or token sequence to push the model into unsafe behavior, and semantic-level attacks, which use crafted phrasing and indirect instructions to slip past the LLM's content filters. Existing defenses typically target only one of these classes, leaving the model vulnerable to the other. We propose a hierarchical two-stage adversarial training framework that systematically addresses both vulnerability classes. The first stage builds token- and word-level robustness through adversarial training with curriculum learning, applying gradient-based perturbations to harden models against attacks such as Greedy Coordinate Gradient (GCG). The second stage extends protection to semantic-level threats via Adaptive Adversarial Prompt Refinement (AAPR), which synthesizes diverse semantic adversarial training samples. Attack Success Rate (ASR) drops for all evaluated models once the two-stage defense is applied, relative to their undefended baselines. Comparative evaluation against established baselines (Self-Reminder, Unlearning, SmoothLLM) demonstrates that our hierarchical approach surpasses all alternatives across attack categories. The framework achieves effective defense while requiring LoRA fine-tuning on only 0.1% of the model parameters.
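To make the Stage 1 setup concrete, the sketch below shows one way the abstract's ingredients could fit together: LoRA adapters so that only a small fraction of parameters train, plus a gradient-based perturbation of the input embeddings as a stand-in for token-level attacks like GCG. This is a minimal illustration under stated assumptions, not the author's implementation; the base model name, the FGSM-style perturbation, and the epsilon curriculum schedule are illustrative placeholders.

    # Minimal sketch (not the thesis code): Stage 1 token-level adversarial
    # fine-tuning with LoRA, assuming a HuggingFace causal LM and the `peft`
    # library. The perturbation here is a simple FGSM-style embedding attack,
    # used as a generic example of "gradient-based perturbations".
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # LoRA keeps the trainable-parameter count to a small fraction of the
    # full model (on the order of the 0.1% mentioned in the abstract).
    lora_cfg = LoraConfig(r=8, lora_alpha=16,
                          target_modules=["q_proj", "v_proj"],
                          lora_dropout=0.05, task_type="CAUSAL_LM")
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()

    def adversarial_loss(batch, epsilon):
        """Perturb input embeddings along the loss gradient, then compute
        the training loss on the perturbed inputs (hypothetical helper)."""
        embeds = model.get_input_embeddings()(batch["input_ids"]).detach()
        embeds.requires_grad_(True)
        out = model(inputs_embeds=embeds,
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"])
        out.loss.backward()
        adv_embeds = (embeds + epsilon * embeds.grad.sign()).detach()
        model.zero_grad()  # discard gradients from the attack pass
        adv_out = model(inputs_embeds=adv_embeds,
                        attention_mask=batch["attention_mask"],
                        labels=batch["labels"])
        return adv_out.loss

    # Curriculum learning: ramp the perturbation strength up over training
    # so the model faces progressively harder adversarial examples.
    epsilons = torch.linspace(0.01, 0.1, steps=10)

A training loop would iterate over safety-tuning batches, look up the current epsilon from the curriculum schedule, and backpropagate adversarial_loss through the LoRA adapters only; the frozen base weights stay untouched, which is what keeps the defense cheap to train and easy to merge or remove.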
Recommended Citation
Reddy Shada, Revanth, "A Hierarchical Two-Stage Adversarial Training Framework for Defending LLMs Against Jailbreak Attacks" (2025). Master's Projects. 1601.
DOI: https://doi.org/10.31979/etd.65bj-sgrr
https://scholarworks.sjsu.edu/etd_projects/1601