Publication Date

Fall 2025

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

Saptarshi Sengupta

Second Advisor

Sayma Akther

Third Advisor

Gopal Nath

Keywords

Large Language Models, Adversarial Training, Jailbreak Attacks, Prompt-level Attacks, Curriculum Learning, LoRA

Abstract

Large Language Models (LLMs) power many AI tools people rely on today, yet they can be tricked into ignoring their safety rules through jailbreak prompts. These jailbreaks fall mainly into two classes: token-level attacks, which tweak the exact wording or token sequence to push the model into unsafe behavior, and prompt-level (semantic) attacks, which use clever phrasing and indirect instructions to slip past the LLM's content filters. Existing defenses typically target only one of these classes, leaving the model vulnerable to the other. To close this gap, we propose a hierarchical two-stage adversarial training framework that systematically addresses both vulnerability classes. The first stage builds token-level robustness through adversarial training with curriculum learning, applying gradient-based perturbations to fortify models against attacks such as Greedy Coordinate Gradient (GCG). The second stage extends protection to semantic-level threats via Adaptive Adversarial Prompt Refinement (AAPR), which synthesizes diverse semantic adversarial training samples. With the two-stage defense applied, the Attack Success Rate drops for all evaluated models relative to their undefended counterparts. Comparative evaluation against established baselines (Self-Reminder, Unlearning, SmoothLLM) demonstrates that our hierarchical approach surpasses all alternatives across attack categories. The framework achieves effective defense while requiring LoRA fine-tuning of only 0.1% of the model parameters.
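To illustrate the parameter-efficiency claim, the sketch below shows how a LoRA adapter can be attached to a causal LLM with the Hugging Face peft library. This is a minimal, hedged example: the base model name, rank, and target modules are illustrative assumptions, not the configuration used in this project.

```python
# Illustrative sketch only: base model, rank, and target modules are
# assumptions, not this project's actual settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Low-rank adapters on the attention projections; with a small rank such
# as r=8, the trainable adapter weights amount to roughly 0.1% of the
# full parameter count, so only the adapters are updated during
# adversarial fine-tuning while the base weights stay frozen.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # reports the trainable fraction
```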

Available for download on Saturday, December 19, 2026
