International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 27
Published: August 2025
Authors: Ritu Kuklani, Gururaj Shinde, Varad Vishwarupe
Ritu Kuklani, Gururaj Shinde, Varad Vishwarupe. Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy. International Journal of Computer Applications. 187, 27 (August 2025), 38-43. DOI=10.5120/ijca2025925482
@article{ 10.5120/ijca2025925482,
  author = { Ritu Kuklani and Gururaj Shinde and Varad Vishwarupe },
  title = { Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy },
  journal = { International Journal of Computer Applications },
  year = { 2025 },
  volume = { 187 },
  number = { 27 },
  pages = { 38-43 },
  doi = { 10.5120/ijca2025925482 },
  publisher = { Foundation of Computer Science (FCS), NY, USA }
}
%0 Journal Article
%D 2025
%A Ritu Kuklani
%A Gururaj Shinde
%A Varad Vishwarupe
%T Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy
%J International Journal of Computer Applications
%V 187
%N 27
%P 38-43
%R 10.5120/ijca2025925482
%I Foundation of Computer Science (FCS), NY, USA
This paper evaluates the responses of various production-scale models to encoded, cleverly paraphrased, obfuscated, or multimodal prompts designed to bypass guardrails. These attacks succeed by deceiving the model's alignment layers, which are trained via Reinforcement Learning from Human Feedback (RLHF) [10], [12], [20]. The paper proposes a comprehensive taxonomy that systematically categorizes RLHF limitations and also provides mitigation strategies for these attacks.