Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy

Ritu Kuklani; Gururaj Shinde; Varad Vishwarupe

Research Article

Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy

by Ritu Kuklani, Gururaj Shinde, Varad Vishwarupe

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 187 - Issue 27

Published: August 2025

Authors: Ritu Kuklani, Gururaj Shinde, Varad Vishwarupe

10.5120/ijca2025925482

PDF

Ritu Kuklani, Gururaj Shinde, Varad Vishwarupe . Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy. International Journal of Computer Applications. 187, 27 (August 2025), 38-43. DOI=10.5120/ijca2025925482

                        @article{ 10.5120/ijca2025925482,
                        author  = { Ritu Kuklani,Gururaj Shinde,Varad Vishwarupe },
                        title   = { Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy },
                        journal = { International Journal of Computer Applications },
                        year    = { 2025 },
                        volume  = { 187 },
                        number  = { 27 },
                        pages   = { 38-43 },
                        doi     = { 10.5120/ijca2025925482 },
                        publisher = { Foundation of Computer Science (FCS), NY, USA }
                        }

                        %0 Journal Article
                        %D 2025
                        %A Ritu Kuklani
                        %A Gururaj Shinde
                        %A Varad Vishwarupe
                        %T Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy%T 
                        %J International Journal of Computer Applications
                        %V 187
                        %N 27
                        %P 38-43
                        %R 10.5120/ijca2025925482
                        %I Foundation of Computer Science (FCS), NY, USA

Abstract

In this paper, various production scale model responses have been evaluated against encoded and cleverly paraphrased, obfuscated, or multimodal prompts to bypass guardrails. These attacks succeed by deceiving the model’s alignment layers trained via Reinforcement Learning from Human Feedback [10], [12], [20]. The paper proposes a comprehensive taxonomy that systematically categorizes RLHF limitations and also provide mitigation strategies for these attacks.

References

Bluedot. (2024). RLHF Limitations for AI Safety. https://bluedot.org/blog/rlhf-limitations-for-ai-safety
Vishwarupe, V., Zahoor, S., Akhter, R., Bhatkar, V. P., Bedekar, M., Pande, M., Joshi, P. M., Patil, A., & Pawar, V. (2023). Designing a human-centered AI-based cognitive learning model for Industry 4.0 applications. In Industry 4.0 Convergence with AI, IoT, Big Data and Cloud Computing (pp. 84–95). Bentham Science Publishers.
Anup. (2024). LLM Security 101: Defending Against Prompt Injections. https://www.anup.io/p/llm-security-101-defending-against
Gehman, S., et al. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. arXiv preprint arXiv:2009.11462.
Sayyed, H., Alwazae, M., & Vishwarupe, V. (2025). BlockSafe: Universal blockchain-based identity management. In Big Data in Finance (Vol. 169, pp. 101–118). Springer.
Vishwarupe, V., Maheshwari, S., Deshmukh, A., Mhaisalkar, S., Joshi, P. M., & Mathias, N. (2022). Bringing humans at the epicentre of artificial intelligence. Procedia Computer Science, 204, 914–921.
HiddenLayer. (2024a). Novel Universal Bypass for All Major LLMs. https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms
HiddenLayer. (2024b). Prompt Injection Attacks on LLMs. https://hiddenlayer.com/innovation-hub/prompt-injection-attacks-on-llms
Vishwarupe, V., Bedekar, M., Pande, M., & Hiwale, A. (2018). Intelligent Twitter spam detection: A hybrid approach. In Smart trends in systems, security and sustainability (Vol. 18, pp. 157–167). Springer.
Kili Technology. (2024a). Preventing Adversarial Prompt Injections with LLM Guardrails. https://kili-technology.com/large-language-models-llms/preventing-adversarial-prompt-injections-with-llm-guardrails
Kili Technology. (2024b). Exploring Reinforcement Learning from Human Feedback (RLHF): A Comprehensive Guide. https://kili-technology.com/large-language-models-llms/exploring-reinforcement-learning-from-human-feedback-rlhf-a-comprehensive-guide
Label Studio. (2024). Reinforcement Learning from Verifiable Rewards. https://labelstud.io/blog/reinforcement-learning-from-verifiable-rewards/
Vishwarupe, V., Joshi, P. M., Mathias, N., Maheshwari, S., Mhaisalkar, S., & Pawar, V. (2022). Explainable AI and interpretable machine learning: A case study in perspective. Procedia Computer Science, 204, 869–876.
Wani, K., Khedekar, N., Vishwarupe, V., & Pushyanth, N. (2023). Digital twin and its applications. In Research Trends in Artificial Intelligence: Internet of Things (pp. 120–134). Bentham Science Publishers.
Labellerr. (2024). RLHF Explained. https://www.labellerr.com/blog/reinforcement-learning-from-human-feedback/
Vishwarupe, V., Bedekar, M., Pande, M., Bhatkar, V. P., Joshi, P., Zahoor, S., & Kuklani, P. (2022). Comparative analysis of machine learning algorithms for analyzing NASA Kepler mission data. Procedia Computer Science, 204, 945–951.
Vishwarupe, V. (2022). Synthetic content generation using artificial intelligence. All Things Policy, IVM Podcasts.
Zahoor, S., Bedekar, M., Mane, V., & Vishwarupe, V. (2016). Uniqueness in user behavior while using the web. In Proceedings of the International Congress on Information and Communication Technology (Vol. 438, pp. 229–236). Springer.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
Understanding RLHF. (2024). A Comprehensive Curriculum on RLHF. https://understanding-rlhf.github.io
Vishwarupe, V., Bedekar, M., & Zahoor, S. (2015). Zone-specific weather monitoring system using crowdsourcing and telecom infrastructure. In 2015 International Conference on Information Processing (ICIP) (pp. 823–827). IEEE.
Zahoor, S., Bedekar, M., & Vishwarupe, V. (2016). A framework to infer webpage relevancy for a user. In Proceedings of First International Conference on ICT for Intelligent Systems (Vol. 50, pp. 173–181). Springer.
WithSecure. (2024). LLaMA 3 Prompt Injection Hardening. https://labs.withsecure.com/publications/llama3-prompt-injection-hardening
Reddit – Prompt Engineering. (2024). Prompting an LLM to stop giving extra responses. https://www.reddit.com/r/PromptEngineering/comments/1h5367l/how_do_i_prompt_an_llm_to_stop_giving_me_extra/
Deoskar, V., Pande, M., & Vishwarupe, V. (2024). An analytical study for implementing 360-degree M-HRM practices using AI. In Intelligent Systems for Smart Cities (pp. 429–442). Springer.
Vishwarupe, V., et al. (2021). A zone-specific weather monitoring system. Australian Patent No. AU2021106275.
Reddit – Outlier AI. (2024). How to Create a Model Failure for Cypher RLHF. https://www.reddit.com/r/outlier_ai/comments/1hgoho7/how_to_create_a_model_failure_for_cypher_rlhf/
arXiv (2024a). Prompt Injection Mitigation for LLMs. arXiv preprint arXiv:2503.03039v1.
Vishwarupe, V., Bedekar, M., Joshi, P. M., Pande, M., Pawar, V., & Shingote, P. (2022). Data analytics in the game of cricket: A novel paradigm. Procedia Computer Science, 204, 937–944.
Alignment Forum. (2024). Interpreting Preference Models with Sparse Autoencoders. https://www.alignmentforum.org/posts/5XmxmszdjzBQzqpmz/interpreting-preference-models-w-sparse-autoencoders
Vishwarupe, V. V., & Joshi, P. M. (2016). Intellert: A novel approach for content-priority based message filtering. In IEEE Bombay Section Symposium (IBSS) (pp. 1–6). IEEE.
Vishwarupe, V., et al. (2025). Predicting mental health ailments using social media activities and keystroke dynamics with machine learning. In Big Data in Finance (Vol. 169, pp. 63–80). Springer.
Zahoor, S., Akhter, R., Vishwarupe, V., Bedekar, M., Pande, M., Bhatkar, V. P., Joshi, P. M., Pawar, V., Mandora, N., & Kuklani, P. (2023). A comprehensive study of state-of-the-art applications and challenges in IoT and blockchain technologies for Industry 4.0. In Industry 4.0 Convergence with AI, IoT, Big Data and Cloud Computing (pp. 1–16). Bentham.
NeurIPS 2024. (2024). Poster #96148. https://neurips.cc/virtual/2024/poster/96148
OpenReview. (2024). Submission T1lFrYwtf7. https://openreview.net/forum?id=T1lFrYwtf7
Anup. (2024). LLM Security 101: Defending Against Prompt Injections. https://www.anup.io/p/llm-security-101-defending-against
Bluedot. (2024). RLHF Limitations for AI Safety. https://bluedot.org/blog/rlhf-limitations-for-ai-safety
Gehman, S., et al. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. arXiv preprint arXiv:2009.11462.
HiddenLayer. (2024a). Novel Universal Bypass for All Major LLMs. https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms
HiddenLayer. (2024b). Prompt Injection Attacks on LLMs. https://hiddenlayer.com/innovation-hub/prompt-injection-attacks-on-llms
Kili Technology. (2024a). Preventing Adversarial Prompt Injections with LLM Guardrails. https://kili-technology.com/large-language-models-llms/preventing-adversarial-prompt-injections-with-llm-guardrails
Kili Technology. (2024b). Exploring Reinforcement Learning from Human Feedback (RLHF): A Comprehensive Guide. https://kili-technology.com/large-language-models-llms/exploring-reinforcement-learning-from-human-feedback-rlhf-a-comprehensive-guide
Label Studio. (2024). Reinforcement Learning from Verifiable Rewards. https://labelstud.io/blog/reinforcement-learning-from-verifiable-rewards/
Labellerr. (2024). RLHF Explained. https://www.labellerr.com/blog/reinforcement-learning-from-human-feedback/
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
Understanding RLHF. (2024). A Comprehensive Curriculum on RLHF. https://understanding-rlhf.github.io
WithSecure. (2024). LLaMA 3 Prompt Injection Hardening. https://labs.withsecure.com/publications/llama3-prompt-injection-hardening
Alignment Forum. (2024). Interpreting Preference Models with Sparse Autoencoders. https://www.alignmentforum.org/posts/5XmxmszdjzBQzqpmz/interpreting-preference-models-w-sparse-autoencoders
OpenReview. (2024). Submission T1lFrYwtf7. https://openreview.net/forum?id=T1lFrYwtf7

Index Terms

Computer Science

Information Sciences

No index terms available.

Keywords

Reinforcement Learning from Human Feedback Indirect Multimodal Manipulations Large Language Models Semantic Jailbreaks