Research Article

Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy

by  Ritu Kuklani, Gururaj Shinde, Varad Vishwarupe
journal cover
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 27
Published: August 2025
Authors: Ritu Kuklani, Gururaj Shinde, Varad Vishwarupe
10.5120/ijca2025925482
PDF

Ritu Kuklani, Gururaj Shinde, Varad Vishwarupe . Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy. International Journal of Computer Applications. 187, 27 (August 2025), 38-43. DOI=10.5120/ijca2025925482

                        @article{ 10.5120/ijca2025925482,
                        author  = { Ritu Kuklani,Gururaj Shinde,Varad Vishwarupe },
                        title   = { Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy },
                        journal = { International Journal of Computer Applications },
                        year    = { 2025 },
                        volume  = { 187 },
                        number  = { 27 },
                        pages   = { 38-43 },
                        doi     = { 10.5120/ijca2025925482 },
                        publisher = { Foundation of Computer Science (FCS), NY, USA }
                        }
                        %0 Journal Article
                        %D 2025
                        %A Ritu Kuklani
                        %A Gururaj Shinde
                        %A Varad Vishwarupe
                        %T Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy%T 
                        %J International Journal of Computer Applications
                        %V 187
                        %N 27
                        %P 38-43
                        %R 10.5120/ijca2025925482
                        %I Foundation of Computer Science (FCS), NY, USA
Abstract

In this paper, various production scale model responses have been evaluated against encoded and cleverly paraphrased, obfuscated, or multimodal prompts to bypass guardrails. These attacks succeed by deceiving the model’s alignment layers trained via Reinforcement Learning from Human Feedback [10], [12], [20]. The paper proposes a comprehensive taxonomy that systematically categorizes RLHF limitations and also provide mitigation strategies for these attacks.

References
  • Bluedot. (2024). RLHF Limitations for AI Safety. https://bluedot.org/blog/rlhf-limitations-for-ai-safety
  • Vishwarupe, V., Zahoor, S., Akhter, R., Bhatkar, V. P., Bedekar, M., Pande, M., Joshi, P. M., Patil, A., & Pawar, V. (2023). Designing a human-centered AI-based cognitive learning model for Industry 4.0 applications. In Industry 4.0 Convergence with AI, IoT, Big Data and Cloud Computing (pp. 84–95). Bentham Science Publishers.
  • Anup. (2024). LLM Security 101: Defending Against Prompt Injections. https://www.anup.io/p/llm-security-101-defending-against
  • Gehman, S., et al. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. arXiv preprint arXiv:2009.11462.
  • Sayyed, H., Alwazae, M., & Vishwarupe, V. (2025). BlockSafe: Universal blockchain-based identity management. In Big Data in Finance (Vol. 169, pp. 101–118). Springer.
  • Vishwarupe, V., Maheshwari, S., Deshmukh, A., Mhaisalkar, S., Joshi, P. M., & Mathias, N. (2022). Bringing humans at the epicentre of artificial intelligence. Procedia Computer Science, 204, 914–921.
  • HiddenLayer. (2024a). Novel Universal Bypass for All Major LLMs. https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms
  • HiddenLayer. (2024b). Prompt Injection Attacks on LLMs. https://hiddenlayer.com/innovation-hub/prompt-injection-attacks-on-llms
  • Vishwarupe, V., Bedekar, M., Pande, M., & Hiwale, A. (2018). Intelligent Twitter spam detection: A hybrid approach. In Smart trends in systems, security and sustainability (Vol. 18, pp. 157–167). Springer.
  • Kili Technology. (2024a). Preventing Adversarial Prompt Injections with LLM Guardrails. https://kili-technology.com/large-language-models-llms/preventing-adversarial-prompt-injections-with-llm-guardrails
  • Kili Technology. (2024b). Exploring Reinforcement Learning from Human Feedback (RLHF): A Comprehensive Guide. https://kili-technology.com/large-language-models-llms/exploring-reinforcement-learning-from-human-feedback-rlhf-a-comprehensive-guide
  • Label Studio. (2024). Reinforcement Learning from Verifiable Rewards. https://labelstud.io/blog/reinforcement-learning-from-verifiable-rewards/
  • Vishwarupe, V., Joshi, P. M., Mathias, N., Maheshwari, S., Mhaisalkar, S., & Pawar, V. (2022). Explainable AI and interpretable machine learning: A case study in perspective. Procedia Computer Science, 204, 869–876.
  • Wani, K., Khedekar, N., Vishwarupe, V., & Pushyanth, N. (2023). Digital twin and its applications. In Research Trends in Artificial Intelligence: Internet of Things (pp. 120–134). Bentham Science Publishers.
  • Labellerr. (2024). RLHF Explained. https://www.labellerr.com/blog/reinforcement-learning-from-human-feedback/
  • Vishwarupe, V., Bedekar, M., Pande, M., Bhatkar, V. P., Joshi, P., Zahoor, S., & Kuklani, P. (2022). Comparative analysis of machine learning algorithms for analyzing NASA Kepler mission data. Procedia Computer Science, 204, 945–951.
  • Vishwarupe, V. (2022). Synthetic content generation using artificial intelligence. All Things Policy, IVM Podcasts.
  • Zahoor, S., Bedekar, M., Mane, V., & Vishwarupe, V. (2016). Uniqueness in user behavior while using the web. In Proceedings of the International Congress on Information and Communication Technology (Vol. 438, pp. 229–236). Springer.
  • Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
  • Understanding RLHF. (2024). A Comprehensive Curriculum on RLHF. https://understanding-rlhf.github.io
  • Vishwarupe, V., Bedekar, M., & Zahoor, S. (2015). Zone-specific weather monitoring system using crowdsourcing and telecom infrastructure. In 2015 International Conference on Information Processing (ICIP) (pp. 823–827). IEEE.
  • Zahoor, S., Bedekar, M., & Vishwarupe, V. (2016). A framework to infer webpage relevancy for a user. In Proceedings of First International Conference on ICT for Intelligent Systems (Vol. 50, pp. 173–181). Springer.
  • WithSecure. (2024). LLaMA 3 Prompt Injection Hardening. https://labs.withsecure.com/publications/llama3-prompt-injection-hardening
  • Reddit – Prompt Engineering. (2024). Prompting an LLM to stop giving extra responses. https://www.reddit.com/r/PromptEngineering/comments/1h5367l/how_do_i_prompt_an_llm_to_stop_giving_me_extra/
  • Deoskar, V., Pande, M., & Vishwarupe, V. (2024). An analytical study for implementing 360-degree M-HRM practices using AI. In Intelligent Systems for Smart Cities (pp. 429–442). Springer.
  • Vishwarupe, V., et al. (2021). A zone-specific weather monitoring system. Australian Patent No. AU2021106275.
  • Reddit – Outlier AI. (2024). How to Create a Model Failure for Cypher RLHF. https://www.reddit.com/r/outlier_ai/comments/1hgoho7/how_to_create_a_model_failure_for_cypher_rlhf/
  • arXiv (2024a). Prompt Injection Mitigation for LLMs. arXiv preprint arXiv:2503.03039v1.
  • Vishwarupe, V., Bedekar, M., Joshi, P. M., Pande, M., Pawar, V., & Shingote, P. (2022). Data analytics in the game of cricket: A novel paradigm. Procedia Computer Science, 204, 937–944.
  • Alignment Forum. (2024). Interpreting Preference Models with Sparse Autoencoders. https://www.alignmentforum.org/posts/5XmxmszdjzBQzqpmz/interpreting-preference-models-w-sparse-autoencoders
  • Vishwarupe, V. V., & Joshi, P. M. (2016). Intellert: A novel approach for content-priority based message filtering. In IEEE Bombay Section Symposium (IBSS) (pp. 1–6). IEEE.
  • Vishwarupe, V., et al. (2025). Predicting mental health ailments using social media activities and keystroke dynamics with machine learning. In Big Data in Finance (Vol. 169, pp. 63–80). Springer.
  • Zahoor, S., Akhter, R., Vishwarupe, V., Bedekar, M., Pande, M., Bhatkar, V. P., Joshi, P. M., Pawar, V., Mandora, N., & Kuklani, P. (2023). A comprehensive study of state-of-the-art applications and challenges in IoT and blockchain technologies for Industry 4.0. In Industry 4.0 Convergence with AI, IoT, Big Data and Cloud Computing (pp. 1–16). Bentham.
  • NeurIPS 2024. (2024). Poster #96148. https://neurips.cc/virtual/2024/poster/96148
  • OpenReview. (2024). Submission T1lFrYwtf7. https://openreview.net/forum?id=T1lFrYwtf7
  • Anup. (2024). LLM Security 101: Defending Against Prompt Injections. https://www.anup.io/p/llm-security-101-defending-against
  • Bluedot. (2024). RLHF Limitations for AI Safety. https://bluedot.org/blog/rlhf-limitations-for-ai-safety
  • Gehman, S., et al. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. arXiv preprint arXiv:2009.11462.
  • HiddenLayer. (2024a). Novel Universal Bypass for All Major LLMs. https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms
  • HiddenLayer. (2024b). Prompt Injection Attacks on LLMs. https://hiddenlayer.com/innovation-hub/prompt-injection-attacks-on-llms
  • Kili Technology. (2024a). Preventing Adversarial Prompt Injections with LLM Guardrails. https://kili-technology.com/large-language-models-llms/preventing-adversarial-prompt-injections-with-llm-guardrails
  • Kili Technology. (2024b). Exploring Reinforcement Learning from Human Feedback (RLHF): A Comprehensive Guide. https://kili-technology.com/large-language-models-llms/exploring-reinforcement-learning-from-human-feedback-rlhf-a-comprehensive-guide
  • Label Studio. (2024). Reinforcement Learning from Verifiable Rewards. https://labelstud.io/blog/reinforcement-learning-from-verifiable-rewards/
  • Labellerr. (2024). RLHF Explained. https://www.labellerr.com/blog/reinforcement-learning-from-human-feedback/
  • Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
  • Understanding RLHF. (2024). A Comprehensive Curriculum on RLHF. https://understanding-rlhf.github.io
  • WithSecure. (2024). LLaMA 3 Prompt Injection Hardening. https://labs.withsecure.com/publications/llama3-prompt-injection-hardening
  • Alignment Forum. (2024). Interpreting Preference Models with Sparse Autoencoders. https://www.alignmentforum.org/posts/5XmxmszdjzBQzqpmz/interpreting-preference-models-w-sparse-autoencoders
  • OpenReview. (2024). Submission T1lFrYwtf7. https://openreview.net/forum?id=T1lFrYwtf7
Index Terms
Computer Science
Information Sciences
No index terms available.
Keywords

Reinforcement Learning from Human Feedback Indirect Multimodal Manipulations Large Language Models Semantic Jailbreaks

Powered by PhDFocusTM