Research Article

Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments

by Amel Muminovic
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 25
Published: July 2025
DOI: 10.5120/ijca2025925403

Amel Muminovic. Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments. International Journal of Computer Applications. 187, 25 (July 2025), 1-9. DOI=10.5120/ijca2025925403

@article{10.5120/ijca2025925403,
  author    = {Amel Muminovic},
  title     = {Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {25},
  pages     = {1-9},
  doi       = {10.5120/ijca2025925403},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2025
%A Amel Muminovic
%T Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments
%J International Journal of Computer Applications
%V 187
%N 25
%P 1-9
%R 10.5120/ijca2025925403
%I Foundation of Computer Science (FCS), NY, USA
Abstract

As online platforms grow, comment sections increasingly host harassment that undermines user experience and well-being. This study benchmarks three state-of-the-art large language models (OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic Claude 3 Opus) on a corpus of 5,080 YouTube comments drawn from high-abuse videos on gaming, lifestyle, food vlog, and music channels. The dataset comprises 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and Indonesian, annotated independently by two reviewers with almost perfect agreement (Cohen's κ = 0.83). Each model is evaluated in a strict zero-shot setting with an identical minimal prompt and deterministic decoding, giving a fair multi-language comparison without task-specific tuning. GPT-4.1 achieves the best balance, with an F1 score of 0.863, precision of 0.887, and recall of 0.841. Gemini flags the largest share of harmful posts (recall = 0.875), but its precision falls to 0.767 because of frequent false positives. Claude attains the highest precision (0.920) and the lowest false-positive rate (0.022), yet its recall drops to 0.720. Qualitative analysis shows that all three models struggle with sarcasm, coded insults, and mixed-language slang. The findings highlight the need for moderation pipelines that combine complementary models, incorporate conversational context, and fine-tune for under-represented languages and implicit abuse. A de-identified version of the dataset, together with the prompts and model outputs, has been released to support reproducibility and further progress in automated content moderation.
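
To make the evaluation protocol concrete, the sketch below illustrates how a single comment could be scored in the zero-shot setting described above (one minimal prompt, deterministic decoding) and how the reported metrics and the inter-annotator agreement can be computed. The prompt wording, label strings, and helper names are illustrative assumptions rather than the exact artifacts released with the paper; the metric functions are the standard scikit-learn implementations, and analogous API clients would be used for Gemini 1.5 Pro and Claude 3 Opus.

# Minimal sketch of the zero-shot benchmarking setup (assumed prompt and labels).
from openai import OpenAI                      # comparable clients exist for Gemini and Claude
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             cohen_kappa_score, confusion_matrix)

# Hypothetical minimal prompt; the prompt released with the paper is authoritative.
PROMPT = ("Decide whether the following YouTube comment is harmful (cyberbullying "
          "or harassment). Answer with exactly one word: HARMFUL or NOT_HARMFUL.\n\n"
          "Comment: {comment}")

def classify_zero_shot(client: OpenAI, comment: str, model: str = "gpt-4.1") -> int:
    """Return 1 (harmful) or 0 (non-harmful) for one comment, zero-shot."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,                         # deterministic decoding
        messages=[{"role": "user", "content": PROMPT.format(comment=comment)}],
    )
    answer = resp.choices[0].message.content.strip().upper()
    return 0 if answer.startswith("NOT") else 1

def evaluate(y_true, y_pred):
    """Precision, recall, F1, and false-positive rate, as reported per model."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "false_positive_rate": fp / (fp + tn),
    }

# Inter-annotator agreement between the two reviewers (the abstract reports kappa = 0.83):
# kappa = cohen_kappa_score(labels_reviewer_a, labels_reviewer_b)

Under this setup, per-model trade-offs such as Gemini's higher recall and Claude's higher precision would follow directly from running evaluate() on each model's predictions against the gold labels.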

References
  • B. Dean, “Social media usage & growth statistics,” Backlinko, Feb. 21, 2024. [Online]. Available: https://backlinko.com/social-media-users
  • A. B. Barragán Martín et al., “Study of cyberbullying among adolescents in recent years: A bibliometric analysis,” Int. J. Environ. Res. Public Health, vol. 18, no. 6, p. 3016, Mar. 2021. doi:10.3390/ijerph18063016
  • S. Hinduja and J. W. Patchin, “Bullying, cyberbullying, and suicide,” Arch. Suicide Res., vol. 14, no. 3, pp. 206–221, 2010. doi:10.1080/13811118.2010.494133
  • C. P. Barlett, “Anonymously hurting others online: The effect of anonymity on cyberbullying frequency,” Psychol. Pop. Media Cult., vol. 4, no. 2, pp. 70–79, 2015. doi:10.1037/a0034335
  • L. Huang et al., “The severity of cyberbullying affects bystander intervention among college students: The roles of feelings of responsibility and empathy,” Psychol. Res. Behav. Manag., vol. 16, pp. 893–903, Mar. 2023. doi:10.2147/PRBM.S397770
  • A. Vigderman, “Cyberbullying: Twenty crucial statistics for 2024,” Security.org, Oct. 9, 2024. [Online]. Available: https://www.security.org/resources/cyberbullying-factsstatistics
  • W. Craig et al., “Social media use and cyber-bullying: A cross-national analysis of young people in 42 countries,” J. Adolesc. Health, vol. 66, no. 6, pp. S100–S108, Jun. 2020. doi:10.1016/j.jadohealth.2020.03.006
  • M. H. Ribeiro, J. Cheng, and R. West, “Automated content moderation increases adherence to community guidelines,” in Proc. ACM Web Conf. (WWW), 2023, pp. 2666–2676. doi:10.1145/3543507.3583275
  • S. Wang and K. J. Kim, “Content moderation on social media: Does it matter who and why moderates hate speech?” Cyberpsychol. Behav. Soc. Netw., vol. 26, no. 7, pp. 527–534, Jul. 2023. doi:10.1089/cyber.2022.0158
  • T. Gillespie, “Content moderation, AI, and the question of scale,” Big Data Soc., vol. 7, no. 2, pp. 1–5, Jul. 2020. doi:10.1177/2053951720943234
  • H. Lopez and S. Kübler, “Context in abusive language detection: On the interdependence of context and annotation of user comments,” Discourse, Context Media, vol. 63, Art. no. 100848, Feb. 2025. doi:10.1016/j.dcm.2024.100848
  • M. van Geel, P. Vedder, and J. Tanilon, “Relationship between peer victimization, cyberbullying, and suicide in children and adolescents: A meta-analysis,” JAMA Pediatr., vol. 168, no. 5, pp. 435–442, May 2014. doi:10.1001/jamapediatrics.2013.4143
  • Z. Waseem and D. Hovy, “Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter,” in Proc. NAACL Student Res. Workshop, San Diego, CA, USA, Jun. 2016, pp. 88–93. doi:10.18653/v1/N16-2013
  • A.-M. Founta et al., “Large scale crowdsourcing and characterization of Twitter abusive behavior,” in Proc. Int. Conf. Web Social Media, Atlanta, GA, USA, Mar. 2018, pp. 491–500. doi:10.1609/icwsm.v12i1.14991
  • M. Zampieri et al., “Predicting the type and target of offensive posts in social media,” in Proc. NAACL, Minneapolis, MN, USA, Jun. 2019, pp. 1415–1420. doi:10.18653/v1/N19-1144
  • P. Röttger, B. Vidgen, D. Nguyen, Z. Waseem, H. Margetts, and J. Pierrehumbert, “HateCheck: Functional tests for hate speech detection models,” in Proc. 59th Annu. Meet. Assoc. Comput. Linguistics & 11th Int. Joint Conf. NLP (Long Papers), Online, Aug. 2021, pp. 41–58. doi:10.18653/v1/2021.acl-long.4
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pretraining of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, Minneapolis, MN, USA, Jun. 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423
  • Y. Liu et al., “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, Jul. 2019. [Online]. Available: https://arxiv.org/abs/1907.11692
  • B. Mathew, P. Saha, S. M. Yimam, C. Biemann, P. Goyal, and A. Mukherjee, “HateXplain: A benchmark dataset for explainable hate speech detection,” in Proc. AAAI Conf. Artif. Intell., vol. 35, no. 17, May 2021, pp. 14867–14875. doi:10.1609/aaai.v35i17.17745
  • B. Vidgen, T. Thrush, Z. Waseem, and D. Kiela, “Learning from the worst: Dynamically generated datasets to improve online hate detection,” in Proc. 59th Annu. Meet. Assoc. Comput. Linguistics & 11th Int. Joint Conf. NLP (Long Papers), Aug. 2021, pp. 1667–1682. doi:10.18653/v1/2021.acl-long.132
  • S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, “RealToxicityPrompts: Evaluating neural toxic degeneration in language models,” in Findings Assoc. Comput. Linguistics: EMNLP 2020, Nov. 2020, pp. 3356–3369. doi:10.18653/v1/2020.findings-emnlp.301
  • A. Arora, “Sarcasm detection in social media: A review,” in Proc. Int. Conf. Innov. Comput. Commun. (ICICC), Dec. 2020, pp. 1–4. doi:10.2139/ssrn.3749018
  • M. S. Jahan and M. Oussalah, “A systematic review of hate speech automatic detection using natural language processing,” Neurocomputing, vol. 546, Art. no. 126232, Aug. 2023. doi:10.1016/j.neucom.2023.126232
  • J. M. Pérez et al., “Assessing the impact of contextual information in hate speech detection,” IEEE Access, vol. 11, pp. 30575–30590, 2023. doi:10.1109/ACCESS.2023.3258973
  • A. Muminovic and A. K. Muminovic, “Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages,” arXiv preprint arXiv:2506.09992, Jun. 2025. [Online]. Available: https://arxiv.org/abs/2506.09992
  • H. Mubarak, K. Darwish, and W. Magdy, “Abusive language detection on Arabic social media,” in Proc. 1st Workshop on Abusive Language Online, Vancouver, Canada, 2017, pp. 52–56. doi:10.18653/v1/W17-3008
  • T. Mandl et al., “Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages,” in Proc. FIRE, Kolkata, India, 2019, pp. 14–17. doi:10.1145/3368567.3368584
  • T. Gröndahl, L. Pajola, M. Juuti, M. Conti, and N. Asokan, “‘All you need is Love’: Evading hate speech detection,” in Proc. 11th ACM Workshop Artif. Intell. Security, Toronto, Canada, 2018, pp. 2–12. doi:10.1145/3270101.3270103
  • N. Murikinati, A. Anastasopoulos, and G. Neubig, “Transliteration for cross-lingual morphological inflection,” in Proc. 17th SIGMORPHON Workshop Computational Research Phonetics, Phonology, and Morphology, Online, Jul. 2020, pp. 189–197. doi:10.18653/v1/2020.sigmorphon-1.22
  • S. Khanuja, A. Dandapat, A. Srinivasan, S. Sitaram, and M. Choudhury, “GLUECoS: An evaluation benchmark for codeswitched NLP,” in Proc. ACL, Online, 2020, pp. 3575–3585. doi:10.18653/v1/2020.acl-main.329
  • T. Ranasinghe and M. Zampieri, “Multilingual offensive language identification with cross-lingual embeddings,” in Proc. EMNLP, Online, 2020, pp. 5838–5844. doi:10.18653/v1/2020.emnlp-main.470
  • Ç. Çöltekin, “A corpus of Turkish offensive language on social media,” in Proc. LREC, Marseille, France, 2020, pp. 4878–4885. [Online]. Available: https://aclanthology.org/2020.lrec-1.758
  • E. Pamungkas and V. Patti, “Cross-domain and cross-lingual abusive language detection: A hybrid approach with deep learning and a multilingual lexicon,” in Proc. ACL, Florence, Italy, 2019, pp. 363–370. doi:10.18653/v1/P19-1051
  • Y. Liu and M. Zhang, “LLM-Mod: Can Large Language Models Assist Content Moderation?” in Proc. ACM Conf. Fairness, Accountability, and Transparency (FAccT), Rio de Janeiro, Brazil, 2024, pp. 1–12. doi:10.1145/3613905.3650828
  • F. M. Plaza-Del-Arco, D. Nozza, and D. Hovy, “Respectful or toxic? Using zero-shot learning with language models to detect hate speech,” in Proc. 7th Workshop Online Abuse and Harms (WOAH), Singapore, Jan. 2023, pp. 46–52. doi:10.18653/v1/2023.woah-1.6
  • J. Pavlopoulos et al., “Toxicity detection: Does context really matter?” in Proc. ACL, Online, 2020, pp. 4296–4305. doi:10.18653/v1/2020.acl-main.396
  • A. Baheti, M. Sap, and Y. Tsvetkov, “Just say no: Analyzing the stance of neural dialogue generation in offensive contexts,” in Proc. EMNLP, Online, 2021, pp. 4846–4859. doi:10.18653/v1/2021.emnlp-main.397
  • M. Sap et al., “Social bias frames: Reasoning about social and power implications of language,” in Proc. ACL, Online, 2020, pp. 5477–5490. doi:10.18653/v1/2020.acl-main.486
  • I. Solaiman and C. Dennison, “Process for adapting language models to society (PALMS),” Tech. Rep., OpenAI, 2021. [Online]. Available: https://arxiv.org/abs/2106.10328
  • T. Bolukbasi, K.-W. Chang, J. Zou, V. Saligrama, and A. Kalai, “Man is to computer programmer as woman is to homemaker? Debiasing word embeddings,” in Proc. NeurIPS, Barcelona, Spain, 2016, pp. 4356–4364. doi:10.48550/arXiv.1607.06520
  • H. Welbl, A. Stiennon, and Y. Bai, “Challenges in detoxifying language models,” Tech. Rep., DeepMind, 2021. [Online]. Available: https://arxiv.org/abs/2109.07445
  • T. Hartvigsen, H. Palangi, and X. He, “ToxiGen: Controllable generation of implicit and adversarial toxic text,” in Proc. ACL, Dublin, Ireland, 2022, pp. 524–535. doi:10.18653/v1/2022.acl-long.39
  • E. Bender et al., “On the dangers of stochastic parrots,” in Proc. FAccT, Online, 2021, pp. 610–623. doi:10.1145/3442188.3445922
Index Terms
Computer Science
Information Sciences
Keywords

Artificial intelligence, Cyberbullying, Hate Speech, Large Language Models, Natural Language Processing, Social Media
