International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 25
Published: July 2025
Authors: Amel Muminovic
Amel Muminovic. Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments. International Journal of Computer Applications. 187, 25 (July 2025), 1-9. DOI=10.5120/ijca2025925403
@article{10.5120/ijca2025925403,
  author    = {Amel Muminovic},
  title     = {Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {25},
  pages     = {1-9},
  doi       = {10.5120/ijca2025925403},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article
%D 2025
%A Amel Muminovic
%T Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments
%J International Journal of Computer Applications
%V 187
%N 25
%P 1-9
%R 10.5120/ijca2025925403
%I Foundation of Computer Science (FCS), NY, USA
As online platforms grow, comment sections increasingly host harassment that undermines user experience and well-being. This study benchmarks three state-of-the-art large language models (OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic Claude 3 Opus) on a corpus of 5,080 YouTube comments drawn from high-abuse videos in gaming, lifestyle, food vlog, and music channels. The dataset comprises 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and Indonesian, annotated independently by two reviewers with almost perfect agreement (Cohen’s κ = 0.83). Each model is evaluated in a strict zero-shot setting with an identical minimal prompt and deterministic decoding, giving a fair multi-language comparison without task-specific tuning. GPT-4.1 achieves the best balance, with an F1 score of 0.863, precision of 0.887, and recall of 0.841. Gemini flags the largest share of harmful posts (recall = 0.875), but its precision falls to 0.767 because of frequent false positives. Claude attains the highest precision at 0.920 and the lowest false-positive rate at 0.022, yet its recall drops to 0.720. Qualitative analysis shows that all three models struggle with sarcasm, coded insults, and mixed-language slang. The findings highlight the need for moderation pipelines that combine complementary models, incorporate conversational context, and fine-tune for under-represented languages and implicit abuse. A de-identified version of the dataset, along with the prompts and model outputs, has been made available to support reproducibility and further progress in automated content moderation.
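The sketch below is a minimal illustration of the evaluation protocol summarized in the abstract: a zero-shot harmful/non-harmful classification call with deterministic decoding (temperature 0), followed by computation of precision, recall, F1, and false-positive rate for the harmful class. The prompt wording, the `classify_comment` helper, the model name string, and the label parsing are illustrative assumptions, not the paper's exact prompt or code.

```python
# Illustrative sketch of a zero-shot cyberbullying-detection benchmark.
# Assumptions: the OpenAI Python client (v1.x) is installed and OPENAI_API_KEY
# is set; the prompt text and label parsing below are hypothetical stand-ins
# for the paper's minimal prompt.

from openai import OpenAI

client = OpenAI()

# Hypothetical minimal prompt; the study's exact wording is not reproduced here.
SYSTEM_PROMPT = (
    "You are a content moderation assistant. "
    "Answer with exactly one word: HARMFUL or NOT_HARMFUL."
)

def classify_comment(comment: str, model: str = "gpt-4.1") -> int:
    """Return 1 if the model labels the comment harmful, else 0."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic decoding, as in the evaluation setup
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": comment},
        ],
    )
    answer = response.choices[0].message.content.strip().upper()
    return 1 if answer.startswith("HARMFUL") else 0

def binary_metrics(gold: list[int], pred: list[int]) -> dict[str, float]:
    """Precision, recall, F1, and false-positive rate for the harmful class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

if __name__ == "__main__":
    # Toy placeholder labels; in the study, gold labels come from the two human
    # annotators and predictions from each model's zero-shot run over the corpus.
    gold = [1, 0, 1, 1, 0, 0]
    pred = [1, 0, 0, 1, 1, 0]
    print(binary_metrics(gold, pred))
```

Running the same loop with an identical prompt and temperature 0 against each model keeps the comparison controlled, since any difference in the metrics then reflects the model rather than prompt or sampling variation.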