International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187, Issue 25
Published: July 2025
Authors: Amel Muminovic
DOI: 10.5120/ijca2025925403
Amel Muminovic. Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments. International Journal of Computer Applications 187, 25 (July 2025), 1-9. DOI=10.5120/ijca2025925403
@article{ 10.5120/ijca2025925403,
author = { Amel Muminovic },
title = { Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments },
journal = { International Journal of Computer Applications },
year = { 2025 },
volume = { 187 },
number = { 25 },
pages = { 1-9 },
doi = { 10.5120/ijca2025925403 },
publisher = { Foundation of Computer Science (FCS), NY, USA }
}
%0 Journal Article
%D 2025
%A Amel Muminovic
%T Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments
%J International Journal of Computer Applications
%V 187
%N 25
%P 1-9
%R 10.5120/ijca2025925403
%I Foundation of Computer Science (FCS), NY, USA
As online platforms grow, comment sections increasingly host harassment that undermines user experience and well-being. This study benchmarks three state-of-the-art large language models (OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic Claude 3 Opus) on a corpus of 5,080 YouTube comments drawn from high-abuse videos on gaming, lifestyle, food vlog, and music channels. The dataset comprises 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and Indonesian, annotated independently by two reviewers with almost perfect agreement (Cohen’s κ = 0.83). Each model is evaluated in a strict zero-shot setting with an identical minimal prompt and deterministic decoding, giving a fair multi-language comparison without task-specific tuning. GPT-4.1 achieves the best balance, with an F1 score of 0.863, precision of 0.887, and recall of 0.841. Gemini attains the highest recall (0.875), but its precision falls to 0.767 because of frequent false positives. Claude attains the highest precision at 0.920 and the lowest false-positive rate at 0.022, yet its recall drops to 0.720. Qualitative analysis shows that all three models struggle with sarcasm, coded insults, and mixed-language slang. The findings highlight the need for moderation pipelines that combine complementary models, incorporate conversational context, and fine-tune for under-represented languages and implicit abuse. A de-identified version of the dataset, along with the prompts and model outputs, has been made available to support reproducibility and further progress in automated content moderation.
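For illustration, a zero-shot run of the kind described in the abstract can be sketched as follows. This is a minimal sketch, not the study's code: it assumes the OpenAI Python SDK with an API key in the environment, uses temperature 0 for deterministic decoding, and the prompt wording and label set are hypothetical, since the paper's actual minimal prompt is released with the dataset rather than quoted here.

# Illustrative zero-shot classification sketch; prompt wording is hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Classify the following YouTube comment as HARMFUL or NOT_HARMFUL. "
    "Reply with exactly one label.\n\nComment: {comment}"
)

def classify(comment: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",   # one of the three benchmarked models
        temperature=0,     # deterministic decoding, as in the study
        messages=[{"role": "user", "content": PROMPT.format(comment=comment)}],
    )
    return response.choices[0].message.content.strip()

print(classify("nobody asked for your opinion, quit the game"))

The same call shape, with only the model id swapped, would give the identical-prompt comparison across the three systems that the abstract describes.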
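The agreement and evaluation figures quoted above are standard binary-classification metrics, and can be computed along these lines with scikit-learn. The label arrays below are tiny placeholders for illustration, not the study's data.

# Sketch of the reported metrics; placeholder labels, 1 = harmful.
from sklearn.metrics import (
    cohen_kappa_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Inter-annotator agreement between the two independent reviewers.
annotator_a = [1, 0, 0, 1, 0, 1, 0, 0]
annotator_b = [1, 0, 0, 1, 0, 0, 0, 0]
kappa = cohen_kappa_score(annotator_a, annotator_b)  # paper reports 0.83 on the full corpus

# Per-model scores against the adjudicated gold labels.
gold = [1, 0, 0, 1, 0, 1, 0, 0]
pred = [1, 0, 1, 1, 0, 0, 0, 0]
precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, average="binary"
)
tn, fp, fn, tp = confusion_matrix(gold, pred).ravel()
false_positive_rate = fp / (fp + tn)  # e.g. Claude's reported FPR is 0.022

print(f"kappa={kappa:.3f} P={precision:.3f} R={recall:.3f} "
      f"F1={f1:.3f} FPR={false_positive_rate:.3f}")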