Hostile Content Detection from Tweets in Hindi using Machine Learning and Deep Learning

Datla Tarun Anjaneya Varma; Nukala Sai Dhanuj; Nookala Gopala Krishna Murthy

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 186 - Issue 11

Published: March 2024

10.5120/ijca2024923466

Datla Tarun Anjaneya Varma, Nukala Sai Dhanuj, Nookala Gopala Krishna Murthy. Hostile Content Detection from Tweets in Hindi using Machine Learning and Deep Learning. International Journal of Computer Applications. 186, 11 (March 2024), 30-34. DOI=10.5120/ijca2024923466

@article{10.5120/ijca2024923466,
  author    = {Datla Tarun Anjaneya Varma and Nukala Sai Dhanuj and Nookala Gopala Krishna Murthy},
  title     = {Hostile Content Detection from Tweets in Hindi using Machine Learning and Deep Learning},
  journal   = {International Journal of Computer Applications},
  year      = {2024},
  volume    = {186},
  number    = {11},
  pages     = {30--34},
  doi       = {10.5120/ijca2024923466},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2024
%A Datla Tarun Anjaneya Varma
%A Nukala Sai Dhanuj
%A Nookala Gopala Krishna Murthy
%T Hostile Content Detection from Tweets in Hindi using Machine Learning and Deep Learning
%J International Journal of Computer Applications
%V 186
%N 11
%P 30-34
%R 10.5120/ijca2024923466
%I Foundation of Computer Science (FCS), NY, USA

Abstract

This paper addresses cyberbullying detection in Hindi social media text, an area that has received little research attention. The study uses the dataset from the CONSTRAINT-2021 [1][6] shared task, which contains approximately 8,200 posts annotated with the categories fake, hate, offensive, and defamation. Two approaches are compared: one translates the sentences into English and applies the mBERT transformer model, and the other uses iNLTK embeddings computed directly on the Hindi posts. The mBERT approach outperforms the iNLTK approach. With XGBoost, LightGBM, and CatBoost as classifiers, the study reports good F1 scores across the categories of hostile content. The work adds to the literature on cyberbullying detection in regional languages and offers insights for mitigating the problem.
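Since the results are reported as F1 scores per category, the following minimal sketch shows one way such scores could be computed. It assumes that a post may carry several of the four labels and that each category is scored as its own binary task; the function name `per_category_f1` and the toy labels are hypothetical and are not taken from the paper or its dataset.

```python
from typing import Dict, List, Set

# The four hostile categories named in the abstract.
CATEGORIES = ["fake", "hate", "offensive", "defamation"]


def per_category_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Return one F1 score per category, scoring each category as a binary task."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = {}
    for cat in CATEGORIES:
        tp = sum(1 for g, p in zip(gold, pred) if cat in g and cat in p)
        fp = sum(1 for g, p in zip(gold, pred) if cat not in g and cat in p)
        fn = sum(1 for g, p in zip(gold, pred) if cat in g and cat not in p)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); reported as 0.0 if the category never occurs.
        scores[cat] = 2 * tp / denom if denom else 0.0
    return scores


# Made-up toy labels for illustration only (an empty set means non-hostile).
gold = [{"fake"}, {"hate", "offensive"}, set(), {"defamation"}, {"hate"}]
pred = [{"fake"}, {"hate"}, {"fake"}, {"defamation"}, {"offensive"}]
scores = per_category_f1(gold, pred)
for category, value in scores.items():
    print(f"{category}: {value:.3f}")
```

Scoring each category separately keeps a weak category visible instead of letting it be averaged away by the stronger ones.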

References

[1] 2021. CONSTRAINT-2021: Shared tasks on Hostile Posts Detection.
[2] Singh, D., Singh, R., & Kaur, R. (2021). Cyberbullying detection in Hindi tweets using machine learning. Journal of Ambient Intelligence and Humanized Computing, 12(8), 8347-8360.
[3] Gautam, S., Tiwari, S., & Singh, M. P. (2020). Cyberbullying detection in Hindi text using machine learning techniques. International Journal of Computer Science and Mobile Computing, 9(6), 288-295.
[4] Sharma, A., Singh, P., & Gupta, N. (2021). Cyberbullying detection in Hindi using convolutional neural networks and recurrent neural networks. In 2021 4th International Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp. 1-6). IEEE.
[5] Waseem, Z.; and Hovy, D. 2016. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. In SRW@HLT-NAACL.
[6] Mohit Bharadwaj, Md Shad Akhtar, Asif Ekbal, Amitava Das, Tanmoy Chakraborty. Hostility Detection Dataset in Hindi. arXiv:2011.03588.
[7] Wijesiriwardena, C., Herath, T., Shanthikumar, S., & Fernando, S. (2020). A survey on detection and analysis of online harassment on social media. Journal of Ambient Intelligence and Humanized Computing, 11(6), 2585-2600.
[8] Haddad, H., Elsayed, T., Torki, M., & Eldesouki, M. (2020). Multilingual hate speech detection in Arabic and English: A comparative study. Computers in Human Behavior, 112, 106480.
[9] Hossain, M. A., Uddin, M. S., & Islam, M. S. (2020). Bengali cyberbullying detection in social media using machine learning approach. Journal of Ambient Intelligence and Humanized Computing, 11(10), 4225-4236.
[10] Jha, V.; Poroli, H.; N, V.; Vijayan, V.; and P, P. 2020. DHOT-Repository and Classification of Offensive Tweets in the Hindi Language. Procedia Computer Science 171: 2324–2333. doi:10.1016/j.procs.2020.04.252.
[11] Safi Samghabadi, N.; Patwa, P.; PYKL, S.; Mukherjee, P.; Das, A.; and Solorio, T. 2020. Aggression and Misogyny Detection using BERT: A Multi-Task Approach. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, 126–131. Marseille, France: European Language Resources Association (ELRA). ISBN 979-10-95546-56-6. URL https://www.aclweb.org/anthology/2020.trac-1.20.
[12] Safi Samghabadi, A., Shahbazian, R., & Tahir, A. (2020). Aggressiveness and misogyny detection in English, Bengali and Hindi languages. Multimedia Tools and Applications, 79(43), 32487-32507.
[13] Bohra, N., Mitra, T., & Biemann, C. (2018). Analysis of abusive language detection in Hindi-English code mixed data. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018) (pp. 54-63).
[14] Mathur, P., Agarwal, S., & Kumaraguru, P. (2018). Detecting inflammatory posts in Hindi-English code-mixed social media content. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018) (pp. 2153-2157).
[15] Davidson, T.; Warmsley, D.; Macy, M.; and Weber, I. 2017. Automated Hate Speech Detection and the Problem of Offensive Language. In ICWSM.
[16] Jha, R. K., Kumar, P., & Sinha, R. (2020). Swear words based objectionable text identification in Hindi. In 2020 8th International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (pp. 17-20). IEEE.

Index Terms

Computer Science

Information Sciences

Keywords

Hostile content detection, Cyberbullying, Machine learning, Deep learning, Hindi tweets, mBERT embeddings, iNLTK embeddings, CatBoost, F1 score