Enhancing Data Authenticity: Leveraging Humanities Annotation Practices for NLP

Urmishree Bedamatta

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 186 - Issue 62

Published: January 2025

10.5120/ijca2025924461

Urmishree Bedamatta. Enhancing Data Authenticity: Leveraging Humanities Annotation Practices for NLP. International Journal of Computer Applications. 186, 62 (January 2025), 34-37. DOI=10.5120/ijca2025924461

@article{ 10.5120/ijca2025924461,
  author    = { Urmishree Bedamatta },
  title     = { Enhancing Data Authenticity: Leveraging Humanities Annotation Practices for NLP },
  journal   = { International Journal of Computer Applications },
  year      = { 2025 },
  volume    = { 186 },
  number    = { 62 },
  pages     = { 34-37 },
  doi       = { 10.5120/ijca2025924461 },
  publisher = { Foundation of Computer Science (FCS), NY, USA }
}

%0 Journal Article
%D 2025
%A Urmishree Bedamatta
%T Enhancing Data Authenticity: Leveraging Humanities Annotation Practices for NLP
%J International Journal of Computer Applications
%V 186
%N 62
%P 34-37
%R 10.5120/ijca2025924461
%I Foundation of Computer Science (FCS), NY, USA

Abstract

This paper explores how textual criticism, traditionally a core practice of humanities research, can enhance the authenticity and interpretability of linguistic data for Natural Language Processing (NLP) applications. It proposes a multi-layered annotation model and argues that annotations extending beyond syntactic and semantic labels to encompass historical, cultural, and rhetorical contexts can give NLP systems a deeper, context-aware understanding of language. Drawing on examples from the digital edition of the Odia Mahabharata, the paper illustrates how annotations that capture word evolution, cultural nuances, and stylistic choices can mitigate challenges in transcription while preserving the authenticity of texts. It further demonstrates how such annotation practices enable NLP systems to address linguistic subtleties such as ambiguity, irony, and sentiment, making them more effective for complex tasks like machine translation, sentiment analysis, and content generation. Ultimately, the study argues that integrating humanities-driven annotation practices into NLP can both improve the quality of computational models and ensure the preservation and accessibility of culturally and historically significant language forms.
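
As an illustration only, a multi-layered annotation of the kind the abstract describes might be held as stand-off annotations over character spans. The sketch below is not the paper's model: the five layer names come from the abstract, while the class names (`Annotation`, `AnnotatedText`), the methods, the labels, and the sample text are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Layer names follow the abstract; everything else in this sketch is hypothetical.
LAYERS = ("syntactic", "semantic", "historical", "cultural", "rhetorical")


@dataclass(frozen=True)
class Annotation:
    layer: str      # one of LAYERS
    start: int      # character offset, inclusive
    end: int        # character offset, exclusive
    label: str      # short tag for the span
    note: str = ""  # free-text editorial comment


@dataclass
class AnnotatedText:
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def add(self, layer: str, start: int, end: int, label: str, note: str = "") -> Annotation:
        """Attach one annotation to the span [start, end) on the given layer."""
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer!r}")
        if not (0 <= start < end <= len(self.text)):
            raise ValueError(f"span [{start}, {end}) is outside the text")
        ann = Annotation(layer, start, end, label, note)
        self.annotations.append(ann)
        return ann

    def layers_at(self, start: int, end: int) -> Dict[str, List[Annotation]]:
        """Group every annotation that overlaps [start, end) by its layer."""
        grouped: Dict[str, List[Annotation]] = {}
        for ann in self.annotations:
            if ann.start < end and start < ann.end:
                grouped.setdefault(ann.layer, []).append(ann)
        return grouped

    def span_text(self, ann: Annotation) -> str:
        """Return the source text covered by an annotation."""
        return self.text[ann.start:ann.end]


# Placeholder text and labels; these are not taken from the Odia Mahabharata edition.
doc = AnnotatedText("placeholder line of verse")
doc.add("syntactic", 0, 11, "NOUN")
doc.add("historical", 0, 11, "variant-spelling", note="placeholder editorial note")
doc.add("rhetorical", 0, 25, "placeholder-figure")

for layer, anns in doc.layers_at(0, 11).items():
    print(layer, [(doc.span_text(a), a.label) for a in anns])
```

Keeping the annotations separate from the text means the transcription itself is never altered, and several layers can describe the same span without interfering with one another.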

Index Terms

Computer Science

Information Sciences

Natural Language Processing

Keywords

Textual criticism

Multi-layered annotation

Odia Mahabharata

Natural language processing

Digital humanities