Research Article

A POS-Tagged Corpus for Dogri: Development and Annotation Using DogriTag

by  Vipul Saluja, Jyotshna Dongardive
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 59
Published: November 2025
10.5120/ijca2025926001

Vipul Saluja, Jyotshna Dongardive. A POS-Tagged Corpus for Dogri: Development and Annotation Using DogriTag. International Journal of Computer Applications. 187, 59 (November 2025), 36-43. DOI=10.5120/ijca2025926001

@article{ 10.5120/ijca2025926001,
author  = { Vipul Saluja and Jyotshna Dongardive },
title   = { A POS-Tagged Corpus for Dogri: Development and Annotation Using DogriTag },
journal = { International Journal of Computer Applications },
year    = { 2025 },
volume  = { 187 },
number  = { 59 },
pages   = { 36--43 },
doi     = { 10.5120/ijca2025926001 },
publisher = { Foundation of Computer Science (FCS), NY, USA }
}

%0 Journal Article
%D 2025
%A Vipul Saluja
%A Jyotshna Dongardive
%T A POS-Tagged Corpus for Dogri: Development and Annotation Using DogriTag
%J International Journal of Computer Applications
%V 187
%N 59
%P 36-43
%R 10.5120/ijca2025926001
%I Foundation of Computer Science (FCS), NY, USA

Abstract

This paper describes the process used to create a linguistically selected and manually annotated Part-of-Speech (POS) tagged corpus for Dogri. Dogri is a low-resource Indo-Aryan language spoken in the Indian Union Territory of Jammu and Kashmir and in some regions of Pakistan. Despite its substantial number of speakers and its official recognition, Dogri is poorly represented in Natural Language Processing (NLP), owing to the absence of resources such as defined tag sets, annotated corpora, and annotation-specific tools. To fill this gap, a POS-tagged Dogri corpus was developed from a domain-specific subset of the Linguistic Data Consortium for Indian Languages (LDC-IL). The corpus contains about 25,000 sentences (approximately 400,000 tokens). For the annotation, a specialized web platform named DogriTag was developed that supports audit tracking and semi-automated tag suggestions. Annotation quality was checked through inter-annotator agreement analysis, which yielded a Cohen's Kappa score of 0.89, indicating strong agreement. This resource is an important foundation for building NLP tools for Dogri, such as POS taggers, syntactic parsers, and morphological analyzers. Future work will include expanding the tag set, using pretrained language models for cross-lingual transfer, and covering more domains.
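
As an illustration of the agreement measure named in the abstract, the following is a minimal sketch of how Cohen's Kappa can be computed for two annotators who tagged the same tokens. It is not the authors' evaluation code: the function name `cohen_kappa`, the toy tag sequences, and the tag labels (NN, VM, JJ, PSP) are assumptions chosen for illustration and are not taken from the paper's tag set or data.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two annotators' tags over the same token sequence."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty, equal-length tag sequences")
    n = len(tags_a)

    # Observed agreement: fraction of tokens given the same tag by both.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n

    # Chance agreement: for each tag, the product of the two annotators'
    # marginal probabilities of using it, summed over all tags.
    freq_a = Counter(tags_a)
    freq_b = Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical tag everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: ten tokens tagged by two annotators.
annotator_1 = ["NN", "VM", "NN", "JJ", "PSP", "NN", "VM", "NN", "JJ", "PSP"]
annotator_2 = ["NN", "VM", "NN", "NN", "PSP", "NN", "VM", "JJ", "JJ", "PSP"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is lower than the simple percentage of matching tags whenever a few tags dominate the corpus.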

Index Terms
Computer Science
Information Sciences
Keywords

Dogri language; Part of speech tagging; ILPOSTS; low resource NLP; annotated corpus; Indian languages; inter annotator agreement; linguistic annotation; web-based annotation tool; Indo Aryan languages
