Research Article

Web scraping localized parallel multilingual help content in Indian languages

by S. Winston Cruz, G. Roch Libia Rani
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 69
Published: December 2025
10.5120/ijca2025926157

S. Winston Cruz, G. Roch Libia Rani. Web scraping localized parallel multilingual help content in Indian languages. International Journal of Computer Applications. 187, 69 (December 2025), 35-42. DOI=10.5120/ijca2025926157

@article{10.5120/ijca2025926157,
  author    = {S. Winston Cruz and G. Roch Libia Rani},
  title     = {Web scraping localized parallel multilingual help content in Indian languages},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {69},
  pages     = {35-42},
  doi       = {10.5120/ijca2025926157},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2025
%A S. Winston Cruz
%A G. Roch Libia Rani
%T Web scraping localized parallel multilingual help content in Indian languages
%J International Journal of Computer Applications
%V 187
%N 69
%P 35-42
%R 10.5120/ijca2025926157
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The demand for multilingual corpora has grown sharply with advances in web mining and large language models (LLMs). Extracting multilingual data from websites is one way to build parallel corpora, and controlled use of web scraping is a useful technique for creating them. Among the various types of localized content on the web, which include raw machine translations, content produced by human translators and reviewers is the most useful, followed by machine-translated content that has undergone human post-editing. Help center documents and terms and conditions documents that websites publish in different languages fall into these categories. In this paper, such content is identified manually and the issues in scraping it are discussed. Python code that uses the BeautifulSoup library to extract this material in Indian languages such as Hindi, Kannada, and Tamil is presented. The concerns in aligning the extracted content with its English source are then discussed. Finally, details of the sample parallel corpus that was extracted are analyzed and presented.
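The following is a minimal sketch of the kind of pipeline the abstract describes, not the authors' code: it pulls text segments out of two localized versions of a help page and pairs them by position. The tag list, the function names `extract_segments` and `align`, and the inline sample pages are assumptions made for illustration; the paragraph text in the samples is made up. BeautifulSoup is used when the `bs4` package is installed, and a small standard-library parser is swapped in otherwise so the sketch still runs.

```python
from html.parser import HTMLParser

try:
    from bs4 import BeautifulSoup  # third-party package: beautifulsoup4
except ImportError:
    BeautifulSoup = None  # the standard-library fallback below is used instead

TAGS = ["h1", "p", "li"]  # assumed block-level tags that carry help text


class _Collector(HTMLParser):
    """Fallback extractor: gathers the text inside each tag listed in TAGS."""

    def __init__(self):
        super().__init__()
        self.segments, self._buffer, self._open = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in TAGS:
            self._open, self._buffer = True, []

    def handle_endtag(self, tag):
        if tag in TAGS and self._open:
            self.segments.append("".join(self._buffer))
            self._open = False

    def handle_data(self, data):
        if self._open:
            self._buffer.append(data)


def extract_segments(html):
    """Return the whitespace-normalised text of each TAGS element, in page order."""
    if BeautifulSoup is not None:
        soup = BeautifulSoup(html, "html.parser")
        raw = [node.get_text() for node in soup.find_all(TAGS)]
    else:
        parser = _Collector()
        parser.feed(html)
        parser.close()
        raw = parser.segments
    cleaned = (" ".join(text.split()) for text in raw)
    return [text for text in cleaned if text]


def align(source, target):
    """Pair segments by position; refuse to guess when the counts differ."""
    if len(source) != len(target):
        raise ValueError(f"segment counts differ: {len(source)} vs {len(target)}")
    return list(zip(source, target))


# Illustrative stand-ins for an English page and its Hindi localization.
EN_HTML = """<html><body><script>var x = 1;</script>
<h1>Wake, unlock, and lock iPad</h1>
<p>Press the
   top button.</p></body></html>"""
HI_HTML = """<html><body>
<h1>iPad सक्रिय करें, अनलॉक और लॉक करें</h1>
<p>टॉप बटन दबाएँ।</p></body></html>"""

if __name__ == "__main__":
    for english, hindi in align(extract_segments(EN_HTML), extract_segments(HI_HTML)):
        print(english, "\t", hindi)
```

Positional pairing only holds when both language versions share the same page structure. Raising an error on a count mismatch, rather than truncating silently, keeps misaligned pages out of the corpus until they have been inspected.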

Index Terms
Computer Science
Information Sciences
Parallel corpora
Indian languages
web mining
Keywords

Web scraping, BeautifulSoup, localization, Tamil, Kannada, Hindi
