Hindi to English Machine Transliteration of Named Entities using Conditional Random Fields

Manikrao L Dhore; Shantanu K Dixit; Tushar D Sonwalkar

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 48 - Issue 23

Published: June 2012

DOI: 10.5120/7522-0624

Manikrao L Dhore, Shantanu K Dixit, Tushar D Sonwalkar. Hindi to English Machine Transliteration of Named Entities using Conditional Random Fields. International Journal of Computer Applications. 48, 23 (June 2012), 31-37. DOI=10.5120/7522-0624

@article{10.5120/7522-0624,
  author    = {Manikrao L Dhore and Shantanu K Dixit and Tushar D Sonwalkar},
  title     = {Hindi to English Machine Transliteration of Named Entities using Conditional Random Fields},
  journal   = {International Journal of Computer Applications},
  year      = {2012},
  volume    = {48},
  number    = {23},
  pages     = {31-37},
  doi       = {10.5120/7522-0624},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2012
%A Manikrao L Dhore
%A Shantanu K Dixit
%A Tushar D Sonwalkar
%T Hindi to English Machine Transliteration of Named Entities using Conditional Random Fields
%J International Journal of Computer Applications
%V 48
%N 23
%P 31-37
%R 10.5120/7522-0624
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Machine transliteration has received significant research attention in recent years. In most cases, the source language has been English and the target language an Asian language. This paper focuses on Hindi to English machine transliteration of Indian named entities such as proper nouns, place names and organization names using conditional random fields (CRF). Hindi is the national language of India and is spoken by more than 500 million Indians. It is the world's fourth most commonly used language after Chinese, English and Spanish. The system takes an Indian place name as input in Hindi, written in Devanagari script, and transliterates it into English. The input is provided in syllabified form so that n-gram techniques can be applied. As more than 50% of named entities are formed as a combination of two or three syllabic units, unigrams, bigrams and trigrams of Hindi syllabic units are used for training on the corpus. The system performs satisfactorily with trigrams, compared with unigrams and bigrams.
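
The n-gram setup described above can be illustrated with a minimal sketch. It assumes the name has already been split into syllabic units and builds, for each unit, the unigram, bigram and trigram ending at that unit, which is one common way to supply context to a CRF. The function name `ngram_features`, the `<s>` padding marker and the split of जयपुर used here are illustrative assumptions, not the authors' feature templates or syllabification scheme.

```python
from typing import Dict, List

PAD = "<s>"  # hypothetical boundary marker for positions before the name starts


def ngram_features(units: List[str], max_n: int = 3) -> List[Dict[str, str]]:
    """Return one feature dict per syllabic unit.

    Each dict holds the 1-gram, 2-gram, ..., max_n-gram of source units
    that ends at the current unit, padded at the start of the name.
    """
    padded = [PAD] * (max_n - 1) + units
    features = []
    for i in range(len(units)):
        j = i + max_n - 1  # index of the current unit inside the padded list
        feats = {}
        for n in range(1, max_n + 1):
            feats[f"{n}-gram"] = "|".join(padded[j - n + 1 : j + 1])
        features.append(feats)
    return features


# Assumed split of the place name जयपुर (Jaipur) into units, for illustration only.
units = ["ज", "य", "पु", "र"]
features = ngram_features(units)
for unit, feats in zip(units, features):
    print(unit, feats)
```

In a full system, each feature dict would be paired with the English unit aligned to that position, and the resulting sequences would be passed to a CRF trainer. Restricting `max_n` to 1, 2 or 3 corresponds to the unigram, bigram and trigram settings the abstract compares.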

Index Terms

Computer Science

Information Sciences

Keywords

Bigram, Conditional Random Fields, Trigram, Transliteration, Syllabification