Text Extraction from Document Images- A Review

Deepika Ghai; Neelu Jain

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 84 - Issue 3

Published: December 2013

DOI: 10.5120/14559-2661

Deepika Ghai, Neelu Jain. Text Extraction from Document Images- A Review. International Journal of Computer Applications. 84, 3 (December 2013), 40-48. DOI=10.5120/14559-2661

@article{10.5120/14559-2661,
  author    = {Deepika Ghai and Neelu Jain},
  title     = {Text Extraction from Document Images- A Review},
  journal   = {International Journal of Computer Applications},
  year      = {2013},
  volume    = {84},
  number    = {3},
  pages     = {40-48},
  doi       = {10.5120/14559-2661},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2013
%A Deepika Ghai
%A Neelu Jain
%T Text Extraction from Document Images- A Review
%J International Journal of Computer Applications
%V 84
%N 3
%P 40-48
%R 10.5120/14559-2661
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Extracting text from an image is a challenging task in computer vision, and the extracted text is a valuable source of information. This paper discusses approaches such as the Adaptive Local Connectivity Map (ALCM), Expectation Maximization (EM), Maximum Likelihood (ML), Markov Random Field (MRF), the Spiral Run Length Smearing Algorithm (SRLSA) and the Curvelet transform for extracting text from images of scanned book covers, journals, multi-color documents, handwritten documents, ancient documents and newspapers. Text line segmentation is a major component of document image analysis. The appearance of text in a document depends on factors such as language, style, font, size, color, background and orientation, and on whether text lines fluctuate, cross or touch. This paper compares the performance of several existing document text extraction methods proposed by researchers on the basis of recall rate, precision rate, processing time and accuracy.
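
The abstract names recall rate and precision rate as comparison criteria but does not define them. As a minimal sketch, assuming the standard definitions (precision is the fraction of detected text regions that are correct, recall is the fraction of true text regions that are detected), they could be computed as follows. The function name and the counts are hypothetical and are not taken from the paper.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from detection counts.

    Assumes the standard definitions; the reviewed methods may
    count regions, lines, or characters, which this sketch ignores.
    """
    detected = true_positives + false_positives   # everything reported as text
    actual = true_positives + false_negatives     # everything that really is text
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(true_positives=90, false_positives=10, false_negatives=30)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.90 recall=0.75
```

A method that reports few but correct regions scores high on precision and low on recall, which is why comparisons of this kind usually report both.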

References

N. Anupama, C. Rupa, E. S. Reddy, Character Segmentation for Telugu Image Document using Multiple Histogram Projections, Global Journal of Computer Science and Technology Graphics and Vision, 13 (2013) 11-16.
S. Malakar, S. Halder, R. Sarker, N. Das, S. Basu, M. Nasipuri, Text Line Extraction from Handwritten Document Pages Using Spiral Run Length Smearing Algorithm, International Conference on Communications, Devices and Intelligent Systems, Kolkata, Dec. 28-29 (2012) 616-619.
S. J. Ha, B. Jin, N. I. Cho, Fast Text Line Extraction in Document Images, 19th IEEE International Conference on Image Processing, Orlando, Sept. 30-Oct 3 (2012) 797-800.
S. V. Seeri, S. Giraddi, Prashant B. M., A Novel Approach for Kannada Text Extraction, Proceedings of the International Conference on Pattern Recognition, Informatics and Medical Engineering, Tamil Nadu, Mar. 21-23 (2012) 444-448.
Z. Li, J. Luo, Resolution Enhancement from Document Images for Text Extraction, 5th International Conference on Multimedia and Ubiquitous Engineering, Loutraki, June 28-30 (2011) 251-256.
D. Zaravi, H. Rostami, A. Malahzaheh, S. S. Mortazavi, Journals Subheadlines Text Extraction Using Wavelet Thresholding and New Projection Profile, World Academy of Science, Engineering and Technology, 49 (2011) 686-689.
T. V. Hoang, S. Tabbone, Text Extraction From Graphical Document Images Using Sparse Representation, International Workshop on Document Analysis Systems, June 9-11 (2010) 143-150.
P. Nagabhushan, S. Nirmala, Text Extraction in Complex Color Document Images for Enhanced Readability, Intelligent Information Management, 2 (2010) 120-133.
V. K. Koppula, N. Atul, U. Garain, Robust Text Line, Word And Character Extraction From Telugu Document Image, 2nd International Conference on Emerging Trends in Engineering and Technology, Dec. 16-18 (2009) 269-272.
R. P. D. Santos, G. S. Clemente, T. I. Ren, G. D. C. Cavalcanti, Text Line Segmentation Based on Morphology and Histogram Projection, 10th International Conference on Document Analysis and Recognition, Spain, July 26-29 (2009) 651-655.
H. Kawano, H. Orii, H. Maeda, N. Ikoma, Text Extraction from Degraded Document Image Independent of Character Color Based on MAP-MRF Approach, IEEE, Jeju Island, Aug. 20-24 (2009) 165-168.
S. Grover, K. Arora, S. K. Mitra, Text Extraction from Document Images using Edge Information, IEEE, Gujarat, Dec. 18-20 (2009) 1-4.
S. Audithan, RM. Chandrasekaran, Document Text Extraction from Document Images Using Haar Discrete Wavelet Transform, European Journal of Scientific Research, 36 (2009) 502-512.
S. S. Bukhari, T. M. Breuel, F. Shafait, Textline Information Extraction from Grayscale Camera-Captured Document Images, ICIP Proceedings of the 16th IEEE International Conference on Image Processing, Cairo, Nov. 7-10 (2009) 2013-2016.
W. Boussellaa, A. Bougacha, A. Zahour, H. E. Abed, A. Alimi, Enhanced Text Extraction from Arabic Degraded Document Images using EM Algorithm, 10th International Conference on Document Analysis and Recognition, Barcelona, July 26-29 (2009) 743-747.
Z. Shi, S. Setlur, V. Govindaraju, A Steerable Directional Local Profile Technique for Extraction of Handwritten Arabic Text Lines, 10th International Conference on Document Analysis and Recognition, Barcelona, July 26-29 (2009) 176-180.
D. Sarkar, R. Ghosh, A Bottom-Up Approach of Line Segmentation from Handwritten Text, (2009).
Y. L. Qiao, M. Li, Z. M. Lu, S. H. Sun, Gabor Filter Based Text Extraction from Digital Document Images, Proceedings of the 2006 International Conference on Intelligent Information Hiding and Multimedia Signal Processing, USA, (2006) 297-300.
A. Lemaitre, J. Camillerapp, Text Line Extraction in Handwritten Document with Kalman Filter Applied on Low Resolution Image, Proceedings of the 2nd International Conference on Document Image Analysis for Libraries, Lyon, April 27-28 (2006) 45-52.
Z. Shi, S. Setlur, V. Govindaraju, Text Extraction from Gray Scale Historical Document Images Using Adaptive Local Connectivity Map, Proceedings of the 8th International Conference on Document Analysis and Recognition, Aug. 29-Sept. 1 (2005) 794-798.
Y. J. Song, K. C. Kim, Y. W. Choi, H. R. Byun, S. H. Kim, S. Y. Chi, D. K. Jang, Y. K. Chung, Text Region Extraction and Text Segmentation on Camera-captured Document Style Images, Proceedings of the 2005 Eighth International Conference on Document Analysis and Recognition, Aug. 29-Sept. 1 (2005) 172-176.
S. Raju S, P. B. Pati, A. G. Ramakrishnan, Gabor Filter Based Block Energy Analysis for Text Extraction from Digital Document Images, Proceedings of the 1st International Workshop on Document Image Analysis for Libraries, (2004) 233-243.
A. Negi, N. Kasinadhuni, Localization and Extraction of Text in Telugu Document Images, Proceedings of the 7th International Conference on Document Analysis and Recognition, Oct. 15-17 (2003) 749-752.
A. R. Chaudhuri, A. K. Mandal, B. B. Chaudhuri, Page Layout Analyser for Multilingual Indian Documents, Proceedings of the Language Engineering Conference, (2002).
Q. Yuan, C. L. Tan, Text Extraction from Gray Scale Document Images Using Edge Information, Washington, Sept. 10-13 (2001) 302-306.
K. Sobottka, H. Bunke, H. Kronenberg, Identification of Text on Colored Book and Journal Covers, Document Analysis and Recognition, Bangalore, Sept. 20-22 (1999) 57-62.
H. M. Suen, J. F. Wang, Text string extraction from images of colour-printed documents, IEEE Proceedings of Vision, Image and Signal Processing, 143 (1996) 210-216.

Index Terms

Computer Science

Information Sciences

Keywords

Optical Character Recognition (OCR); Morphological Component Analysis (MCA); Undecimated Wavelet Transform (UWT); Discrete Wavelet Transform (DWT); Connected Component Analysis (CCA); Adaptive Local Connectivity Map (ALCM); Expectation Maximization (EM); Maximum Likelihood (ML); Spiral Run Length Smearing Algorithm (SRLSA); Resolution Enhancement (RE); Markov Random Field (MRF); Maximum A-posteriori Probability (MAP); Block Energy Analysis (BEA); Support Vector Machine (SVM); Thin Line Coding (TLC); Constrained Run Length Algorithm (CRLA).