Optical Character Recognition and Named Entity Recognition for Highly Confidential Documents

Alaa Najmi; Mohamed A. El-Dosuky

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 186 - Issue 25

Published: June 2024

10.5120/ijca2024923718

Alaa Najmi, Mohamed A. El-Dosuky. Optical Character Recognition and Named Entity Recognition for Highly Confidential Documents. International Journal of Computer Applications. 186, 25 (June 2024), 20-26. DOI=10.5120/ijca2024923718

@article{10.5120/ijca2024923718,
  author    = {Alaa Najmi and Mohamed A. El-Dosuky},
  title     = {Optical Character Recognition and Named Entity Recognition for Highly Confidential Documents},
  journal   = {International Journal of Computer Applications},
  year      = {2024},
  volume    = {186},
  number    = {25},
  pages     = {20-26},
  doi       = {10.5120/ijca2024923718},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2024
%A Alaa Najmi
%A Mohamed A. El-Dosuky
%T Optical Character Recognition and Named Entity Recognition for Highly Confidential Documents
%J International Journal of Computer Applications
%V 186
%N 25
%P 20-26
%R 10.5120/ijca2024923718
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Optical character recognition (OCR) extracts textual data from various sources, which reduces human labor and improves accessibility and preservation. Named entity recognition (NER) organizes and categorizes the recognized data, while regular expression (regex) pattern matching extracts specific fields from the OCR-read text. These benefits matter most for confidential and sensitive documents, where large amounts of data would otherwise have to be extracted by hand. The study uses the Tesseract OCR tool and the Marefa-NER model, combining Artificial Neural Networks (ANN), Support Vector Machines (SVM), and Natural Language Processing (NLP) techniques. The technologies were integrated into websites, where the combination of OCR, NER, and regex patterns proved effective and efficient at accurately identifying textual content and categorizing it.
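
As a rough illustration of the regex stage described above, the following minimal sketch applies a few patterns to text that an OCR engine has already produced. The pattern set, the field names, the `extract_fields` helper, and the sample string are all assumptions made for illustration; the abstract does not list the patterns the authors used, and the OCR and NER stages are not shown here.

```python
import re

# Hypothetical field patterns, chosen only for illustration.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),      # ISO-style dates only
    "phone": re.compile(r"\+\d[\d ]{7,}\d"),           # international form with a leading "+"
}


def extract_fields(ocr_text):
    """Return all matches of each pattern in the text, keyed by field name."""
    return {name: pattern.findall(ocr_text) for name, pattern in PATTERNS.items()}


# Stand-in for text returned by an OCR engine such as Tesseract.
sample = "Contact: user@example.org, issued 2024-06-01, tel +20 100 123 4567."
fields = extract_fields(sample)

for name, matches in fields.items():
    print(name, matches)
```

In a full pipeline the input string would come from the OCR step, and the regex matches would sit alongside the entities returned by the NER model. Keeping the patterns in a single dictionary makes it straightforward to add or tighten a field without touching the extraction logic.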

Index Terms

Computer Science

Information Sciences

OCR

NER

Regex

Confidential documents

Sensitive data

Keywords

OCR, NER, Regex, Confidential documents, Sensitive data