International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Issue 25
Published: June 2024
Authors: Alaa Najmi, Mohamed A. El-Dosuky
Alaa Najmi, Mohamed A. El-Dosuky. Optical Character Recognition and Named Entity Recognition for Highly Confidential Documents. International Journal of Computer Applications. 186, 25 (June 2024), 20-26. DOI=10.5120/ijca2024923718
@article{ 10.5120/ijca2024923718,
  author = { Alaa Najmi and Mohamed A. El-Dosuky },
  title = { Optical Character Recognition and Named Entity Recognition for Highly Confidential Documents },
  journal = { International Journal of Computer Applications },
  year = { 2024 },
  volume = { 186 },
  number = { 25 },
  pages = { 20-26 },
  doi = { 10.5120/ijca2024923718 },
  publisher = { Foundation of Computer Science (FCS), NY, USA }
}
%0 Journal Article
%D 2024
%A Alaa Najmi
%A Mohamed A. El-Dosuky
%T Optical Character Recognition and Named Entity Recognition for Highly Confidential Documents
%J International Journal of Computer Applications
%V 186
%N 25
%P 20-26
%R 10.5120/ijca2024923718
%I Foundation of Computer Science (FCS), NY, USA
Optical character recognition (OCR) is a key technique for extracting textual data from various sources; it reduces manual labor and improves accessibility. Named entity recognition (NER) organizes and categorizes the extracted data, while regular expression (regex) pattern matching pulls specific fields out of the OCR-read text. Together, these techniques reduce the human effort needed to extract large amounts of data and improve its accessibility and preservation, which matters most when the documents are confidential or sensitive. The study uses the Tesseract OCR tool and the Marefa-NER model, combining artificial neural networks (ANN), support vector machines (SVM), and natural language processing (NLP) techniques. The technologies were integrated into websites, where they proved effective at accurately identifying textual content and categorizing it using OCR, NER, and regex patterns. The results indicate that combining OCR, NER, and regex pattern matching is a successful and efficient method for extracting textual information from various sources.
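As a rough illustration of the OCR-then-regex part of the pipeline described above, the following minimal sketch reads text from an image with Tesseract and then applies regex patterns to it. It is not the authors' implementation: the pattern set (`email`, `date`), the function names `ocr_image` and `extract_fields`, and the default `lang="eng"` are all assumptions made for this example, and the NER step is omitted. The OCR helper assumes the `pytesseract` and `Pillow` packages and a local Tesseract installation.

```python
import re

# Hypothetical patterns for illustration only; the paper does not list its regexes.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),  # ISO-style dates such as 2024-06-01
}


def ocr_image(path, lang="eng"):
    """Return the text Tesseract reads from an image file.

    Imports are done lazily so the regex step below can be used
    without pytesseract or Pillow installed.
    """
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path), lang=lang)


def extract_fields(text):
    """Return all matches of each pattern found in OCR-read text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


if __name__ == "__main__":
    # Stand-in for OCR output, so the example runs without an image.
    sample = "Contact: jane.doe@example.com\nIssued: 2024-06-01\nRef 12345"
    print(extract_fields(sample))
```

In practice, `extract_fields(ocr_image("scan.png"))` would chain the two steps, with `scan.png` standing in for any scanned page. Keeping processing local in this way is one reason such a pipeline may suit confidential documents, since the text never has to leave the machine.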