International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 21 - Issue 10
Published: May 2011
Authors: R.S. Somasundaram, R. Nedunchezhian
R.S. Somasundaram, R. Nedunchezhian. Evaluation of three Simple Imputation Methods for Enhancing Preprocessing of Data with Missing Values. International Journal of Computer Applications. 21, 10 (May 2011), 14-19. DOI=10.5120/2619-3544
@article{ 10.5120/2619-3544, author = { R.S. Somasundaram and R. Nedunchezhian }, title = { Evaluation of three Simple Imputation Methods for Enhancing Preprocessing of Data with Missing Values }, journal = { International Journal of Computer Applications }, year = { 2011 }, volume = { 21 }, number = { 10 }, pages = { 14-19 }, doi = { 10.5120/2619-3544 }, publisher = { Foundation of Computer Science (FCS), NY, USA } }
%0 Journal Article %D 2011 %A R.S. Somasundaram %A R. Nedunchezhian %T Evaluation of three Simple Imputation Methods for Enhancing Preprocessing of Data with Missing Values %J International Journal of Computer Applications %V 21 %N 10 %P 14-19 %R 10.5120/2619-3544 %I Foundation of Computer Science (FCS), NY, USA
Preprocessing, in which data is prepared for the various mining tasks, is one of the important stages of data mining. Real-world data is often incomplete, noisy, and inconsistent. It is very common for values to be unavailable for some observations of some variables, so missing values are to be expected in a data set. An important preprocessing task is therefore to fill in missing values, smooth out noise, and correct inconsistencies. This paper presents the missing value problem in data mining and evaluates some of the methods generally used for missing value imputation. Three simple missing value imputation methods are implemented: (1) constant substitution, (2) mean attribute value substitution, and (3) random attribute value substitution. The performance of the three imputation algorithms was measured at different rates (percentages) of missing values in the data set, using some known clustering methods. The standard WDBC data set was used for the evaluation.
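The three substitution strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that missing entries are encoded as NaN in a numeric NumPy array, that the constant defaults to 0, and that "random attribute value substitution" means drawing a replacement from the observed values of the same attribute. The function names (`impute_constant`, `impute_mean`, `impute_random`) and the `sample` array are hypothetical and chosen only for illustration.

```python
import numpy as np


def impute_constant(data, constant=0.0):
    """Replace every missing entry (NaN) with one fixed value."""
    filled = data.copy()
    filled[np.isnan(filled)] = constant
    return filled


def impute_mean(data):
    """Replace missing entries with the mean of the observed values
    of the same attribute (column)."""
    filled = data.copy()
    col_means = np.nanmean(filled, axis=0)   # per-column mean, ignoring NaN
    rows, cols = np.where(np.isnan(filled))  # positions of missing entries
    filled[rows, cols] = col_means[cols]
    return filled


def impute_random(data, rng=None):
    """Replace missing entries with values drawn at random from the
    observed values of the same attribute (assumed interpretation)."""
    rng = np.random.default_rng() if rng is None else rng
    filled = data.copy()
    for j in range(filled.shape[1]):
        missing = np.isnan(filled[:, j])
        observed = filled[~missing, j]
        # Skip columns with nothing to fill or nothing to draw from.
        if missing.any() and observed.size > 0:
            filled[missing, j] = rng.choice(observed, size=missing.sum())
    return filled


# Hypothetical toy data: two attributes, one missing value in each.
sample = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan]])

constant_filled = impute_constant(sample)
mean_filled = impute_mean(sample)
random_filled = impute_random(sample, np.random.default_rng(0))
```

Each function returns a filled copy and leaves the input array unchanged, so the same incomplete data set can be passed to all three methods and the results compared, for example by clustering each filled copy as the abstract describes.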