Evaluation of three Simple Imputation Methods for Enhancing Preprocessing of Data with Missing Values

R.S. Somasundaram; R. Nedunchezhian

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 21 - Issue 10

Published: May 2011

DOI: 10.5120/2619-3544

R.S. Somasundaram, R. Nedunchezhian. Evaluation of three Simple Imputation Methods for Enhancing Preprocessing of Data with Missing Values. International Journal of Computer Applications. 21, 10 (May 2011), 14-19. DOI=10.5120/2619-3544

@article{10.5120/2619-3544,
  author    = {R.S. Somasundaram and R. Nedunchezhian},
  title     = {Evaluation of three Simple Imputation Methods for Enhancing Preprocessing of Data with Missing Values},
  journal   = {International Journal of Computer Applications},
  year      = {2011},
  volume    = {21},
  number    = {10},
  pages     = {14-19},
  doi       = {10.5120/2619-3544},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2011
%A R.S. Somasundaram
%A R. Nedunchezhian
%T Evaluation of three Simple Imputation Methods for Enhancing Preprocessing of Data with Missing Values
%J International Journal of Computer Applications
%V 21
%N 10
%P 14-19
%R 10.5120/2619-3544
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Preprocessing, in which data are prepared for the mining tasks that follow, is one of the important stages of data mining. Real-world data are often incomplete, noisy, and inconsistent, and it is very common for values to be unavailable for some observations of some variables, so missing values are to be expected in a data set. An important preprocessing task is therefore to fill in missing values, smooth out noise, and correct inconsistencies. This paper presents the missing value problem in data mining and evaluates some of the methods generally used for missing value imputation. Three simple imputation methods are implemented: (1) constant substitution, (2) mean attribute value substitution, and (3) random attribute value substitution. The performance of the three methods was measured at different rates (percentages) of missing values in the data set, using some known clustering methods. The standard WDBC data set was used for the evaluation.
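
The three substitution methods named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of `None` as the missing-value marker, the default constant of 0.0, and the reading of random attribute value substitution as a draw from the observed values of the same attribute are all assumptions made here for illustration. The `knock_out` helper is likewise hypothetical and only shows one way to produce a chosen rate of missing values for an experiment.

```python
import random
from statistics import mean

MISSING = None  # marker for a missing cell (an assumption of this sketch)


def impute_constant(column, constant=0.0):
    """Replace every missing entry with one fixed constant."""
    return [constant if v is MISSING else v for v in column]


def impute_mean(column):
    """Replace every missing entry with the mean of the observed entries."""
    observed = [v for v in column if v is not MISSING]
    if not observed:
        raise ValueError("column has no observed values")
    fill = mean(observed)
    return [fill if v is MISSING else v for v in column]


def impute_random(column, rng=None):
    """Replace every missing entry with a randomly chosen observed entry."""
    if rng is None:
        rng = random.Random()
    observed = [v for v in column if v is not MISSING]
    if not observed:
        raise ValueError("column has no observed values")
    return [rng.choice(observed) if v is MISSING else v for v in column]


def knock_out(column, rate, rng=None):
    """Mark a fraction `rate` of the entries as missing, chosen uniformly."""
    if rng is None:
        rng = random.Random()
    n_missing = round(rate * len(column))
    chosen = set(rng.sample(range(len(column)), n_missing))
    return [MISSING if i in chosen else v for i, v in enumerate(column)]


column = [2.0, MISSING, 4.0, 6.0, MISSING]
print(impute_constant(column))  # [2.0, 0.0, 4.0, 6.0, 0.0]
print(impute_mean(column))      # [2.0, 4.0, 4.0, 6.0, 4.0]
print(impute_random(column, random.Random(0)))  # gaps filled from {2.0, 4.0, 6.0}
```

Each function works on a single attribute (column), so a full data set would be imputed one attribute at a time. Passing a seeded `random.Random` instance makes the random variant and the knock-out step repeatable across runs.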

Index Terms

Computer Science

Information Sciences

Keywords

Data mining, Preprocessing, Imputation methods, Missing data values