A Comprehensive Study of Challenges and Approaches for Clustering High Dimensional Data

Neelam Singh; Neha Garg; Janmejay Pant

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 92 - Issue 4

Published: April 2014

10.5120/15995-4844

Neelam Singh, Neha Garg, Janmejay Pant. A Comprehensive Study of Challenges and Approaches for Clustering High Dimensional Data. International Journal of Computer Applications. 92, 4 (April 2014), 7-10. DOI=10.5120/15995-4844

@article{10.5120/15995-4844,
  author    = {Neelam Singh and Neha Garg and Janmejay Pant},
  title     = {A Comprehensive Study of Challenges and Approaches for Clustering High Dimensional Data},
  journal   = {International Journal of Computer Applications},
  year      = {2014},
  volume    = {92},
  number    = {4},
  pages     = {7-10},
  doi       = {10.5120/15995-4844},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2014
%A Neelam Singh
%A Neha Garg
%A Janmejay Pant
%T A Comprehensive Study of Challenges and Approaches for Clustering High Dimensional Data
%J International Journal of Computer Applications
%V 92
%N 4
%P 7-10
%R 10.5120/15995-4844
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Clustering is one of the most effective methods for summarizing and analyzing datasets, which are collections of data objects that may be similar or dissimilar in nature. Clustering aims to find groups, or clusters, of objects with similar attributes. Most clustering methods work efficiently on low-dimensional data because they rely on distance measures to find dissimilarities between objects. High-dimensional data, however, may contain attributes that are not needed to define the clusters, and these irrelevant dimensions can introduce noise that hides the clusters to be found. Discovering groups of objects that are highly similar within some subset of relevant attributes therefore becomes an important but challenging task. In this paper we provide a short introduction to the challenges of, and various approaches to, high-dimensional data clustering.
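The effect described in the abstract, where irrelevant dimensions hide clusters from distance-based methods, can be illustrated with a small sketch. It is not taken from the paper: the two synthetic clusters, the noise range, and the names `make_point` and `separation_ratio` are all assumptions chosen for illustration. The sketch measures how much farther apart points from different clusters are than points from the same cluster, as irrelevant attributes are added.

```python
import math
import random
from itertools import combinations


def make_point(center, n_noise_dims, rng):
    """Build one point: 2 relevant attributes near `center`, plus noise attributes."""
    # Relevant attributes: tightly grouped around the cluster center.
    relevant = [rng.gauss(c, 0.1) for c in center]
    # Irrelevant attributes: uniform noise that carries no cluster information.
    noise = [rng.uniform(0.0, 10.0) for _ in range(n_noise_dims)]
    return relevant + noise


def separation_ratio(n_noise_dims, n_per_cluster=30, seed=0):
    """Mean between-cluster distance divided by mean within-cluster distance.

    A ratio well above 1 means Euclidean distance separates the clusters;
    a ratio near 1 means the clusters are hidden from a distance measure.
    """
    rng = random.Random(seed)
    a = [make_point((0.0, 0.0), n_noise_dims, rng) for _ in range(n_per_cluster)]
    b = [make_point((1.0, 1.0), n_noise_dims, rng) for _ in range(n_per_cluster)]

    within = [math.dist(p, q) for group in (a, b) for p, q in combinations(group, 2)]
    between = [math.dist(p, q) for p in a for q in b]
    return (sum(between) / len(between)) / (sum(within) / len(within))


if __name__ == "__main__":
    for noise_dims in (0, 2, 10, 50):
        print(noise_dims, round(separation_ratio(noise_dims), 2))
```

In this toy setting the ratio falls toward 1 as noise attributes are added, even though the two relevant attributes still separate the clusters cleanly. This is the situation that motivates searching for clusters within subsets of relevant attributes.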

Index Terms

Computer Science

Information Sciences

Keywords

Clustering, high dimensional data, summarizing, analyzing, clusters