Research Article

Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means

by Satish Gopalani, Rohan Arora
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 113 - Issue 1
Published: March 2015
10.5120/19788-0531

Satish Gopalani, Rohan Arora. Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means. International Journal of Computer Applications. 113, 1 (March 2015), 8-11. DOI=10.5120/19788-0531

@article{10.5120/19788-0531,
  author    = {Satish Gopalani and Rohan Arora},
  title     = {Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means},
  journal   = {International Journal of Computer Applications},
  year      = {2015},
  volume    = {113},
  number    = {1},
  pages     = {8-11},
  doi       = {10.5120/19788-0531},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2015
%A Satish Gopalani
%A Rohan Arora
%T Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means
%J International Journal of Computer Applications
%V 113
%N 1
%P 8-11
%R 10.5120/19788-0531
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Big Data has long been a topic of fascination for Computer Science enthusiasts around the world, and it has gained even more prominence in recent times with the continuous explosion of data from sources such as social media and the drive of tech giants to analyze their data more deeply. This paper compares two frameworks, Hadoop Map Reduce and the recently introduced Apache Spark, both of which provide a processing model for analyzing big data. Although both were built for Big Data workloads, their performance varies significantly with the use case being implemented. This variation is what makes the two frameworks worth analyzing in the dynamic field of Big Data. In this paper we compare the two frameworks and provide a performance analysis using a standard machine learning algorithm for clustering (K-Means).
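For readers unfamiliar with the benchmark algorithm, the following is a minimal single-machine sketch of K-Means (Lloyd's algorithm), written so that each iteration is split into a map-like assignment step and a reduce-like averaging step. It is an illustration only and not the authors' implementation; the function names `nearest` and `kmeans`, the sample points, and the starting centroids are all hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def nearest(point, centroids):
    """Map step: return the index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Lloyd's K-Means expressed as repeated map and reduce steps."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # Map: group every point under the index of its nearest centroid.
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)

        # Reduce: average the points of each cluster to get its new centroid.
        updated = []
        for i, c in enumerate(centroids):
            members = groups.get(i)
            if members:
                updated.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                updated.append(c)  # an empty cluster keeps its old centroid

        if updated == centroids:  # no centroid moved, so the result is stable
            break
        centroids = updated
    return centroids


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
result = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(result)
```

Because every iteration scans the full data set again, K-Means is a typical iterative workload for comparing processing frameworks.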

Index Terms
Computer Science
Information Sciences
Keywords

Big data, Hadoop, HDFS, Map Reduce, Spark, Mahout, MLlib, Machine learning, K-Means
