Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means

Satish Gopalani; Rohan Arora

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 113 - Issue 1

Published: March 2015

DOI: 10.5120/19788-0531

Satish Gopalani, Rohan Arora. Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means. International Journal of Computer Applications. 113, 1 (March 2015), 8-11. DOI=10.5120/19788-0531

@article{10.5120/19788-0531,
  author    = {Satish Gopalani and Rohan Arora},
  title     = {Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means},
  journal   = {International Journal of Computer Applications},
  year      = {2015},
  volume    = {113},
  number    = {1},
  pages     = {8--11},
  doi       = {10.5120/19788-0531},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2015
%A Satish Gopalani
%A Rohan Arora
%T Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means
%J International Journal of Computer Applications
%V 113
%N 1
%P 8-11
%R 10.5120/19788-0531
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Big Data has long fascinated computer science enthusiasts around the world, and it has gained even more prominence recently with the continuous explosion of data from sources such as social media and the push by tech giants for deeper analysis of their data. This paper compares two frameworks, Hadoop Map Reduce and the recently introduced Apache Spark, both of which provide a processing model for analyzing big data. Although both target Big Data workloads, their performance varies significantly with the use case being implemented. This makes the two frameworks worth analyzing with respect to their variability and variety in the dynamic field of Big Data. In this paper we compare the two frameworks and provide a performance analysis using a standard machine learning algorithm for clustering (K-Means).
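For readers unfamiliar with the benchmark algorithm, the following is a minimal single-machine sketch of K-Means (Lloyd's algorithm) in plain Python. It is not the authors' code and not the Mahout or MLlib implementation; the function names `closest` and `kmeans`, the sample points, and the starting centroids are all hypothetical. It only illustrates that each iteration has an assignment step, which resembles a map, and a centroid update step, which resembles a reduce, and that every iteration passes over the full dataset again.

```python
from collections import defaultdict


def closest(point, centroids):
    """Return the index of the centroid nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans(points, centroids, max_iters=20):
    """Illustrative Lloyd's algorithm; stops when centroids no longer change."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iters):
        # "Map"-like step: assign every point to its nearest centroid.
        groups = defaultdict(list)
        for p in points:
            groups[closest(p, centroids)].append(p)

        # "Reduce"-like step: average the points assigned to each cluster.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = groups.get(i)
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # empty cluster: keep the old centroid

        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
result = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(result)
```

On this toy input the loop converges after the second pass, because the assignments produced by the updated centroids match those of the first pass.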

Index Terms

Computer Science

Information Sciences


Keywords

Big data, Hadoop, HDFS, Map Reduce, Spark, Mahout, MLlib, Machine learning, K-Means