Performance modeling and analysis of a Hadoop cluster for efficient big data processing

Jong Beom Lim, Jong Suk Ahn, Kang Woo Lee

Research output: Contribution to journal › Article › peer-review

1 Scopus citations

Abstract

Although Apache Hadoop, an open-source Java implementation of the MapReduce programming model, has become a popular big data framework, it is important to understand the challenges of using Hadoop for varying input data sizes and how efficiently a Hadoop cluster performs under different configurations. In this regard, there is a need to understand the impact of Hadoop's implementation of the data-parallel programming model on the performance of big data processing. In this paper, we design a performance model of a Hadoop cluster that takes the number of Map and Reduce tasks into consideration. Because each Hadoop cluster has its own characteristics and system parameters, it is not enough to use Hadoop's default configuration settings. Furthermore, we present a performance analysis based on real-world environments using cloud computing. Through various performance evaluations, we identified a performance tradeoff between the number of Map and Reduce tasks and the processing time of a job. Based on our observations of big data jobs with varying input data sizes, we formulated a performance model for a Hadoop cluster not only from a microscopic view but also from a macroscopic view. Our performance model for Hadoop clusters helps estimate the processing rate and the average processing time for given input dataset sizes, and helps choose suitable configurations, which largely influence overall Hadoop cluster performance.
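The abstract describes estimating a job's processing rate and average processing time from the input size and the Map/Reduce task counts. The paper's actual model is not given here; the following is only a toy sketch, under the common simplifying assumption that each stage's time scales with input size divided by aggregate stage throughput plus a fixed per-stage overhead. All parameter names and figures (`map_rate_gb_s`, `reduce_rate_gb_s`, `task_overhead_s`) are hypothetical placeholders, not values from the paper.

```python
# Toy sketch of a two-stage Hadoop job-time estimate.
# NOT the paper's model: all rates and overheads are hypothetical.

def estimate_job_time(input_gb, num_maps, num_reduces,
                      map_rate_gb_s=0.05, reduce_rate_gb_s=0.03,
                      task_overhead_s=2.0):
    """Estimate total job time (seconds) as map stage + reduce stage,
    where each stage processes the input at an aggregate rate
    proportional to its task count, plus a fixed startup overhead."""
    map_time = input_gb / (num_maps * map_rate_gb_s) + task_overhead_s
    reduce_time = input_gb / (num_reduces * reduce_rate_gb_s) + task_overhead_s
    return map_time + reduce_time

def processing_rate(input_gb, num_maps, num_reduces):
    """Average processing rate (GB/s) implied by the estimate above."""
    return input_gb / estimate_job_time(input_gb, num_maps, num_reduces)
```

Even this crude sketch exhibits the tradeoff the abstract mentions: adding tasks shortens the parallel portion of each stage but adds per-task overhead, so the best task counts depend on the input size and cluster parameters.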

Original language: English
Pages (from-to): 2314-2319
Number of pages: 6
Journal: Advanced Science Letters
Volume: 22
Issue number: 9
DOIs
State: Published - Sep 2016

Keywords

  • Big data
  • Hadoop
  • MapReduce
  • Performance model
