An Exploration of Designing a Hybrid Scale-Up/Out Hadoop Architecture Based on Performance Measurements

Scale-up machines perform better for jobs with small and median (KB, MB) data sizes, while scale-out machines perform better for jobs with large (GB, TB) data size. Since a workload usually consists of jobs with different data size levels, we propose building a hybrid Hadoop architecture that includes both scale-up and scale-out machines.

Detailed Description

Bibliographic Details
Published in: IEEE transactions on parallel and distributed systems, 2017-02, Vol. 28 (2), p. 386-400
Main authors: Zhuozhao Li; Haiying Shen; Ligon, Walter; Denton, Jeffrey
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page 400
container_issue 2
container_start_page 386
container_title IEEE transactions on parallel and distributed systems
container_volume 28
creator Zhuozhao Li
Haiying Shen
Ligon, Walter
Denton, Jeffrey
description Scale-up machines perform better for jobs with small and median (KB, MB) data sizes, while scale-out machines perform better for jobs with large (GB, TB) data size. Since a workload usually consists of jobs with different data size levels, we propose building a hybrid Hadoop architecture that includes both scale-up and scale-out machines, which however is not trivial. The first challenge is workload data storage. Thousands of small data size jobs in a workload may overload the limited local disks of scale-up machines. Jobs from scale-up and scale-out machines may both request the same set of data, which leads to data transmission between the machines. The second challenge is to automatically schedule jobs to either scale-up or scale-out cluster to achieve the best performance. We conduct a thorough performance measurement of different applications on scale-up and scale-out clusters, configured with Hadoop Distributed File System (HDFS) and a remote file system (i.e., OFS), respectively. We find that using OFS rather than HDFS can solve the data storage challenge. Also, we identify the factors that determine the performance differences on the scale-up and scale-out clusters and their cross points to make the choice. Accordingly, we design and implement the hybrid scale-up/out Hadoop architecture. Our trace-driven experimental results show that our hybrid architecture outperforms both the traditional Hadoop architecture with HDFS and with OFS in terms of job completion time, throughput and job failure rate.
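The core of the scheduling idea in the abstract is a size-based routing rule: compare a job's input data size with a measured cross point and submit the job to either the scale-up or the scale-out cluster accordingly. The short Java sketch below illustrates only that decision rule; the class name, enum, and fixed 1 GB cross point are hypothetical placeholders assumed for illustration, not the authors' implementation, which derives per-application cross points from measurements and also relies on shared OFS storage.

public class HybridDispatchSketch {

    // Which sub-cluster a job would be routed to.
    enum Target { SCALE_UP, SCALE_OUT }

    // Hypothetical cross point: the paper derives per-application cross points from
    // measurements; a fixed 1 GB value is assumed here purely for illustration.
    static final long CROSS_POINT_BYTES = 1L << 30;

    // Route a job to the scale-up cluster when its input size is at or below the
    // cross point, otherwise to the scale-out cluster.
    static Target route(long inputSizeBytes) {
        return inputSizeBytes <= CROSS_POINT_BYTES ? Target.SCALE_UP : Target.SCALE_OUT;
    }

    public static void main(String[] args) {
        // Example job input sizes: 64 KB, 200 MB, 5 GB, 2 TB.
        long[] jobSizes = { 64L << 10, 200L << 20, 5L << 30, 2L << 40 };
        for (long size : jobSizes) {
            System.out.printf("job of %,d bytes -> %s cluster%n", size, route(size));
        }
    }
}

In a real deployment the cross point would be loaded per application from configuration or job profiles rather than hard-coded.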
doi_str_mv 10.1109/TPDS.2016.2573820
format Article
coden ITDSEO
eissn 1558-2183
publisher New York: IEEE
rights Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2017
fulltext fulltext_linktorsrc
identifier ISSN: 1045-9219
ispartof IEEE transactions on parallel and distributed systems, 2017-02, Vol.28 (2), p.386-400
issn 1045-9219
1558-2183
language eng
recordid cdi_ieee_primary_7480403
source IEEE Electronic Library (IEL)
subjects Architecture
Clusters
Completion time
Computer architecture
Data communication
Data storage
Data transmission
Disks
Distributed databases
Facebook
Failure rates
Hadoop
hybrid architecture
Measurement
Performance measurement
Random access memory
remote file system
scale-out
scale-up
Workload
Workloads
title An Exploration of Designing a Hybrid Scale-Up/Out Hadoop Architecture Based on Performance Measurements
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T09%3A46%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=An%20Exploration%20of%20Designing%20a%20Hybrid%20Scale-Up/Out%20Hadoop%20Architecture%20Based%20on%20Performance%20Measurements&rft.jtitle=IEEE%20transactions%20on%20parallel%20and%20distributed%20systems&rft.au=Zhuozhao%20Li&rft.date=2017-02-01&rft.volume=28&rft.issue=2&rft.spage=386&rft.epage=400&rft.pages=386-400&rft.issn=1045-9219&rft.eissn=1558-2183&rft.coden=ITDSEO&rft_id=info:doi/10.1109/TPDS.2016.2573820&rft_dat=%3Cproquest_RIE%3E2174474262%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2174474262&rft_id=info:pmid/&rft_ieee_id=7480403&rfr_iscdi=true