Hive - a petabyte scale data warehouse using Hadoop

The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source map-reduce implementation which is being used in companies like Yahoo, Facebook etc. to s...

Bibliographic Details
Main authors:	Thusoo, Ashish; Sarma, Joydeep Sen; Jain, Namit; Zheng Shao; Chakka, Prasad; Ning Zhang; Antony, Suresh; Hao Liu; Murthy, Raghotham
Format:	Conference proceeding
Language:	eng
Subjects:	Companies; Data warehouses; Facebook; Hardware; Libraries; Open source software; Plugs; Query processing; Statistics; Warehousing
Online access:	Order full text

container_end_page	1005
container_issue
container_start_page	996
container_title
container_volume
creator	Thusoo, Ashish; Sarma, Joydeep Sen; Jain, Namit; Zheng Shao; Chakka, Prasad; Ning Zhang; Antony, Suresh; Hao Liu; Murthy, Raghotham
description	The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source map-reduce implementation which is being used in companies like Yahoo, Facebook etc. to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. In this paper, we present Hive, an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language - HiveQL, which are compiled into map-reduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug in custom map-reduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog - Metastore - that contains schemas and statistics, which are useful in data exploration, query optimization and query compilation. In Facebook, the Hive warehouse contains tens of thousands of tables and stores over 700TB of data and is being used extensively for both reporting and ad-hoc analyses by more than 200 users per month.
doi_str_mv	10.1109/ICDE.2010.5447738
format	Conference Proceeding
fulltext	fulltext_linktorsrc
identifier	ISSN: 1063-6382
ispartof	2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), 2010, p.996-1005
issn	1063-6382 2375-026X
language	eng
recordid	cdi_ieee_primary_5447738
source	IEEE Electronic Library (IEL) Conference Proceedings
subjects	Companies; Data warehouses; Facebook; Hardware; Libraries; Open source software; Plugs; Query processing; Statistics; Warehousing
title	Hive - a petabyte scale data warehouse using Hadoop
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-11T11%3A56%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Hive%20-%20a%20petabyte%20scale%20data%20warehouse%20using%20Hadoop&rft.btitle=2010%20IEEE%2026th%20International%20Conference%20on%20Data%20Engineering%20(ICDE%202010)&rft.au=Thusoo,%20Ashish&rft.date=2010-03&rft.spage=996&rft.epage=1005&rft.pages=996-1005&rft.issn=1063-6382&rft.eissn=2375-026X&rft.isbn=142445445X&rft.isbn_list=9781424454457&rft_id=info:doi/10.1109/ICDE.2010.5447738&rft_dat=%3Cieee_6IE%3E5447738%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=1424454441&rft.eisbn_list=1424454468&rft.eisbn_list=9781424454440&rft.eisbn_list=9781424454464&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=5447738&rfr_iscdi=true