This article takes a detailed look at several Hive optimization techniques through examples. I found them quite practical, so I am sharing them here as a reference; I hope you get something useful out of it.
1. When can Hive avoid running MapReduce? For a plain `select *`, or when the `where` clause filters only on partition columns, Hive can answer the query by reading the data files directly (a fetch task) without launching a MapReduce job.
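As a sketch (assuming, hypothetically, that the `wlan` table used later in this article is partitioned by a `dt` column):

```sql
-- A plain SELECT * is served as a direct file read: no MapReduce job.
select * from wlan;

-- Filtering only on the partition column dt is resolved by partition
-- pruning against the metastore: still no MapReduce job.
select * from wlan where dt = '2015-06-11';

-- Filtering on a non-partition column forces a MapReduce job.
select * from wlan where hostname = 'crxy176';
```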
2. Join optimization. The driving (streamed) table should be rightmost: list the tables in the join in increasing size from left to right. Alternatively, use a hint mechanism to explicitly tell the query optimizer which table is the big one: /*+ STREAMTABLE(table_name) */
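A minimal sketch of both styles (the table names `dim_user` and `fact_click` are made up for illustration):

```sql
-- Style 1: rely on the default rule. The rightmost table in the join
-- is streamed, so put the large fact table last.
select u.name, count(1)
from dim_user u join fact_click c on u.id = c.user_id
group by u.name;

-- Style 2: use the hint to name the streamed (big) table explicitly,
-- regardless of its position in the FROM clause.
select /*+ STREAMTABLE(c) */ u.name, count(1)
from fact_click c join dim_user u on u.id = c.user_id
group by u.name;
```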
3. Map-side aggregation. `set hive.map.aggr=true;` pushes the top level of an aggregation down into the map phase, which cuts the amount of data shuffled to the reducers and shortens the reduce phase, improving overall performance. Drawback: this setting consumes more memory. Try it with: `select count(1) from wlan;`
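The memory cost can be tuned with a couple of related knobs; the values below are the usual defaults, shown here only as a sketch (check them against your Hive version):

```sql
-- Enable map-side partial aggregation (an in-memory hash map per mapper).
set hive.map.aggr = true;
-- Flush the hash map when it exceeds this fraction of mapper memory.
set hive.map.aggr.hash.percentmemory = 0.5;
-- After this many input rows, check whether map-side aggregation is
-- actually reducing cardinality enough to be worth the memory.
set hive.groupby.mapaggr.checkinterval = 100000;

select count(1) from wlan;
```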
4. Local Hadoop (local mode): `SET mapred.job.tracker=local;` Test with: `select 1 from wlan limit 5;`
The two parameters below are commonly used to control local MR:
1. hive.exec.mode.local.auto.inputbytes.max sets the maximum input size for local MR; when the input data is smaller than this value, the query runs as local MR.
2. hive.exec.mode.local.auto.tasks.max sets the maximum number of input files for local MR; when there are fewer input files than this value, the query runs as local MR. Default execution mode:
hive (default)> select count(1) from t1;
Query ID = root_20150611185656_333185b7-e8b3-40b5-bc4c-2f11978f9822
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1433931422330_0001, Tracking URL = http://crxy176:8088/proxy/application_1433931422330_0001/
Kill Command = /usr/local/hadoop-2.6.0/bin/hadoop job -kill job_1433931422330_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-06-11 18:56:44,749 Stage-1 map = 0%, reduce = 0%
2015-06-11 18:56:57,029 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.8 sec
2015-06-11 18:57:11,050 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.2 sec
MapReduce Total cumulative CPU time: 4 seconds 200 msec
Ended Job = job_1433931422330_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.2 sec HDFS Read: 312 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 200 msec
OK
t1
100
Time taken: 46.573 seconds, Fetched: 1 row(s)
For comparison, the same query with local Hadoop mode:
hive (default)> select count(1) from t1;
Automatically selecting local only mode for query
Query ID = root_20150611185555_97e1a1d0-1958-4f35-8ea7-8face4cda85f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2015-06-11 18:55:25,123 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1510342541_0004
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 12 HDFS Write: 22 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
t1
100
Time taken: 1.721 seconds, Fetched: 1 row(s)
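Putting the two parameters above together, a minimal setup for automatic local mode looks like this (the thresholds are illustrative, not recommendations):

```sql
-- Let Hive decide automatically when a query is small enough for local mode.
set hive.exec.mode.local.auto = true;
-- Use local MR when total input is under ~128 MB...
set hive.exec.mode.local.auto.inputbytes.max = 134217728;
-- ...and there are at most 4 input files.
set hive.exec.mode.local.auto.tasks.max = 4;

select count(1) from t1;
```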
5. Indexes. Hive's index architecture exposes an interface, so you can implement your own index against it. Hive ships with a reference implementation (CompactIndex), and a bitmap index was added later in version 0.8. Here we cover CompactIndex.
/* Create an index on the id column of index_test_table */
create index idx on table index_test_table(id)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;
alter index idx on index_test_table rebuild;
/* Prune the index: take the index table built above and trim it down
   to the predicate your final query will actually use. */
/* Unlike an RDBMS, you cannot use the index directly right after building it;
   doing so fails with an error, which is part of what makes this cumbersome. */
create table my_index
as select _bucketname, `_offsets`
from default__index_test_table_idx__ where id = 10;
/* Now the index can be used; note the final query predicate must match
   the pruning predicate above. */
set hive.index.compact.file = /user/hive/warehouse/my_index;
set hive.input.format = org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
select count(*) from index_test_table where id = 10;
6. Data skew. Data skew means that, because the data is unevenly distributed, a few individual values account for most of the rows; combined with Hadoop's computation model, this leaves compute resources unevenly loaded and degrades performance.
Skew falls into two kinds: skew caused by group by and skew caused by join:
The first parameter is hive.map.aggr, whose default is already true, meaning Hive runs a map-side combiner. So if your group by query only does count(*), you will hardly see any skew effect; if it does count(distinct), some skew will still show.
The other parameter is hive.groupby.skewindata. With it set, the reduce step no longer sends all rows with the same key to a single reducer; keys are distributed randomly, each reducer does a partial aggregation, and then a second MR round takes the partially aggregated data and computes the final result. So this parameter does much the same thing as hive.map.aggr, only on the reduce side, and it launches an extra job, so it is not really recommended; the effect is not obvious.
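The two-round scheme above is just a flag flip (a sketch; whether it actually helps depends on how skewed your keys are):

```sql
-- First MR round: rows are scattered randomly across reducers for a
-- partial aggregation; a second round merges the partial results.
set hive.groupby.skewindata = true;

select a, count(distinct b) as c from tbl group by a;
```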
Rewriting the SQL to optimize:
/* before */
select a, count(distinct b) as c from tbl group by a;
/* after (note Hive requires an alias on the subquery) */
select a, count(*) as c
from (select distinct a, b from tbl) t group by a;
7. Inter-job parallelism
First, among the multiple jobs Hive generates, there are cases where the jobs can run in parallel; the typical case is subqueries. When you need to execute several subqueries combined with union all or a join, inter-job parallelism applies. For example, the code below sketches a parallelizable scenario:
hive> FROM t4
INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...
INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...
INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...
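Inter-job parallelism has to be switched on explicitly; a minimal sketch (the thread count shown is the usual default, but verify it for your version):

```sql
-- Allow independent stages of one query to run concurrently.
set hive.exec.parallel = true;
-- Upper bound on how many jobs run at once.
set hive.exec.parallel.thread.number = 8;
```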