Hive Optimization

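Local mode: set hive.exec.mode.local.auto=true; Note that hive.exec.mode.local.auto.inputbytes.max defaults to 128 MB and is the maximum total input size for which local mode is used; if the input is larger than this value, the job still runs on the cluster.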
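As a minimal sketch of the local-mode settings, the session below turns on automatic local mode and writes the size limit out explicitly. The table name small_log is hypothetical, and hive.exec.mode.local.auto.input.files.max is an additional property not mentioned in these notes (it caps the number of input files), so check it against your Hive version.

```sql
-- Let Hive decide per query whether to run locally instead of on the cluster
SET hive.exec.mode.local.auto=true;

-- Upper bound on total input size for local mode (134217728 bytes = 128 MB)
SET hive.exec.mode.local.auto.inputbytes.max=134217728;

-- Assumption: extra limit on the number of input files, not covered in the notes
SET hive.exec.mode.local.auto.input.files.max=4;

-- small_log is a hypothetical table small enough to qualify for local mode
SELECT COUNT(*) FROM small_log;
```

If the input exceeds any of the limits, the same query is simply submitted to the cluster as usual.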
Cluster mode

When Hive evaluates a join, put the small table on the left side of the JOIN.
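The reason is that in a regular (reduce-side) join Hive buffers the rows of the earlier tables and streams the last one, so the large table is best placed last. A small sketch, with hypothetical tables dim_city (small) and fact_orders (large):

```sql
-- dim_city and fact_orders are hypothetical table names used for illustration.
-- The small table comes first (left), the large table last so it is streamed.
SELECT c.city_name, o.amount
FROM dim_city c
JOIN fact_orders o
  ON c.city_id = o.city_id;

-- If the table order cannot be changed, the STREAMTABLE hint names
-- the table that should be streamed instead of buffered.
SELECT /*+ STREAMTABLE(o) */ c.city_name, o.amount
FROM fact_orders o
JOIN dim_city c
  ON c.city_id = o.city_id;
```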
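Map Join: the join is performed on the map side.
There are two ways to use it:
- A SQL hint
- Automatic MapJoin conversion
SQL hint: SELECT /*+ MAPJOIN(smallTable) */ smallTable.key, bigTable.value FROM smallTable JOIN bigTable ON smallTable.key = bigTable.key
Automatic MapJoin conversion
- Enable it with set hive.auto.convert.join=true;
Related parameters
- hive.mapjoin.smalltable.filesize: size threshold below which a table is treated as a small table
- hive.ignore.mapjoin.hint: whether to ignore the MAPJOIN hint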
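The parameters above might be combined in a session roughly as follows. The threshold value shown is the commonly documented default (about 25 MB); treat it as an assumption and verify it for your Hive version. smallTable and bigTable are the placeholder names from the hint example.

```sql
-- Let the optimizer convert eligible joins to map joins automatically
SET hive.auto.convert.join=true;

-- Tables smaller than this many bytes are treated as small tables
-- (25000000 is the commonly documented default; adjust to your memory budget)
SET hive.mapjoin.smalltable.filesize=25000000;

-- Set to false only if explicit /*+ MAPJOIN(...) */ hints should be honored
SET hive.ignore.mapjoin.hint=false;

-- With automatic conversion on, no hint is needed in the query itself
SELECT smallTable.key, bigTable.value
FROM smallTable
JOIN bigTable
  ON smallTable.key = bigTable.key;
```

Running EXPLAIN on the query is a simple way to check whether a map join operator was actually chosen.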

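Map-side aggregation is enabled with the following setting: set hive.map.aggr=true;
Related parameters:
- hive.groupby.mapaggr.checkinterval: number of rows processed on the map side before checking whether GROUP BY aggregation is worthwhile (default: 100000)
- hive.map.aggr.hash.min.reduction: minimum reduction ratio required to keep aggregating. The first 100000 rows are aggregated as a trial; if (rows after aggregation) / 100000 is greater than this value (0.5 by default), map-side aggregation is turned off
- hive.map.aggr.hash.percentmemory: maximum fraction of memory that map-side aggregation may use
- hive.map.aggr.hash.force.flush.memory.threshold: maximum memory available to the hash table during map-side aggregation; exceeding it triggers a flush
- hive.groupby.skewindata: whether to optimize for data skew caused by GROUP BY (default: false)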
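A sketch of how these settings might be applied to a skewed GROUP BY is shown below. The table page_views and its column user_id are hypothetical; the numeric values are the defaults quoted above.

```sql
-- Partial aggregation in the mappers, similar in effect to a combiner
SET hive.map.aggr=true;

-- Sample size used to decide whether map-side aggregation pays off
SET hive.groupby.mapaggr.checkinterval=100000;

-- Turn map-side aggregation off if the trial reduces rows by less than half
SET hive.map.aggr.hash.min.reduction=0.5;

-- For skewed keys: Hive plans two jobs, the first spreading rows randomly
-- across reducers for partial aggregation, the second merging by key
SET hive.groupby.skewindata=true;

-- page_views and user_id are hypothetical names used for illustration
SELECT user_id, COUNT(*) AS views
FROM page_views
GROUP BY user_id;
```

Enabling hive.groupby.skewindata adds an extra job, so it tends to help only when a few keys dominate the data.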
