Increase number of Hive mappers in Hadoop 2
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/30222032/
Asked by Marsellus Wallace
I created an HBase table from Hive and I'm trying to run a simple aggregation on it. This is my Hive query:
from my_hbase_table
select col1, count(1)
group by col1;
The MapReduce job spawns only 2 mappers and I'd like to increase that. With a plain MapReduce job I would configure YARN and mapper memory to increase the number of mappers. I tried the following in Hive, but it did not work:
set yarn.nodemanager.resource.cpu-vcores=16;
set yarn.nodemanager.resource.memory-mb=32768;
set mapreduce.map.cpu.vcores=1;
set mapreduce.map.memory.mb=2048;
NOTE:
- My test cluster has only 2 nodes
- The HBase table has more than 5M records
- Hive logs show HiveInputFormat and a number of splits=2
Answered by Partha Kaushik
Reduce the input split size from its default value and the number of mappers will increase.
SET mapreduce.input.fileinputformat.split.maxsize;
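As written, SET with no value only prints the current setting; to actually lower it, assign a value below the default (a sketch, the 32 MB figure being illustrative and not from the original answer):

SET mapreduce.input.fileinputformat.split.maxsize=33554432;  -- 32 MB: smaller splits, hence more mappers
SET mapreduce.input.fileinputformat.split.minsize=1;         -- make sure a large minimum is not forcing bigger splits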
Answered by Sandeep Singh
Splitting files below the default size is not an efficient solution. Splitting is mainly useful when dealing with large datasets, and the default value is already small, so it is not worth splitting further.
I would recommend the following configuration before your query. You can tune it based on your input data.
set hive.merge.mapfiles=false;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapred.map.tasks = XX;
If you also want to set the number of reducers, you can use the configuration below:
set mapred.reduce.tasks = XX;
Note that on Hadoop 2 (YARN), mapred.map.tasks and mapred.reduce.tasks are deprecated and replaced by other variables:
mapred.map.tasks --> mapreduce.job.maps
mapred.reduce.tasks --> mapreduce.job.reduces
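On YARN, then, the equivalent settings would look like the following (a sketch; XX is a placeholder as above, and for most input formats the map count is only a hint, since the number of splits ultimately decides it):

set mapreduce.job.maps = XX;
set mapreduce.job.reduces = XX;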
Please refer to the useful links below related to this:
http://answers.mapr.com/questions/5336/limit-mappers-and-reducers-for-specific-job.html
Fail to Increase Hive Mapper Tasks?
How mappers get assigned
The number of mappers is determined by the number of splits produced by the InputFormat used in the MapReduce job. For a typical InputFormat, it is directly proportional to the number of files and their sizes.
Suppose your HDFS block size is configured as 64 MB (the default) and you have a 100 MB file; it will occupy 2 blocks, and 2 mappers will be assigned based on those blocks.
But if you have 2 files of 30 MB each, then each file will occupy one block and a mapper will be assigned accordingly.
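In other words, mappers per file is roughly ceil(file size / block size). A throwaway Hive query to check the first example (purely illustrative; assumes a Hive version that allows SELECT without FROM):

select ceil(100 / 64);  -- one 100 MB file over 64 MB blocks -> 2 blocks -> 2 mappers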
When you are working with a large number of small files, Hive uses CombineHiveInputFormat by default. In terms of MapReduce, it ultimately translates to using CombineFileInputFormat, which creates virtual splits over multiple files, grouped by common node and rack when possible. The size of the combined split is determined by
mapred.max.split.size
or
mapreduce.input.fileinputformat.split.maxsize (in YARN/MR2)
So if you want fewer splits (fewer mappers), you need to set this parameter higher.
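For example (a sketch; the 256 MB cap is illustrative): with the combine input format, a higher cap packs more small files into each split, so fewer mappers run:

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;  -- Hive's default input format
set mapreduce.input.fileinputformat.split.maxsize=268435456;                -- ~256 MB per combined split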
This link may be useful for understanding this in more depth:
What is the default size that each Hadoop mapper will read?
Also, the number of mappers and reducers always depends on the available mapper and reducer slots of your cluster.
Answered by Venkat
Splitting the HBase table should get your job to use more mappers automatically.
Since you have 2 splits, each split is read by one mapper. Increase the number of splits.
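For example, from the HBase shell (not Hive; a sketch assuming the table is named my_hbase_table as in the question), each region becomes one input split, so splitting regions raises the mapper count:

split 'my_hbase_table'

After the regions have split, re-running the Hive query should show more than 2 splits in the logs.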