Java: Change File Split Size in Hadoop

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/9678180/

Date: 2020-08-16 06:03:20 | Source: igfitidea

Change File Split size in Hadoop

java, hadoop, mapreduce, distributed-computing

Asked by Ahmedov

I have a bunch of small files in an HDFS directory. Although the volume of the files is relatively small, the amount of processing time per file is huge. That is, a 64mb file, which is the default split size for TextInputFormat, can take several hours to process.

What I need to do is reduce the split size, so that I can utilize even more nodes for the job.

So the question is, how is it possible to split the files by, let's say, 10kb? Do I need to implement my own InputFormat and RecordReader for this, or is there any parameter to set? Thanks.

Accepted answer by Brainlag

The parameter mapred.max.split.size, which can be set per job individually, is what you are looking for. Don't change dfs.block.size, because this is global for HDFS and can lead to problems.
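
For example, a minimal driver-side sketch, assuming the Hadoop 2.x mapreduce API (the 10kb value and the job name are illustrative placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// ....

final Configuration conf = new Configuration();

// Per-job maximum split size of 10kb; dfs.block.size stays untouched.
conf.setLong("mapred.max.split.size", 10 * 1024L);
// On Hadoop 2.x the same setting is spelled:
// conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 10 * 1024L);

final Job job = Job.getInstance(conf, "small-split-job");

If your driver goes through ToolRunner, the same property can also be passed on the command line as -D mapred.max.split.size=10240.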

Answered by Alexander Verbitsky

"Hadoop: The Definitive Guide", p. 202:

Given a set of files, how does FileInputFormat turn them into splits? FileInputFormat splits only large files. Here “large” means larger than an HDFS block. The split size is normally the size of an HDFS block.

So you would have to change the size of the HDFS block, but this is the wrong way. Maybe you should instead review the architecture of your MapReduce application.

Answered by Ahmedov

Hadoop: The Definitive Guide, page 203: "The maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block. The split size is calculated by the formula:

max(minimumSize, min(maximumSize, blockSize))

by default

minimumSize < blockSize < maximumSize

so the split size is blockSize

For example,

Minimum Split Size   1
Maximum Split Size   32mb
Block Size           64mb
Split Size           32mb
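
To sanity-check the formula with the example values above (sizes in bytes):

// max(minimumSize, min(maximumSize, blockSize)) with the example values above.
long minimumSize = 1L;
long maximumSize = 32L * 1024 * 1024; // 32mb
long blockSize   = 64L * 1024 * 1024; // 64mb

long splitSize = Math.max(minimumSize, Math.min(maximumSize, blockSize));
// splitSize == 33554432 bytes (32mb): the maximum split size wins because it is smaller than the block size.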

Hadoop works better with a small number of large files than a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the file is very small ("small" means significantly smaller than an HDFS block) and there are a lot of them, then each map task will process very little input, and there will be a lot of them (one per file), each of which imposes extra bookkeeping overhead. Compare a 1gb file broken into sixteen 64mb blocks with 10,000 or so 100kb files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file and 16 map tasks.

Answered by Mahendran Ponnusamy

Write a custom input format that extends CombineFileInputFormat (it has its own pros and cons depending on the Hadoop distribution), which combines the input splits up to the value specified in mapred.max.split.size.
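
As a sketch of the same idea without writing the record reader yourself: on Hadoop 2.x the stock CombineTextInputFormat (a subclass of CombineFileInputFormat) can be plugged in from the driver, with the packing limit set per job (the 128mb cap below is only an illustrative value):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

// ....

final Configuration conf = new Configuration();
final Job job = Job.getInstance(conf, "combine-small-files");

// Pack many small files into each split, up to 128mb per split.
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

If you do need a custom per-file RecordReader, the usual pattern is to extend CombineFileInputFormat and return a CombineFileRecordReader from createRecordReader().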

Answered by Roman Nikitchenko

Here is a fragment which illustrates the correct way to do what is needed here without magic configuration strings. The needed constant is defined inside FileInputFormat. The block size can be taken from the default HDFS block constant if needed, but there is a pretty good probability that it is user defined.

Here I just divide the maximum split size by 2 if it is defined.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// ....

final long DEFAULT_SPLIT_SIZE = 128 * 1024 * 1024;
final Configuration conf = ...

// We need to lower the input split size by a factor of two.
// FileInputFormat.SPLIT_MAXSIZE is the constant behind the maximum-split-size
// property ("mapreduce.input.fileinputformat.split.maxsize" on Hadoop 2.x);
// fall back to DEFAULT_SPLIT_SIZE when the user has not configured it.
conf.setLong(
    FileInputFormat.SPLIT_MAXSIZE,
    conf.getLong(
        FileInputFormat.SPLIT_MAXSIZE, DEFAULT_SPLIT_SIZE) / 2);