Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/3624384/
Sorting large data using MapReduce/Hadoop
Asked by Chander Shivdasani
I am reading about MapReduce and the following thing is confusing me.
Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows:
Write a mapper function that sorts integers. The framework will divide the input file into multiple chunks and give them to different mappers. Each mapper will sort its chunk of data independently of the others. Once all the mappers are done, we will pass each of their results to the Reducer, and it will combine the results and give me the final output.
My doubt is: if we have one reducer, then how does it leverage the distributed framework if, eventually, we have to combine the result at one place? The problem drills down to merging 1 million entries at one place. Is that so, or am I missing something?
Thanks, Chander
Accepted answer by Peter Tillemans
Check out merge-sort.
It turns out that sorting partially sorted lists is much more efficient in terms of operations and memory consumption than sorting the complete list.
If the reducer gets 4 sorted lists, it only needs to look for the smallest element of the 4 lists and pick that one. If the number of lists is constant, this reduce step is an O(N) operation.
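For illustration, here is a minimal, self-contained Java sketch of that k-way merge (not Hadoop code, just the merge step a reducer relies on): repeatedly pick the smallest head element across the sorted lists, using a heap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Merges k already-sorted lists by repeatedly taking the smallest head
// element. With a heap this is O(N log k), i.e. O(N) for constant k.
public class KWayMerge {

    private static class Head {
        final int value;
        final Iterator<Integer> rest;
        Head(int value, Iterator<Integer> rest) { this.value = value; this.rest = rest; }
    }

    public static List<Integer> merge(List<List<Integer>> sortedLists) {
        PriorityQueue<Head> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.value, b.value));
        for (List<Integer> list : sortedLists) {
            Iterator<Integer> it = list.iterator();
            if (it.hasNext()) heap.add(new Head(it.next(), it));
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();  // smallest current head across all lists
            merged.add(smallest.value);
            if (smallest.rest.hasNext()) {
                heap.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList(
                Arrays.asList(1, 4, 9),
                Arrays.asList(2, 3, 11),
                Arrays.asList(5, 6, 7))));  // [1, 2, 3, 4, 5, 6, 7, 9, 11]
    }
}
```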
Typically the reducers are also "distributed" in something like a tree, so the work can be parallelized too.
Answered by Gopi
I think combining multiple sorted items is more efficient than combining multiple unsorted items. So the mappers do the task of sorting chunks and the reducer merges them. Had the mappers not done the sorting, the reducer would have a tough time doing the sort.
Answered by SquareCog
As others have mentioned, merging is much simpler than sorting, so there's a big win there.
However, doing an O(N) serial operation on a giant dataset can be prohibitive, too. As you correctly point out, it's better to find a way to do the merge in parallel, as well.
One way to do this is to replace the random partitioner (which is what's normally used) with something a bit smarter. What Pig does for this, for example, is sample your dataset to come up with a rough approximation of the distribution of your values, and then assign ranges of values to different reducers. Reducer 0 gets all elements < 1000, reducer 1 gets all elements >= 1000 and < 5000, and so on. Then you can do the merge in parallel, and the end result is sorted, since you know the number of each reducer task.
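A minimal sketch of such a range partitioner in Hadoop's Java API, hard-coding the boundaries from the example above (a real implementation, like Pig's, would derive the boundaries by sampling the data; the class name is made up for illustration):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys to reducers by value range instead of by hash, so each
// reducer receives (and outputs) a contiguous slice of the key space.
// Assumes the job is configured with exactly 3 reduce tasks.
public class RangePartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        int k = key.get();
        if (k < 1000) return 0;   // reducer 0: keys < 1000
        if (k < 5000) return 1;   // reducer 1: 1000 <= keys < 5000
        return 2;                 // reducer 2: keys >= 5000
    }
}
```

Concatenating the reducer outputs in partition order (part-r-00000, part-r-00001, ...) then yields a globally sorted result.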
Answered by rOrlig
So the simplest way to sort using map-reduce (though not the most efficient one) is to do the following:
During the map phase, for each (Input_Key, Input_Value) pair, emit (Input_Value, Input_Key)
The Reducer is an identity reducer
So for example, if our data is a (student, age) database, then your mapper input would be ('A', 1), ('B', 2), ('C', 10) ... and the output would be (1, 'A'), (2, 'B'), (10, 'C')
Haven't tried this logic out, but it is a step in a homework problem I am working on. Will post an update with a source code / logic link.
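A minimal, untested sketch of this value-swap sort with Hadoop's Java API (the class names and the comma-separated "name,age" input format are assumptions for illustration):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sort-by-swap: the mapper inverts (name, age) records to (age, name) so
// the shuffle phase sorts by age; the reducer just passes everything through.
public class SwapSort {

    public static class SwapMapper extends Mapper<Object, Text, IntWritable, Text> {
        @Override
        protected void map(Object key, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumes lines like "A,1" (name,age); adjust parsing to your input.
            String[] parts = line.toString().split(",");
            ctx.write(new IntWritable(Integer.parseInt(parts[1].trim())),
                      new Text(parts[0].trim()));
        }
    }

    public static class IdentityReducer
            extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable age, Iterable<Text> names, Context ctx)
                throws IOException, InterruptedException {
            for (Text name : names) {
                ctx.write(age, name);  // emit keys in shuffle-sorted order
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "swap sort");
        job.setJarByClass(SwapSort.class);
        job.setMapperClass(SwapMapper.class);
        job.setReducerClass(IdentityReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(1);  // one reducer => one globally sorted output file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With a single reducer, the shuffle's sort-by-key gives one globally ordered output file; with multiple reducers you would also need a range partitioner like those discussed in the other answers.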
Answered by Alok Nayak
Sorry for being late, but for future readers: yes, Chander, you are missing something.
The logic is that a reducer can handle only the shuffled and then sorted data of the node on which it is running. I mean, a reducer that runs at one node can't look at another node's data; it applies the reduce algorithm to its own data only. So the merging procedure of merge sort can't be applied across nodes.
So for big data we use TeraSort, which is nothing but an identity mapper and reducer with a custom partitioner. You can read more about it here: Hadoop's implementation for TeraSort. It states:
"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N ? 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i ? 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."
Answered by pr-pal
Sorting can be efficiently implemented using MapReduce. But you seem to be thinking about implementing merge-sort using mapreduce to achieve this purpose. It may not be the ideal candidate.
Like you alluded to, the mergesort (with map-reduce) would involve the following steps:
- Partition the elements into small groups and assign each group to the mappers in a round-robin manner
- Each mapper will sort its subset and return {K, {subset}}, where K is the same for all the mappers
- Since the same K is used across all mappers, there is only one reduce key and hence only one reducer. The reducer can merge the data and return the sorted result
The problem here is that, like you mentioned, there can be only one reducer, which precludes parallelism during the reduce phase. Like it was mentioned in other replies, MapReduce-specific implementations like TeraSort can be considered for this purpose.
Found the explanation at http://www.chinacloud.cn/upload/2014-01/14010410467139.pdf
Coming back to merge-sort, this would be feasible if the Hadoop (or equivalent) tool provided a hierarchy of reducers, where the output of one level of reducers goes to the next level of reducers, or loops back to the same set of reducers.