java 如何计算Hadoop Map-Reduce中一组数据的中心移动平均值?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/12455088/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-31 08:57:34  来源:igfitidea点击:

How to calculate Centered Moving Average of a set of data in Hadoop Map-Reduce?

javahadoopmapreduce

提问by saurabh shashank

I want to calculate Centered Moving average of a set of Data .

我想计算一组 Data 的中心移动平均值。

Example Input format :

Example Input format :

quarter | sales      
Q1'11   | 9            
Q2'11   | 8
Q3'11   | 9
Q4'11   | 12
Q1'12   | 9
Q2'12   | 12
Q3'12   | 9
Q4'12   | 10

Mathematical Representation of data and calculation of Moving average and then centered moving average

数据的数学表示和移动平均线的计算,然后是中心移动平均线

Period   Value   MA  Centered
1          9
1.5
2          8
2.5              9.5
3          9            9.5
3.5              9.5
4          12           10.0
4.5              10.5
5          9            10.750
5.5              11.0
6          12
6.5
7          9  

I am stuck with the implementation of RecordReader which will provide mapper sales value of a year i.e. of four quarter. The RecordReader Problem Question Thread Thanks

我坚持 RecordReader 的实施,它将提供一年的映射器销售价值,即四个季度。 RecordReader 问题问题线程谢谢

回答by Joe K

This is actually totally doable in the MapReduce paradigm; it does not have to be though of as a 'sliding window'. Instead think of the fact that each data point is relevant to a max of four MA calculations, and remember that each call to the map function can emit more than one key-value pair. Here is pseudo-code:

这在 MapReduce 范式中实际上是完全可行的;它不必被视为“滑动窗口”。相反,考虑每个数据点与最多四个 MA 计算相关的事实,并记住对 map 函数的每次调用可以发出多个键值对。这是伪代码:

First MR job:

map(quarter, sales)
    emit(quarter - 1.5, sales)
    emit(quarter - 0.5, sales)
    emit(quarter + 0.5, sales)
    emit(quarter + 1.5, sales)

reduce(quarter, list_of_sales)
    if (list_of_sales.length == 4):
        emit(quarter, average(list_of_sales))
    endif


Second MR job:

map(quarter, MA)
    emit(quarter - 0.5, MA)
    emit(quarter + 0.5, MA)

reduce(quarter, list_of_MA)
    if (list_of_MA.length == 2):
        emit(quarter, average(list_of_sales))
    endif

回答by David Gruzman

In best of my understanding moving average is not nicely maps to MapReduce paradigm since its calculation is essentially "sliding window" over sorted data, while MR is processing of non-intersected ranges of sorted data.
Solution I do see is as following:
a) To implement custom partitioner to be able to make two different partitions in two runs. In each run your reducers will get different ranges of data and calculate moving average where approprieate
I will try to illustrate:
In first run data for reducers should be:
R1: Q1, Q2, Q3, Q4
R2: Q5, Q6, Q7, Q8
...

在我的理解中,移动平均并不能很好地映射到 MapReduce 范式,因为它的计算本质上是对排序数据的“滑动窗口”,而 MR 正在处理排序数据的非相交范围。
我看到的解决方案如下:
a) 实现自定义分区器,以便能够在两次运行中创建两个不同的分区。在每次运行中,您的减速器将获得不同范围的数据并计算适当的移动平均值
我将尝试说明:
减速器的第一次运行数据应为:
R1:Q1、Q2、Q3、Q4
R2:Q5、Q6、Q7、Q8
...

here you will cacluate moving average for some Qs.

在这里,您将计算某些 Q 的移动平均线。

In next run your reducers should get data like: R1: Q1...Q6
R2: Q6...Q10
R3: Q10..Q14

在下一次运行中,您的减速器应该得到如下数据: R1: Q1...Q6
R2: Q6...Q10
R3: Q10..Q14

And caclulate the rest of moving averages.
Then you will need to aggregate results.

并计算其余的移动平均线。
然后,您将需要汇总结果。

Idea of custom partitioner that it will have two modes of operation - each time dividing into equal ranges but with some shift. In a pseudocode it will look like this :
partition = (key+SHIFT) / (MAX_KEY/numOfPartitions) ;
where: SHIFT will be taken from the configuration.
MAX_KEY = maximum value of the key. I assume for simplicity that they start with zero.

自定义分区器的想法是它将有两种操作模式 - 每次划分为相等的范围但有一些变化。在伪代码中,它看起来像这样:
partition = (key+SHIFT) / (MAX_KEY/numOfPartitions) ;
其中: SHIFT 将从配置中获取。
MAX_KEY = 键的最大值。为简单起见,我假设它们从零开始。

RecordReader, IMHO is not a solution since it is limited to specific split and can not slide over split's boundary.

RecordReader,恕我直言不是解决方案,因为它仅限于特定的拆分并且不能滑过拆分的边界。

Another solution would be to implement custom logic of splitting input data (it is part of the InputFormat). It can be done to do 2 different slides, similar to partitioning.

另一种解决方案是实现拆分输入数据的自定义逻辑(它是 InputFormat 的一部分)。可以做2张不同的幻灯片,类似于分区。