Difference between reduce and foldLeft/fold in functional programming (particularly Scala and Scala APIs)?

Warning: the content below is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/25158780/

Asked by samthebest
Why do Scala and frameworks like Spark and Scalding have both reduce and foldLeft? So then what's the difference between reduce and fold?
Answered by samthebest
reduce vs foldLeft
A big difference, not clearly mentioned in any other StackOverflow answer relating to this topic, is that reduce should be given a commutative monoid, i.e. an operation that is both commutative and associative. This means the operation can be parallelized.
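For example, integer addition with 0 forms a commutative monoid (a quick sketch in plain Scala; the names here are purely illustrative):

// Integer addition with identity 0 is a commutative monoid.
val plus: (Int, Int) => Int = _ + _
assert(plus(plus(1, 2), 3) == plus(1, plus(2, 3))) // associative
assert(plus(1, 2) == plus(2, 1))                   // commutative
assert(plus(7, 0) == 7 && plus(0, 7) == 7)         // 0 is the identity element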
This distinction is very important for Big Data / MPP / distributed computing, and is the entire reason why reduce even exists. The collection can be chopped up and the reduce can operate on each chunk, then the reduce can operate on the results of each chunk - in fact the level of chunking need not stop one level deep. We could chop up each chunk too. This is why summing integers in a list is O(log N) if given an infinite number of CPUs.
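As a rough sketch of that chunking idea (standard library only; the chunk size and names are arbitrary), associativity is what lets each chunk be reduced independently and the partial results combined afterwards:

// Reduce each chunk independently, then reduce the partial results.
// With enough CPUs the combine step only needs O(log N) rounds.
val xs = (1 to 10000).toVector
val chunks = xs.grouped(1000).toVector
val partials = chunks.map(_.reduce(_ + _)) // each chunk could run as its own task
val total = partials.reduce(_ + _)         // combining partials is itself a reduce
assert(total == xs.sum)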
If you just look at the signatures there is no reason for reduce to exist, because you can achieve everything you can with reduce using a foldLeft. The functionality of foldLeft is strictly greater than the functionality of reduce.
But you cannot parallelize a foldLeft, so its runtime is always O(N) (even if you feed in a commutative monoid). This is because it is assumed the operation is not a commutative monoid, and so the accumulated value will be computed by a series of sequential aggregations.
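Concretely, foldLeft threads the accumulator strictly left to right from the zero value, which is easiest to see with an operation that is neither commutative nor associative (plain Scala):

// ((0 - 1) - 2) - 3 == -6: the order of accumulation is fixed.
assert(List(1, 2, 3).foldLeft(0)(_ - _) == -6)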
foldLeft does not assume commutativity or associativity. It's associativity that gives the ability to chop up the collection, and it's commutativity that makes accumulating easy, because order is not important (so it doesn't matter in which order to aggregate each of the results from each of the chunks). Strictly speaking, commutativity is not necessary for parallelization (consider distributed sorting algorithms, for example); it just makes the logic easier because you don't need to give your chunks an ordering.
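String concatenation is a handy example of an operation that is associative but not commutative: chunks can still be reduced independently, but the chunk results must be recombined in their original order (plain Scala sketch):

val concat: (String, String) => String = _ + _
assert(concat(concat("a", "b"), "c") == concat("a", concat("b", "c"))) // associative
assert(concat("a", "b") != concat("b", "a"))                           // not commutative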
If you have a look at the Spark documentation for reduce it specifically says "... commutative and associative binary operator":

http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD
Here is proof that reduce is NOT just a special case of foldLeft:
scala> val intParList: ParSeq[Int] = (1 to 100000).map(_ => scala.util.Random.nextInt()).par
scala> timeMany(1000, intParList.reduce(_ + _))
Took 462.395867 milli seconds
scala> timeMany(1000, intParList.foldLeft(0)(_ + _))
Took 2589.363031 milli seconds
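(The timeMany helper is not shown in the answer; a minimal sketch of what such a helper could look like, assuming Scala 2.12, where parallel collections and ParSeq live in the standard library - in 2.13+ they are in the separate scala-parallel-collections module:)

import scala.collection.parallel.ParSeq

// Hypothetical helper in the spirit of the benchmark above (not from the original answer):
// run the expression n times and report the total wall-clock time.
def timeMany[A](n: Int, body: => A): Unit = {
  val start = System.nanoTime()
  (1 to n).foreach(_ => body)
  println(s"Took ${(System.nanoTime() - start) / 1e6} milli seconds")
}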
reduce vs fold
Now this is where it gets a little closer to the FP / mathematical roots, and a little trickier to explain. Reduce is defined formally as part of the MapReduce paradigm, which deals with orderless collections (multisets); Fold is formally defined in terms of recursion (see catamorphism) and thus assumes a structure / sequence to the collections.
There is no fold method in Scalding because under the (strict) MapReduce programming model we cannot define fold: chunks do not have an ordering, and fold only requires associativity, not commutativity.
Put simply, reduce works without an order of accumulation, while fold requires an order of accumulation, and it is that order of accumulation that necessitates a zero value, NOT the existence of the zero value, that distinguishes them. Strictly speaking reduce should work on an empty collection, because its zero value can be deduced by taking an arbitrary value x and then solving x op y = x, but that doesn't work with a non-commutative operation, as there can exist left and right zero values that are distinct (i.e. x op y != y op x). Of course Scala doesn't bother to work out what this zero value is, as that would require doing some mathematics (which is probably uncomputable), so it just throws an exception.
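On ordinary Scala collections this is easy to observe (a minimal sketch): fold gets its zero value from the caller, so an empty collection is fine, whereas reduce has nothing to return and throws:

assert(List.empty[Int].fold(0)(_ + _) == 0) // zero value supplied, empty collection is fine
// List.empty[Int].reduce(_ + _)            // throws java.lang.UnsupportedOperationException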
It seems (as is often the case in etymology) that this original mathematical meaning has been lost, since the only obvious difference in programming is the signature. The result is that reduce has become a synonym for fold, rather than preserving its original meaning from MapReduce. Now these terms are often used interchangeably and behave the same in most implementations (ignoring empty collections). Weirdness is exacerbated by peculiarities, like in Spark, that we shall now address.
So Spark does have a fold, but the order in which sub results (one for each partition) are combined (at the time of writing) is the same order in which tasks are completed - and thus non-deterministic. Thanks to @CafeFeed for pointing out that fold uses runJob; after reading through the code I realised that it's non-deterministic. Further confusion is created by Spark having a treeReduce but no treeFold.
Conclusion
There is a difference between reduce and fold even when applied to non-empty sequences. The former is defined as part of the MapReduce programming paradigm on collections with arbitrary order (http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf), and one ought to assume operators are commutative in addition to being associative to give deterministic results. The latter is defined in terms of catamorphisms and requires that the collections have a notion of sequence (or are defined recursively, like linked lists), and thus does not require commutative operators.
In practice, due to the unmathematical nature of programming, reduce and fold tend to behave in the same way, either correctly (like in Scala) or incorrectly (like in Spark).
Extra: My Opinion On the Spark API
My opinion is that confusion would be avoided if use of the term fold was completely dropped in Spark. At least Spark does have a note in its documentation:
This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala.
Answered by Mishael Rosenthal
If I am not mistaken, even though the Spark API does not require it, fold also requires f to be commutative, because the order in which the partitions will be aggregated is not assured. For example, in the following code only the first print out is sorted:
import org.apache.spark.{SparkConf, SparkContext}

object FoldExample extends App {

  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("Simple Application")
  implicit val sc = new SparkContext(conf)

  val range = ('a' to 'z').map(_.toString)
  val rdd = sc.parallelize(range)

  println(range.reduce(_ + _))   // local collection: reduced left to right, sorted output
  println(rdd.reduce(_ + _))     // RDD: partition results combined in no guaranteed order
  println(rdd.fold("")(_ + _))   // RDD fold: same unguaranteed combination order
}
Print out:
abcdefghijklmnopqrstuvwxyz
abcghituvjklmwxyzqrsdefnop
defghinopjklmqrstuvabcwxyz
Answered by Mishael Rosenthal
fold in Apache Spark is not the same as fold on non-distributed collections. In fact it requires a commutative function to produce deterministic results:
This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.
This has been shown by Mishael Rosenthal and suggested by Make42 in his comment. It's been suggested that the observed behavior is related to HashPartitioner, when in fact parallelize doesn't shuffle and doesn't use HashPartitioner.
import org.apache.spark.sql.SparkSession

/* Note: standalone (non-local) mode */
val master = "spark://...:7077"
val spark = SparkSession.builder.master(master).getOrCreate()
val sc = spark.sparkContext

/* Note: deterministic order */
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 4).sortBy(identity[String])
require(rdd.collect.sliding(2).forall { case Array(x, y) => x < y })

/* Note: all possible permutations */
require(Seq.fill(1000)(rdd.fold("")(_ + _)).toSet.size == 24)
Explained:
Structure of fold for RDD
def fold(zeroValue: T)(op: (T, T) => T): T = withScope {
  // simplified skeleton: bodies elided
  var jobResult: T
  val cleanOp: (T, T) => T
  val foldPartition: Iterator[T] => T
  val mergeResult: (Int, T) => Unit
  sc.runJob(this, foldPartition, mergeResult)
  jobResult
}
is the same as the structure of reduce for RDD:
def reduce(f: (T, T) => T): T = withScope {
  // simplified skeleton: bodies elided
  val cleanF: (T, T) => T
  val reducePartition: Iterator[T] => Option[T]
  var jobResult: Option[T]
  val mergeResult: (Int, Option[T]) => Unit
  sc.runJob(this, reducePartition, mergeResult)
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}
where runJob is performed with disregard of partition order, resulting in the need for a commutative function.
foldPartition and reducePartition are equivalent in terms of order of processing and are effectively (by inheritance and delegation) implemented by reduceLeft and foldLeft on TraversableOnce.
Conclusion: fold on an RDD cannot depend on the order of chunks and needs commutativity and associativity.
Answered by morazow
One other difference for Scalding is the use of combiners in Hadoop.
Imagine your operation is a commutative monoid: with reduce it will also be applied on the map side, instead of shuffling/sorting all data to the reducers. With foldLeft this is not the case.
pipe.groupBy('product) {
  _.reduce('price -> 'total) { (sum: Double, price: Double) => sum + price }
  // reduce is .mapReduceMap in disguise
}

pipe.groupBy('product) {
  _.foldLeft('price -> 'total)(0.0) { (sum: Double, price: Double) => sum + price }
}
It is always good practice to define your operations as a monoid in Scalding.

