Spark / Scala: Passing RDD to Function
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/31040150/
Asked by Jes
I am curious what exactly passing an RDD to a function does in Spark.
import org.apache.spark.rdd.RDD

def my_func(x: RDD[String]): RDD[String] = {
  // do_something_here: e.g. some transformation of x
  x
}
Suppose we define a function as above. When we call the function and pass an existing RDD[String] object as the input parameter, does my_func make a "copy" of this RDD for the function parameter? In other words, is it call-by-reference or call-by-value?
Answered by marios
In Scala nothing gets copied (in the sense of the pass-by-value you have in C/C++) when passed around. Most of the basic types Int, String, Double, etc. are immutable, so passing them by reference is very safe. (Note: if you are passing a mutable object and you change it, then anyone else holding a reference to that object will see the change.)
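A minimal sketch of that mutability caveat (plain Scala, no Spark involved; the names are illustrative):

import scala.collection.mutable.ArrayBuffer

// The buffer is passed as a reference, so mutations inside
// the function are visible to the caller afterwards.
def appendGreeting(buf: ArrayBuffer[String]): Unit = {
  buf += "hello"
}

val words = ArrayBuffer("hi")
appendGreeting(words)
println(words)  // ArrayBuffer(hi, hello) -- the caller sees the change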
On top of that, RDDs are lazy, distributed, immutable collections. Passing RDDs into functions and applying transformations to them (map, filter, etc.) doesn't actually transfer any data or trigger any computation.
All chained transformations are "remembered" and will automatically be triggered, in the right order, when you invoke an action on the RDD, such as persisting it or collecting it locally at the driver (through collect(), take(n), etc.).
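For example (a hedged sketch; it assumes a live SparkContext named sc, and the input file path is purely illustrative):

import org.apache.spark.rdd.RDD

def my_func(x: RDD[String]): RDD[String] = {
  x.filter(_.nonEmpty).map(_.toUpperCase)  // only records the transformations
}

val lines = sc.textFile("data.txt")  // lazy: nothing is read yet
val upper = my_func(lines)           // still lazy: no data moved, no work done
val result = upper.collect()         // action: triggers the whole chain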
Answered by Antoni
Spark implements the principle of "sending the code to the data" rather than sending the data to the code. So here quite the opposite happens: it is the function that gets distributed and sent to the nodes holding the RDD's data.
RDDs are immutable, so your function will either create a new RDD as its result (a transformation) or produce some plain value (an action).
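A short sketch of both cases (the function names are illustrative):

import org.apache.spark.rdd.RDD

// Transformation: returns a new RDD; the input RDD is untouched.
def toWords(lines: RDD[String]): RDD[String] =
  lines.flatMap(_.split("\\s+"))

// Action: computes and returns a plain value to the driver.
def countWords(lines: RDD[String]): Long =
  lines.flatMap(_.split("\\s+")).count()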
The interesting question here is: if you define a function, what exactly is shipped out to the cluster (and distributed among the different nodes, with its transfer cost)? There is a nice explanation here:
http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark
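In short, that guide warns that referencing a method of a class instance inside a closure forces Spark to serialize and ship the whole instance. A sketch of the pattern it describes (the class and method names here are my own, purely illustrative):

import org.apache.spark.rdd.RDD

class MyHelper extends Serializable {
  def clean(s: String): String = s.trim

  // Referencing the method `clean` captures `this`, so the
  // entire MyHelper instance is serialized and sent to the executors.
  def doStuff(rdd: RDD[String]): RDD[String] = rdd.map(clean)

  // A local function does not capture `this`, so only
  // the small standalone closure is shipped.
  def doStuffLocal(rdd: RDD[String]): RDD[String] = {
    val f = (s: String) => s.trim
    rdd.map(f)
  }
}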

