scala `map` 和 `reduce` 方法如何在 Spark RDD 中工作?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/32519070/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How do `map` and `reduce` methods work in Spark RDDs?
提问by DesirePRG
Following code is from the quick start guide of Apache Spark. Can somebody explain me what is the "line" variable and where it comes from?
以下代码来自 Apache Spark 的快速入门指南。有人可以解释一下什么是“线”变量以及它来自哪里?
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
Also, how does a value get passed into a,b?
另外,如何将值传递给 a,b?
Link to the QSG http://spark.apache.org/docs/latest/quick-start.html
链接到 QSG http://spark.apache.org/docs/latest/quick-start.html
回答by Martin Senne
First, according to your link, the textfileis created as
首先,根据您的链接,textfile创建为
val textFile = sc.textFile("README.md")
such that textfileis a RDD[String]meaning it is a resilient distributed dataset of type String. The API to access is very similar to that of regular Scala collections.
使得textfile由RDD[String]这意味着它是一个类型的弹性分布的数据集String。要访问的 API 与常规 Scala 集合的 API 非常相似。
So now what does this mapdo?
那么现在这有什么作用map呢?
Imagine you have a list of Strings and want to convert that into a list of Ints, representing the length of each String.
假设您有一个Strings列表,并希望将其转换为一个 Int 列表,表示每个 String 的长度。
val stringlist: List[String] = List("ab", "cde", "f")
val intlist: List[Int] = stringlist.map( x => x.length )
The mapmethod expects a function. A function, that goes from String => Int. With that function, each element of the list is transformed. So the value of intlist is List( 2, 3, 1 )
该map方法需要一个函数。一个函数,来自String => Int. 使用该函数,列表的每个元素都会被转换。所以 intlist 的值是List( 2, 3, 1 )
Here, we have created an anonymous function from String => Int. That is x => x.length. One can even write the function more explicit as
在这里,我们从String => Int. 那就是x => x.length。甚至可以更明确地编写函数,如
stringlist.map( (x: String) => x.length )
If you do use write the above explicit, you can
如果你确实使用写上面的显式,你可以
val stringLength : (String => Int) = {
x => x.length
}
val intlist = stringlist.map( stringLength )
So, here it is absolutely evident, that stringLength is a function from Stringto Int.
所以,这里很明显,stringLength 是一个函数 from Stringto Int。
Remark: In general, mapis what makes up a so called Functor. While you provide a function from A => B, mapof the functor (here List) allows you use that function also to go from List[A] => List[B]. This is called lifting.
备注:一般来说,map就是所谓的 Functor。当您提供来自 A => Bmap的函数时,函子(此处为 List)的函数也允许您使用该函数从List[A] => List[B]. 这称为提升。
Answers to your questions
回答您的问题
What is the "line" variable?
什么是“线”变量?
As mentioned above, lineis the input parameter of the function line => line.split(" ").size
如上所述,line是函数的输入参数line => line.split(" ").size
More explicit
(line: String) => line.split(" ").size
更明确
(line: String) => line.split(" ").size
Example: If lineis "hello world", the function returns 2.
示例:如果line是“hello world”,则函数返回 2。
"hello world"
=> Array("hello", "world") // split
=> 2 // size of Array
How does a value get passed into a,b?
值如何传递给 a,b?
reducealso expects a function from (A, A) => A, where Ais the type of your RDD. Lets call this function op.
reduce还需要一个函数 from (A, A) => A,A你的RDD. 让我们调用这个函数op。
What does reduce. Example:
有什么作用reduce。例子:
List( 1, 2, 3, 4 ).reduce( (x,y) => x + y )
Step 1 : op( 1, 2 ) will be the first evaluation.
Start with 1, 2, that is
x is 1 and y is 2
Step 2: op( op( 1, 2 ), 3 ) - take the next element 3
Take the next element 3:
x is op(1,2) = 3 and y = 3
Step 3: op( op( op( 1, 2 ), 3 ), 4)
Take the next element 4:
x is op(op(1,2), 3 ) = op( 3,3 ) = 6 and y is 4
Result here is the sum of the list elements, 10.
这里的结果是列表元素的总和,10。
Remark: In general reducecalculates
备注:一般reduce计算
op( op( ... op(x_1, x_2) ..., x_{n-1}), x_n)
Full example
完整示例
First, textfile is a RDD[String], say
首先,文本文件是一个 RDD[String],比如说
TextFile
"hello Tyth"
"cool example, eh?"
"goodbye"
TextFile.map(line => line.split(" ").size)
2
3
1
TextFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
3
Steps here, recall `(a, b) => if (a > b) a else b)`
- op( op(2, 3), 1) evaluates to op(3, 1), since op(2, 3) = 3
- op( 3, 1 ) = 3
回答by Tyth
Mapand reduceare methods of RDD class, which has interface similar to scala collections.
Map和reduce是RDD类的方法,它具有类似于scala集合的接口。
What you pass to methods mapand reduceare actually anonymous function (with one param in map, and with two parameters in reduce). textFilecalls provided function for every element (line of text in this context) it has.
你传递给方法map和reduce实际上是匿名函数(在地图上一个PARAM,并用两个参数在减少)。textFile为它拥有的每个元素(此上下文中的文本行)调用提供的函数。
Maybe you should read some scala collection introduction first.
也许你应该先阅读一些 Scala 集合介绍。
You can read more about RDD class API here: https://spark.apache.org/docs/1.2.1/api/scala/#org.apache.spark.rdd.RDD
您可以在此处阅读有关 RDD 类 API 的更多信息:https: //spark.apache.org/docs/1.2.1/api/scala/#org.apache.spark.rdd.RDD
回答by Gaurang Shah
what mapfunction does is, it takes the list of arguments and map it to some function. Similar to map function in python, if you are familiar.
什么地图功能的作用是,它需要的参数列表并将其映射到一些功能。类似于python中的map函数,如果你熟悉的话。
Also, File is like a list of Strings. (not exactly but that's how it's being iterated)
此外,文件就像一个字符串列表。(不完全是,但这就是它的迭代方式)
Let's consider this is your file.
让我们考虑一下这是您的文件。
val list_a: List[String] = List("first line", "second line", "last line")
Now let's see how mapfunction works.
现在让我们看看map函数是如何工作的。
We need two things, list of valueswhich we already have and functionto which we want to map this values. let's consider really simple function for understanding.
我们需要两件东西,list of values我们已经拥有并且function我们希望将这些值映射到它。让我们考虑一下非常简单的函数来理解。
val myprint = (arg:String)=>println(arg)
this function simply takes single String argument and prints on the console.
此函数仅采用单个 String 参数并在控制台上打印。
myprint("hello world")
hello world
if we match this function to your list, it's gonna print all the lines
如果我们将此函数与您的列表匹配,它将打印所有行
list_a.map(myprint)
We can write an anonymous function as mentioned below as well, which does the same thing.
我们也可以编写一个匿名函数,如下所述,它做同样的事情。
list_a.map(arg=>println(arg))
in your case, lineis the first line of the file. you could change the argument name as you like. for example, in above example, if I change argto lineit would work without any issue
在您的情况下,line是文件的第一行。您可以根据需要更改参数名称。例如,在上面的例子中,如果我改变arg到line它的工作没有任何问题
list_a.map(line=>println(line))

