Apache Spark and Python lambda
Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/24575125/
Asked by jhon.smith
I have the following code
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
I have copied the example from here: http://spark.apache.org/examples.html
I am unable to understand this code, especially the keywords:
- flatMap,
- map, and
- reduceByKey
Can someone please explain in plain English what's going on?
Accepted answer by aaronman
map is the easiest: it essentially says do the given operation on every element of the sequence and return the resulting sequence (very similar to foreach). flatMap is the same thing, but instead of returning exactly one element per input element you are allowed to return a sequence (which can be empty). Here's an answer explaining the difference between map and flatMap. Lastly, reduceByKey takes an aggregate function (meaning it takes two arguments of the same type and returns that type; it should also be commutative and associative, otherwise you will get inconsistent results) which is used to aggregate every V for each K in your sequence of (K,V) pairs.
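To make the map/flatMap distinction concrete, here is a minimal PySpark sketch (my own illustration, not from the question; it assumes a local SparkContext and uses parallelize instead of an HDFS file):

from pyspark import SparkContext

sc = SparkContext("local", "map-vs-flatmap")  # illustrative local context, not part of the original example
lines = sc.parallelize(["hello world", "hello spark"])

# map keeps one output element per input element: each line becomes one list of words
print(lines.map(lambda line: line.split(" ")).collect())      # [['hello', 'world'], ['hello', 'spark']]

# flatMap flattens the returned sequences: every word becomes its own element
print(lines.flatMap(lambda line: line.split(" ")).collect())  # ['hello', 'world', 'hello', 'spark']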
EXAMPLE: reduce(lambda a, b: a + b, [1, 2, 3, 4])
This says aggregate the whole list with + so it will do:
1 + 2 = 3
3 + 3 = 6
6 + 4 = 10
final result is 10
reduceByKey is the same thing, except you do a separate reduce for each unique key.
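As a plain-Python sketch of the same idea (illustrative only, not Spark code; the reduce_by_key helper is hypothetical):

from functools import reduce
from collections import defaultdict

print(reduce(lambda a, b: a + b, [1, 2, 3, 4]))  # 10, i.e. ((1 + 2) + 3) + 4

def reduce_by_key(pairs, fn):
    groups = defaultdict(list)
    for key, value in pairs:  # group the values by key
        groups[key].append(value)
    return {key: reduce(fn, values) for key, values in groups.items()}  # reduce each group separately

print(reduce_by_key([("cat", 1), ("dog", 1), ("cat", 1)], lambda a, b: a + b))  # {'cat': 2, 'dog': 1}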
So, to explain it in your example:
file = spark.textFile("hdfs://...")  # open the text file; each element of the RDD is one line of the file
counts = (file.flatMap(lambda line: line.split(" "))  # flatMap is needed here to return every word (separated by a space) in the line as a separate element
          .map(lambda word: (word, 1))                # map each word to the pair (word, 1) so the 1's can be summed
          .reduceByKey(lambda a, b: a + b))           # get an RDD of the count of every unique word by aggregating (adding up) all the 1's from the last step
counts.saveAsTextFile("hdfs://...")  # save the result onto HDFS
So, why count words this way? The reason is that the MapReduce paradigm of programming is highly parallelizable and thus scales to doing this computation on terabytes or even petabytes of data.
I don't use Python much, so tell me if I made a mistake.
Answer by alfasin
See the inline comments:
file = spark.textFile("hdfs://...")  # opens a file
counts = (file.flatMap(lambda line: line.split(" "))  # iterate over the lines, split each line by space (into words)
          .map(lambda word: (word, 1))                # for each word, create the tuple (word, 1)
          .reduceByKey(lambda a, b: a + b))           # go over the tuples "by key" (first element) and sum the second elements
counts.saveAsTextFile("hdfs://...")
A more detailed explanation of reduceByKey can be found here
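For a quick local test of the same pipeline (a sketch of my own, assuming a local SparkContext and using collect() to print the counts instead of saving to HDFS):

from pyspark import SparkContext

sc = SparkContext("local", "wordcount")  # illustrative local context
lines = sc.parallelize(["the cat sat", "the cat ran"])  # stand-in for a textFile("hdfs://...") RDD

counts = (lines.flatMap(lambda line: line.split(" "))  # one element per word
               .map(lambda word: (word, 1))            # (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))       # sum the 1's per unique word

print(counts.collect())  # e.g. [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)] (order may vary)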
Answer by Alexk
The answers here are accurate at the code level, but it may help to understand what goes on under the hood.
My understanding is that when a reduce operation is called, there is a massive data shuffle: all K-V pairs produced by the map() operation that have the same key are assigned to a task that sums the values in that collection of K-V pairs. These tasks are then assigned to different physical processors, and the results are then collated with another data shuffle.
So if the map operation produces (cat 1) (cat 1) (dog 1) (cat 1) (cat 1) (dog 1),
the reduce operation produces (cat 4) (dog 2).
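A toy plain-Python illustration of that routing step (not Spark internals, just the idea that pairs with the same key end up summed by the same task):

from collections import defaultdict

mapped = [("cat", 1), ("cat", 1), ("dog", 1), ("cat", 1), ("cat", 1), ("dog", 1)]

# "shuffle": route each pair to a bucket ("task") chosen from its key
buckets = defaultdict(list)
for key, value in mapped:
    buckets[hash(key) % 2].append((key, value))

# each task sums the values it received, per key
totals = defaultdict(int)
for pairs in buckets.values():
    for key, value in pairs:
        totals[key] += value

print(dict(totals))  # {'cat': 4, 'dog': 2}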
Hope this helps.