Apache Spark and Python lambda
Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/24575125/
Asked by jhon.smith
I have the following code
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
I have copied the example from here: http://spark.apache.org/examples.html
I am unable to understand this code, especially the keywords:
- flatMap,
- map, and
- reduceByKey
Can someone please explain in plain English what's going on?
Accepted answer by aaronman
map is the easiest: it essentially says do the given operation on every element of the sequence and return the resulting sequence (very similar to foreach). flatMap is the same thing, but instead of returning exactly one element per input element you are allowed to return a sequence (which can be empty). Here's an answer explaining the difference between map and flatMap. Lastly, reduceByKey takes an aggregate function (meaning it takes two arguments of the same type and returns that type; it should also be commutative and associative, otherwise you will get inconsistent results) which is used to aggregate every V for each K in your sequence of (K,V) pairs.
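To make the map/flatMap distinction concrete, here is a minimal PySpark sketch (my own illustration, not from the question; it assumes a local SparkContext and uses parallelize instead of an HDFS file):

from pyspark import SparkContext

sc = SparkContext("local", "map-vs-flatmap")  # illustrative local context, not part of the original example
lines = sc.parallelize(["hello world", "hello spark"])

# map keeps one output element per input element: each line becomes one list of words
print(lines.map(lambda line: line.split(" ")).collect())      # [['hello', 'world'], ['hello', 'spark']]

# flatMap flattens the returned sequences: every word becomes its own element
print(lines.flatMap(lambda line: line.split(" ")).collect())  # ['hello', 'world', 'hello', 'spark']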
EXAMPLE: reduce(lambda a, b: a + b, [1, 2, 3, 4])
This says aggregate the whole list with + so it will do:
1 + 2 = 3
3 + 3 = 6
6 + 4 = 10
final result is 10
reduceByKey is the same thing, except you do a separate reduce for each unique key.
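As a plain-Python sketch of the same idea (illustrative only, not Spark code; the reduce_by_key helper is hypothetical):

from functools import reduce
from collections import defaultdict

print(reduce(lambda a, b: a + b, [1, 2, 3, 4]))  # 10, i.e. ((1 + 2) + 3) + 4

def reduce_by_key(pairs, fn):
    groups = defaultdict(list)
    for key, value in pairs:  # group the values by key
        groups[key].append(value)
    return {key: reduce(fn, values) for key, values in groups.items()}  # reduce each group separately

print(reduce_by_key([("cat", 1), ("dog", 1), ("cat", 1)], lambda a, b: a + b))  # {'cat': 2, 'dog': 1}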
So, to explain it in your example:
file = spark.textFile("hdfs://...")  # open the text file; each element of the RDD is one line of the file
counts = (file.flatMap(lambda line: line.split(" "))  # flatMap is needed here to return every word (separated by a space) in the line as a separate element
          .map(lambda word: (word, 1))                # map each word to the pair (word, 1) so the 1's can be summed
          .reduceByKey(lambda a, b: a + b))           # get an RDD of the count of every unique word by aggregating (adding up) all the 1's from the last step
counts.saveAsTextFile("hdfs://...")  # save the result onto HDFS
So, why count words this way? The reason is that the MapReduce paradigm of programming is highly parallelizable and thus scales to doing this computation on terabytes or even petabytes of data.
I don't use Python much, so tell me if I made a mistake.
Answer by alfasin
See the inline comments:
file = spark.textFile("hdfs://...")  # opens a file
counts = (file.flatMap(lambda line: line.split(" "))  # iterate over the lines, split each line by space (into words)
          .map(lambda word: (word, 1))                # for each word, create the tuple (word, 1)
          .reduceByKey(lambda a, b: a + b))           # go over the tuples "by key" (first element) and sum the second elements
counts.saveAsTextFile("hdfs://...")
A more detailed explanation of reduceByKey can be found here
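For a quick local test of the same pipeline (a sketch of my own, assuming a local SparkContext and using collect() to print the counts instead of saving to HDFS):

from pyspark import SparkContext

sc = SparkContext("local", "wordcount")  # illustrative local context
lines = sc.parallelize(["the cat sat", "the cat ran"])  # stand-in for a textFile("hdfs://...") RDD

counts = (lines.flatMap(lambda line: line.split(" "))  # one element per word
               .map(lambda word: (word, 1))            # (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))       # sum the 1's per unique word

print(counts.collect())  # e.g. [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)] (order may vary)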
Answer by Alexk
The answers here are accurate at the code level, but it may help to understand what goes on under the hood.
My understanding is that when a reduce operation is called, there is a massive data shuffle: all K-V pairs produced by the map() operation that have the same key are assigned to a task that sums the values in that collection of K-V pairs. These tasks are then assigned to different physical processors, and the results are then collated with another data shuffle.
So if the map operation produces (cat 1) (cat 1) (dog 1) (cat 1) (cat 1) (dog 1),
the reduce operation produces (cat 4) (dog 2).
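A toy plain-Python illustration of that routing step (not Spark internals, just the idea that pairs with the same key end up summed by the same task):

from collections import defaultdict

mapped = [("cat", 1), ("cat", 1), ("dog", 1), ("cat", 1), ("cat", 1), ("dog", 1)]

# "shuffle": route each pair to a bucket ("task") chosen from its key
buckets = defaultdict(list)
for key, value in mapped:
    buckets[hash(key) % 2].append((key, value))

# each task sums the values it received, per key
totals = defaultdict(int)
for pairs in buckets.values():
    for key, value in pairs:
        totals[key] += value

print(dict(totals))  # {'cat': 4, 'dog': 2}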
Hope this helps.