
Notice: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/40686233/

Time: 2020-08-19 23:49:06  Source: igfitidea

How can I use a "for" loop in Spark with PySpark?

python, for-loop, pyspark

Asked by Linghao

I ran into a problem while using Spark with Python 3 in my project. In a key-value pair such as ('1','+1 2,3'), the part "2,3" is the content I want to check. So I wrote the following code (assume this key-value pair is stored in an RDD called p_list):




def add_label(x):
    label = x[1].split()[0]
    value = x[1].split()[1].split(",")
    for i in value:
        return (i, label)

p_list = p_list.map(add_label)


After doing this, I only get the result ('2','+1'), but it should be ('2','+1') and ('3','+1'). It seems that the "for" loop inside the map operation runs only once. How can I make it run multiple times? Or is there another way to implement something like a "for" loop in a map or reduce operation?


I should mention that what I am really dealing with is a large dataset, so I have to use an AWS cluster and run the loop in parallel. The worker nodes in the cluster do not seem to understand the loop. How can I express it with Spark RDD functions? Or how can I perform such a loop in a pipelined way (which is one of the main design ideas of Spark RDDs)?


Accepted answer by Matt Cremeens

Your return statement cannot be inside the loop; otherwise, it returns after the first iteration, never to make it to the second iteration.
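The early exit can be seen in plain Python, without Spark. This is a minimal sketch; the function name first_only and the sample values are hypothetical, chosen only for illustration.

```python
def first_only(values):
    for v in values:
        print("visiting", v)
        # `return` leaves the function immediately, so "3" is never visited.
        return v


result = first_only(["2", "3"])
print(result)
```

Only the first element is ever visited, which matches the single ('2','+1') pair seen in the question.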


What you could try is this:


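def add_label(x):
    label = x[1].split()[0]
    value = x[1].split()[1].split(",")
    result = []
    for i in value:
        result.append((i, label))
    return result  # return once, after the loop has finished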

and then result would be a list of all of the tuples created inside the loop.
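One caveat worth noting: when a function that returns a list is passed to map, each input record becomes a single list in the resulting RDD rather than separate pairs. If separate ('2','+1') and ('3','+1') records are wanted, as the question suggests, RDD.flatMap is the usual way to flatten them. Below is a minimal sketch, assuming p_list holds (key, 'label v1,v2,...') pairs as in the question. The generator variant of add_label and the local stand-in built on itertools.chain are illustrative choices, not part of the original answer.

```python
from itertools import chain


def add_label(x):
    # x is a (key, "label v1,v2,...") pair, e.g. ('1', '+1 2,3').
    label, values = x[1].split()[:2]
    for i in values.split(","):
        # `yield` hands back one pair per value without ending the loop.
        yield (i, label)


# Local stand-in for flatMap, so the sketch runs without a Spark cluster:
# apply the function to every record and flatten the results.
records = [("1", "+1 2,3")]
pairs = list(chain.from_iterable(map(add_label, records)))
print(pairs)

# With Spark available, the equivalent call on the question's RDD would be:
#   p_list = p_list.flatMap(add_label)
```

Because flatMap runs per record on the workers like map does, no explicit driver-side loop over the dataset is needed.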
