Python: How to use a "for" loop in Spark with PySpark

Question

Asked by Linghao

I ran into a problem while using Spark with Python 3 in my project. In a key-value pair such as ('1','+1 2,3'), the part "2,3" is the content I want to check. So I wrote the following code (assume this key-value pair is stored in an RDD called p_list):

def add_label(x):   
    label=x[1].split()[0]  
    value=x[1].split()[1].split(",")  
    for i in value:     
        return (i,label)  
p_list=p_list.map(add_label)

After doing that, I only get the result ('2','+1'), but it should be ('2','+1') and ('3','+1'). It seems that the "for" loop in the map operation runs only once. How can I make it run multiple times? Or is there another way to implement something like a "for" loop inside a map or reduce operation?

I should mention that what I am really dealing with is a large dataset, so I have to use an AWS cluster and implement the loop with parallelization. The slave nodes in the cluster do not seem to understand the loop. How can I express this with Spark RDD functions? Or how can I perform such a loop in a pipelined way (which is one of the main design ideas of Spark RDDs)?

Answer 1

Accepted answer by Matt Cremeens

Your return statement cannot be inside the loop; otherwise, it returns after the first iteration, never to make it to the second iteration.
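To see the early return without a cluster, the function from the question can be called directly on the sample pair. This is only a local sketch in plain Python; no Spark is involved, and the sample tuple is the one quoted in the question.

```python
def add_label(x):
    label = x[1].split()[0]             # '+1'
    value = x[1].split()[1].split(",")  # ['2', '3']
    for i in value:
        # return ends the whole function call, so the loop never reaches '3'
        return (i, label)

print(add_label(('1', '+1 2,3')))  # ('2', '+1')
```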

What you could try is this:

result = []
for i in value:
    result.append((i,label))
return result

and then result would be a list of all of the tuples created inside the loop.
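One caveat when plugging this into the question's pipeline: with map, each input record produces exactly one output element, so the RDD would hold one list per record rather than individual tuples. The answer does not mention it, but RDD.flatMap is the usual way to flatten such lists into separate elements. The sketch below mimics both behaviours in plain Python so it runs without a cluster; the `records` sample data and the names `mapped` and `flattened` are assumptions made for illustration.

```python
from itertools import chain

def add_label(x):
    label = x[1].split()[0]
    value = x[1].split()[1].split(",")
    result = []
    for i in value:
        result.append((i, label))
    return result  # returned once, after the loop has finished

# Hypothetical sample data standing in for the contents of p_list.
records = [('1', '+1 2,3'), ('2', '-1 4')]

# Roughly what map does: one output element (a list) per input record.
mapped = [add_label(r) for r in records]

# Roughly what flatMap does: the per-record lists are flattened into one sequence.
flattened = list(chain.from_iterable(mapped))

print(mapped)     # [[('2', '+1'), ('3', '+1')], [('4', '-1')]]
print(flattened)  # [('2', '+1'), ('3', '+1'), ('4', '-1')]

# With Spark, assuming p_list is an RDD of such pairs:
# p_list = p_list.flatMap(add_label)
```

The work is still distributed by Spark: add_label runs independently on each record, and the for loop only iterates over the values inside a single record.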
