Python: How to create a udf in PySpark which returns an array of strings?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/47682927/

Date: 2020-08-19 18:19:07  Source: igfitidea

How to create a udf in PySpark which returns an array of strings?

python · apache-spark · pyspark · apache-spark-sql · user-defined-functions

Asked by Hunle

I have a udf which returns a list of strings. This should not be too hard. I pass in the data type when creating the udf, since it returns an array of strings: ArrayType(StringType).


Now, somehow this is not working:


The dataframe I'm operating on is df_subsets_concat and looks like this:


df_subsets_concat.show(3,False)
+----------------------+
|col1                  |
+----------------------+
|oculunt               |
|predistposed          |
|incredulous           |
+----------------------+
only showing top 3 rows

and the code is

代码是

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType, StringType

my_udf = lambda domain: ['s','n']
label_udf = udf(my_udf, ArrayType(StringType))
df_subsets_concat_with_md = df_subsets_concat.withColumn('subset', label_udf(df_subsets_concat.col1))

and the result is


/usr/lib/spark/python/pyspark/sql/types.py in __init__(self, elementType, containsNull)
    288         False
    289         """
--> 290         assert isinstance(elementType, DataType), "elementType should be DataType"
    291         self.elementType = elementType
    292         self.containsNull = containsNull

AssertionError: elementType should be DataType

It is my understanding that this was the correct way to do this. Here are some resources:

- pySpark Data Frames "assert isinstance(dataType, DataType), "dataType should be DataType""
- How to return a "Tuple type" in a UDF in PySpark?


But neither of these has helped me resolve why this is not working. I am using PySpark 1.6.1.


How to create a udf in pyspark which returns an array of strings?


Answered by Psidom

You need to initialize a StringType instance:


label_udf = udf(my_udf, ArrayType(StringType()))
#                                           ^^ 
df.withColumn('subset', label_udf(df.col1)).show()
+------------+------+
|        col1|subset|
+------------+------+
|     oculunt|[s, n]|
|predistposed|[s, n]|
| incredulous|[s, n]|
+------------+------+
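The assertion in the traceback fires because ArrayType's constructor checks isinstance(elementType, DataType), and the class object StringType is not itself a DataType instance; only StringType() is. The pitfall can be sketched with plain-Python stand-ins (no Spark required; the classes below only mimic pyspark.sql.types and are not the real library):

```python
# Stand-in classes mimicking pyspark.sql.types (assumption: simplified mimics,
# not the actual PySpark implementation).
class DataType:
    pass

class StringType(DataType):
    pass

class ArrayType(DataType):
    def __init__(self, elementType, containsNull=True):
        # Mirrors the assert seen in the traceback above.
        assert isinstance(elementType, DataType), "elementType should be DataType"
        self.elementType = elementType
        self.containsNull = containsNull

# Passing the class itself raises AssertionError...
try:
    ArrayType(StringType)
except AssertionError as e:
    print(e)  # elementType should be DataType

# ...while passing an instance works.
arr = ArrayType(StringType())
print(type(arr.elementType).__name__)  # StringType
```

The same class-versus-instance distinction is why `ArrayType(StringType())` succeeds in the answer while `ArrayType(StringType)` failed in the question.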