Python 如何在 PySpark 的 UDF 中返回“元组类型”?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/36840563/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to return a "Tuple type" in a UDF in PySpark?
提问by kamalbanga
All the data types in pyspark.sql.types
are:
__all__ = [
"DataType", "NullType", "StringType", "BinaryType", "BooleanType", "DateType",
"TimestampType", "DecimalType", "DoubleType", "FloatType", "ByteType", "IntegerType",
"LongType", "ShortType", "ArrayType", "MapType", "StructField", "StructType"]
I have to write a UDF (in pyspark) which returns an array of tuples. What do I give the second argument to it which is the return type of the udf method? It would be something on the lines of ArrayType(TupleType())
...
我必须编写一个返回元组数组的 UDF(在 pyspark 中)。我给它的第二个参数是什么,它是 udf 方法的返回类型?这将是ArrayType(TupleType())
……
回答by zero323
There is no such thing as a TupleType
in Spark. Product types are represented as structs
with fields of specific type. For example if you want to return an array of pairs (integer, string) you can use schema like this:
TupleType
Spark 中没有 a 这样的东西。产品类型表示为structs
具有特定类型的字段。例如,如果您想返回一组对(整数、字符串),您可以使用如下模式:
from pyspark.sql.types import *
schema = ArrayType(StructType([
StructField("char", StringType(), False),
StructField("count", IntegerType(), False)
]))
Example usage:
用法示例:
from pyspark.sql.functions import udf
from collections import Counter
char_count_udf = udf(
lambda s: Counter(s).most_common(),
schema
)
df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["id", "value"])
df.select("*", char_count_udf(df["value"])).show(2, False)
## +---+-----+-------------------------+
## |id |value|PythonUDF#<lambda>(value)|
## +---+-----+-------------------------+
## |1 |foo |[[o,2], [f,1]] |
## |2 |bar |[[r,1], [a,1], [b,1]] |
## +---+-----+-------------------------+
回答by y.selivonchyk
Stackoverflow keeps directing me to this question, so I guess I'll add some info here.
Stackoverflow 一直在引导我解决这个问题,所以我想我会在这里添加一些信息。
Returning simple types from UDF:
从 UDF 返回简单类型:
from pyspark.sql.types import *
from pyspark.sql import functions as F
def get_df():
d = [(0.0, 0.0), (0.0, 3.0), (1.0, 6.0), (1.0, 9.0)]
df = sqlContext.createDataFrame(d, ['x', 'y'])
return df
df = get_df()
df.show()
# +---+---+
# | x| y|
# +---+---+
# |0.0|0.0|
# |0.0|3.0|
# |1.0|6.0|
# |1.0|9.0|
# +---+---+
func = udf(lambda x: str(x), StringType())
df = df.withColumn('y_str', func('y'))
func = udf(lambda x: int(x), IntegerType())
df = df.withColumn('y_int', func('y'))
df.show()
# +---+---+-----+-----+
# | x| y|y_str|y_int|
# +---+---+-----+-----+
# |0.0|0.0| 0.0| 0|
# |0.0|3.0| 3.0| 3|
# |1.0|6.0| 6.0| 6|
# |1.0|9.0| 9.0| 9|
# +---+---+-----+-----+
df.printSchema()
# root
# |-- x: double (nullable = true)
# |-- y: double (nullable = true)
# |-- y_str: string (nullable = true)
# |-- y_int: integer (nullable = true)
When integers are not enough:
当整数不够时:
df = get_df()
func = udf(lambda x: [0]*int(x), ArrayType(IntegerType()))
df = df.withColumn('list', func('y'))
func = udf(lambda x: {float(y): str(y) for y in range(int(x))},
MapType(FloatType(), StringType()))
df = df.withColumn('map', func('y'))
df.show()
# +---+---+--------------------+--------------------+
# | x| y| list| map|
# +---+---+--------------------+--------------------+
# |0.0|0.0| []| Map()|
# |0.0|3.0| [0, 0, 0]|Map(2.0 -> 2, 0.0...|
# |1.0|6.0| [0, 0, 0, 0, 0, 0]|Map(0.0 -> 0, 5.0...|
# |1.0|9.0|[0, 0, 0, 0, 0, 0...|Map(0.0 -> 0, 5.0...|
# +---+---+--------------------+--------------------+
df.printSchema()
# root
# |-- x: double (nullable = true)
# |-- y: double (nullable = true)
# |-- list: array (nullable = true)
# | |-- element: integer (containsNull = true)
# |-- map: map (nullable = true)
# | |-- key: float
# | |-- value: string (valueContainsNull = true)
Returning complex datatypes from UDF:
从 UDF 返回复杂数据类型:
df = get_df()
df = df.groupBy('x').agg(F.collect_list('y').alias('y[]'))
df.show()
# +---+----------+
# | x| y[]|
# +---+----------+
# |0.0|[0.0, 3.0]|
# |1.0|[9.0, 6.0]|
# +---+----------+
schema = StructType([
StructField("min", FloatType(), True),
StructField("size", IntegerType(), True),
StructField("edges", ArrayType(FloatType()), True),
StructField("val_to_index", MapType(FloatType(), IntegerType()), True)
# StructField('insanity', StructType([StructField("min_", FloatType(), True), StructField("size_", IntegerType(), True)]))
])
def func(values):
mn = min(values)
size = len(values)
lst = sorted(values)[::-1]
val_to_index = {x: i for i, x in enumerate(values)}
return (mn, size, lst, val_to_index)
func = udf(func, schema)
dff = df.select('*', func('y[]').alias('complex_type'))
dff.show(10, False)
# +---+----------+------------------------------------------------------+
# |x |y[] |complex_type |
# +---+----------+------------------------------------------------------+
# |0.0|[0.0, 3.0]|[0.0,2,WrappedArray(3.0, 0.0),Map(0.0 -> 0, 3.0 -> 1)]|
# |1.0|[6.0, 9.0]|[6.0,2,WrappedArray(9.0, 6.0),Map(9.0 -> 1, 6.0 -> 0)]|
# +---+----------+------------------------------------------------------+
dff.printSchema()
# +---+----------+------------------------------------------------------+
# |x |y[] |complex_type |
# +---+----------+------------------------------------------------------+
# |0.0|[0.0, 3.0]|[0.0,2,WrappedArray(3.0, 0.0),Map(0.0 -> 0, 3.0 -> 1)]|
# |1.0|[6.0, 9.0]|[6.0,2,WrappedArray(9.0, 6.0),Map(9.0 -> 1, 6.0 -> 0)]|
# +---+----------+------------------------------------------------------+
Passing multiple arguments to a UDF:
将多个参数传递给 UDF:
df = get_df()
func = udf(lambda arr: arr[0]*arr[1],FloatType())
df = df.withColumn('x*y', func(F.array('x', 'y')))
# +---+---+---+
# | x| y|x*y|
# +---+---+---+
# |0.0|0.0|0.0|
# |0.0|3.0|0.0|
# |1.0|6.0|6.0|
# |1.0|9.0|9.0|
# +---+---+---+
The code is purely for demo purposes, all above transformation are available in Spark code and would yield much better performance. As @zero323 in the comment above, UDFs should generally be avoided in pyspark; returning complex types should make you think about simplifying your logic.
代码纯粹用于演示目的,以上所有转换都可以在 Spark 代码中使用,并且会产生更好的性能。正如上面评论中的@zero323 一样,在 pyspark 中通常应避免使用 UDF;返回复杂类型应该让您考虑简化您的逻辑。
回答by loneStar
For the scala version instead of python. version 2.4
对于 Scala 版本而不是 python。版本 2.4
import org.apache.spark.sql.types._
val testschema : StructType = StructType(
StructField("number", IntegerType) ::
StructField("Array", ArrayType(StructType(StructField("cnt_rnk", IntegerType) :: StructField("comp", StringType) :: Nil))) ::
StructField("comp", StringType):: Nil)
The tree structure looks like this.
树结构看起来像这样。
testschema.printTreeString
root
|-- number: integer (nullable = true)
|-- Array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cnt_rnk: integer (nullable = true)
| | |-- corp_id: string (nullable = true)
|-- comp: string (nullable = true)