Pivot String Column on a Python PySpark DataFrame
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/37486910/
Pivot String column on Pyspark Dataframe
Asked by Ivan
I have a simple dataframe like this:
rdd = sc.parallelize(
    [
        (0, "A", 223, "201603", "PORT"),
        (0, "A", 22, "201602", "PORT"),
        (0, "A", 422, "201601", "DOCK"),
        (1, "B", 3213, "201602", "DOCK"),
        (1, "B", 3213, "201601", "PORT"),
        (2, "C", 2321, "201601", "DOCK")
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["id","type", "cost", "date", "ship"])
df_data.show()
+---+----+----+------+----+
| id|type|cost|  date|ship|
+---+----+----+------+----+
|  0|   A| 223|201603|PORT|
|  0|   A|  22|201602|PORT|
|  0|   A| 422|201601|DOCK|
|  1|   B|3213|201602|DOCK|
|  1|   B|3213|201601|PORT|
|  2|   C|2321|201601|DOCK|
+---+----+----+------+----+
and I need to pivot it by date:
df_data.groupby(df_data.id, df_data.type).pivot("date").avg("cost").show()
+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
|  2|   C|2321.0|  null|  null|
|  0|   A| 422.0|  22.0| 223.0|
|  1|   B|3213.0|3213.0|  null|
+---+----+------+------+------+
Everything works as expected. But now I need to pivot it and get a non-numeric column:
df_data.groupby(df_data.id, df_data.type).pivot("date").avg("ship").show()
and of course I would get an exception:
AnalysisException: u'"ship" is not a numeric column. Aggregation function can only be applied on a numeric column.;'
I would like to generate something along the lines of:
+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
|  2|   C|  DOCK|  null|  null|
|  0|   A|  DOCK|  PORT|  PORT|
|  1|   B|  PORT|  DOCK|  null|
+---+----+------+------+------+
Is that possible with pivot?
Answered by zero323
Assuming that (id, type, date) combinations are unique and your only goal is pivoting and not aggregation, you can use first (or any other function not restricted to numeric values):
from pyspark.sql.functions import first

(df_data
    .groupby(df_data.id, df_data.type)
    .pivot("date")
    .agg(first("ship"))
    .show())
## +---+----+------+------+------+
## | id|type|201601|201602|201603|
## +---+----+------+------+------+
## |  2|   C|  DOCK|  null|  null|
## |  0|   A|  DOCK|  PORT|  PORT|
## |  1|   B|  PORT|  DOCK|  null|
## +---+----+------+------+------+
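As an aside (not part of the original answer), if you already know the distinct date values, you can pass them to pivot explicitly; Spark then skips the extra job it would otherwise run to collect the distinct values of the pivot column:

(df_data
    .groupby(df_data.id, df_data.type)
    .pivot("date", ["201601", "201602", "201603"])
    .agg(first("ship"))
    .show())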
If these assumptions are not correct, you'll have to pre-aggregate your data, for example by taking the most common ship value:
from pyspark.sql.functions import max, struct

(df_data
    .groupby("id", "type", "date", "ship")
    .count()
    .groupby("id", "type")
    .pivot("date")
    .agg(max(struct("count", "ship")))
    .show())
## +---+----+--------+--------+--------+
## | id|type|  201601|  201602|  201603|
## +---+----+--------+--------+--------+
## |  2|   C|[1,DOCK]|    null|    null|
## |  0|   A|[1,DOCK]|[1,PORT]|[1,PORT]|
## |  1|   B|[1,PORT]|[1,DOCK]|    null|
## +---+----+--------+--------+--------+
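max over a struct compares its fields left to right, so it keeps the row with the highest count (ties broken by the ship value). If you want plain strings instead of structs in the pivoted columns, one follow-up (a sketch, not part of the original answer) is to extract the ship field from each date column afterwards:

from pyspark.sql.functions import col, max, struct

pivoted = (df_data
    .groupby("id", "type", "date", "ship")
    .count()
    .groupby("id", "type")
    .pivot("date")
    .agg(max(struct("count", "ship"))))

# Every column after id and type is a pivoted date column holding a
# (count, ship) struct; keep only the ship field of each.
(pivoted
    .select("id", "type",
            *[col(c).getField("ship").alias(c) for c in pivoted.columns[2:]])
    .show())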