Python Spark: add a new column to a dataframe using the value from the previous row

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/34295642/

Spark add new column to dataframe with value from previous row

python, apache-spark, dataframe, pyspark, apache-spark-sql

Asked by Kito

I'm wondering how I can achieve the following in Spark (PySpark).

Initial Dataframe:

+---+---+
| id|num|
+---+---+
|  4|9.0|
|  3|7.0|
|  2|3.0|
|  1|5.0|
+---+---+

Resulting Dataframe:

+---+---+-------+
| id|num|new_Col|
+---+---+-------+
|  4|9.0|    7.0|
|  3|7.0|    3.0|
|  2|3.0|    5.0|
+---+---+-------+

I generally manage to "append" new columns to a dataframe by using something like df.withColumn("new_Col", df.num * 10).

However, I have no idea how I can achieve this "shift of rows" for the new column, so that the new column has the value of a field from the previous row (as shown in the example). I also couldn't find anything in the API documentation about accessing a certain row in a DataFrame by index.

Any help would be appreciated.

Accepted answer by zero323

You can use the lag window function as follows:

from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

# sc is the SparkContext that the PySpark shell provides
df = sc.parallelize([(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)]).toDF(["id", "num"])

# A window over the whole dataframe (no partitioning), ordered by id
w = Window().partitionBy().orderBy(col("id"))

# lag("num") takes num from the previous row in the window; the first row
# gets null, which na.drop() then removes
df.select("*", lag("num").over(w).alias("new_col")).na.drop().show()

## +---+---+-------+
## | id|num|new_col|
## +---+---+-------+
## |  2|3.0|    5.0|
## |  3|7.0|    3.0|
## |  4|9.0|    7.0|
## +---+---+-------+
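
If you would rather keep the first row than drop it, lag also accepts a default value that is used where no previous row exists (a small variant of the code above, not part of the original answer):

# 0.0 replaces the null in the first row, so na.drop() is no longer needed
df.select("*", lag("num", 1, 0.0).over(w).alias("new_col")).show()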

but there are some important issues:

  1. If you need a global operation (not partitioned by some other column or columns), it is extremely inefficient, because all rows have to be moved into a single partition; when a partitioning column exists, the window stays distributed, as sketched right after this list.
  2. You need a natural way to order your data.
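
For illustration, here is a minimal sketch of the partitioned case (the group column and its values are hypothetical, not part of the original question): when lag only needs to look back within each group, partitionBy lets Spark process the groups independently instead of funneling everything through one partition.

grouped = sc.parallelize(
    [("a", 1, 5.0), ("a", 2, 3.0), ("b", 1, 7.0), ("b", 2, 9.0)]
).toDF(["group", "id", "num"])

# Each group is ordered and lagged on its own, so no global shuffle into
# a single partition is needed
w_part = Window.partitionBy("group").orderBy(col("id"))
grouped.select("*", lag("num").over(w_part).alias("new_col")).show()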

While the second issue is almost never a problem, the first one can be a deal-breaker. If this is the case, you should simply convert your DataFrame to an RDD and compute lag manually.
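
One way to do that (a sketch of the idea, not code from the original answer) is to number the sorted rows with zipWithIndex and join each row to its predecessor, which keeps the computation distributed:

# Pair every row with its position in the id order
indexed = df.orderBy("id").rdd.zipWithIndex()       # (Row, index)
current = indexed.map(lambda x: (x[1], x[0]))       # (index, Row)
# Shift the indices by one so that row i joins against row i - 1
previous = indexed.map(lambda x: (x[1] + 1, x[0].num))

lagged = current.join(previous).map(
    lambda kv: (kv[1][0].id, kv[1][0].num, kv[1][1])
)
lagged.toDF(["id", "num", "new_col"]).show()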

Answer by mputha

In Scala (the question asks about PySpark, but the approach is the same):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val df = sc.parallelize(Seq((4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0))).toDF("id", "num")
df.show
+---+---+
| id|num|
+---+---+
|  4|9.0|
|  3|7.0|
|  2|3.0|
|  1|5.0|
+---+---+
// Window ordered by id; lag("num", 1, 0) returns 0 when there is no previous row
val w = Window.orderBy("id")
df.withColumn("new_column", lag("num", 1, 0).over(w)).show
+---+---+----------+
| id|num|new_column|
+---+---+----------+
|  1|5.0|       0.0|
|  2|3.0|       5.0|
|  3|7.0|       3.0|
|  4|9.0|       7.0|
+---+---+----------+
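
For completeness, a PySpark equivalent of the Scala snippet above (a sketch that rebuilds the same df so it runs on its own; sc is the shell's SparkContext):

from pyspark.sql.functions import lag
from pyspark.sql.window import Window

df = sc.parallelize([(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)]).toDF(["id", "num"])

# lag("num", 1, 0) keeps the first row and fills it with 0 instead of null
w = Window.orderBy("id")
df.withColumn("new_column", lag("num", 1, 0).over(w)).show()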