Scala: adding a column of row sums across a list of columns in a Spark DataFrame
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/37624699/
Adding a column of rowsums across a list of columns in Spark Dataframe
Asked by Sarah
I have a Spark dataframe with several columns. I want to add a column on to the dataframe that is a sum of a certain number of the columns.
For example, my data looks like this:
ID var1 var2 var3 var4 var5
a 5 7 9 12 13
b 6 4 3 20 17
c 4 9 4 6 9
d 1 2 6 8 1
I want a column added summing the rows for specific columns:
ID var1 var2 var3 var4 var5 sums
a 5 7 9 12 13 46
b 6 4 3 20 17 50
c 4 9 4 6 9 32
d 1 2 6 8 1 18
I know it is possible to add columns together if you know the specific columns to add:
val newdf = df.withColumn("sumofcolumns", df("var1") + df("var2"))
But is it possible to pass a list of column names and add them together? This is based off of this answer, which is basically what I want but uses the Python API instead of Scala (Add column sum as new column in PySpark dataframe). I think something like this would work:
//Select columns to sum
val columnstosum = ("var1", "var2","var3","var4","var5")
// Create new column called sumofcolumns which is sum of all columns listed in columnstosum
val newdf = df.withColumn("sumofcolumns", df.select(columnstosum.head, columnstosum.tail: _*).sum)
This throws the error "value sum is not a member of org.apache.spark.sql.DataFrame". Is there a way to sum across columns?
Thanks in advance for your help.
Answered by Paweł Jurczenko
You should try the following:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val input = sc.parallelize(Seq(
  ("a", 5, 7, 9, 12, 13),
  ("b", 6, 4, 3, 20, 17),
  ("c", 4, 9, 4, 6, 9),
  ("d", 1, 2, 6, 8, 1)
)).toDF("ID", "var1", "var2", "var3", "var4", "var5")

// Turn each column name into a Column expression and fold them with + to build the sum.
val columnsToSum = List(col("var1"), col("var2"), col("var3"), col("var4"), col("var5"))
val output = input.withColumn("sums", columnsToSum.reduce(_ + _))
output.show()
Then the result is:
+---+----+----+----+----+----+----+
| ID|var1|var2|var3|var4|var5|sums|
+---+----+----+----+----+----+----+
| a| 5| 7| 9| 12| 13| 46|
| b| 6| 4| 3| 20| 17| 50|
| c| 4| 9| 4| 6| 9| 32|
| d| 1| 2| 6| 8| 1| 18|
+---+----+----+----+----+----+----+
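The SQLContext setup above reflects Spark 1.x. On Spark 2.x and later the same reduce-over-columns pattern can be written against SparkSession; the following is a minimal sketch, assuming a local session (the master and app name are only placeholders):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Placeholder local session for illustration; adjust master/appName to your environment.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rowsums-sketch")
  .getOrCreate()
import spark.implicits._

val input = Seq(
  ("a", 5, 7, 9, 12, 13),
  ("b", 6, 4, 3, 20, 17),
  ("c", 4, 9, 4, 6, 9),
  ("d", 1, 2, 6, 8, 1)
).toDF("ID", "var1", "var2", "var3", "var4", "var5")

// Same idea: map the names to Column expressions and fold them with +.
val columnsToSum = List("var1", "var2", "var3", "var4", "var5").map(col)
input.withColumn("sums", columnsToSum.reduce(_ + _)).show()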
Answered by zero323
Plain and simple:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, col}
def sum_(cols: Column*) = cols.foldLeft(lit(0))(_ + _)
val columnstosum = Seq("var1", "var2", "var3", "var4", "var5").map(col _)
df.select(sum_(columnstosum: _*))
with Python equivalent:
from functools import reduce
from operator import add
from pyspark.sql.functions import lit, col

def sum_(*cols):
    return reduce(add, cols, lit(0))

columnstosum = [col(x) for x in ["var1", "var2", "var3", "var4", "var5"]]
df.select("*", sum_(*columnstosum))
Both will default to NA if there is a missing value in the row. You can use DataFrameNaFunctions.fill or the coalesce function to avoid that.
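As a concrete illustration of that last point, here is a minimal Scala sketch of both options, assuming a DataFrame df whose var1..var5 columns may contain nulls (the column names simply mirror the example above):
import org.apache.spark.sql.functions.{coalesce, col, lit}

val cols = Seq("var1", "var2", "var3", "var4", "var5")

// Option 1: replace nulls with 0 up front, then sum as before.
val filled = df.na.fill(0, cols)
val summedAfterFill = filled.withColumn("sums", cols.map(col).reduce(_ + _))

// Option 2: wrap each column in coalesce so a null contributes 0 to the sum.
val safeSum = cols.map(c => coalesce(col(c), lit(0))).reduce(_ + _)
val summedWithCoalesce = df.withColumn("sums", safeSum)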
Answered by Abu Shoeb
I assume you have a dataframe df. Then you can sum up all cols except your ID col. This is helpful when you have many cols and you don't want to manually mention the names of all columns like everyone mentioned above. This post has the same answer.
import org.apache.spark.sql.functions.col

val sumAll = df.columns.collect { case x if x != "ID" => col(x) }.reduce(_ + _)
df.withColumn("sum", sumAll)
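A related variation, in case the frame has several non-numeric columns rather than a single ID, is to pick the columns to sum by data type instead of by name. This is only a sketch of the idea, assuming you want every numeric field:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.NumericType

// Keep only the numeric columns, then fold them with + as above.
val numericCols = df.schema.fields.collect {
  case f if f.dataType.isInstanceOf[NumericType] => col(f.name)
}
df.withColumn("sum", numericCols.reduce(_ + _))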
Answered by Aerianis
Here's an elegant solution using Python:
NewDF = OldDF.withColumn('sums', sum(OldDF[col] for col in OldDF.columns[1:]))
Hopefully this will influence something similar in Spark... anyone?
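For reference, a rough Scala counterpart of that one-liner, assuming the ID sits in the first column as in the example data (oldDF/newDF just mirror the names in the Python snippet and are only illustrative):
import org.apache.spark.sql.functions.col

// Drop the first column (ID) and sum the remaining ones.
val newDF = oldDF.withColumn("sums", oldDF.columns.drop(1).map(col).reduce(_ + _))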

