Scala: adding a column of row sums across a list of columns in a Spark DataFrame
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/37624699/
Adding a column of rowsums across a list of columns in Spark Dataframe
Asked by Sarah
I have a Spark dataframe with several columns. I want to add a column on to the dataframe that is a sum of a certain number of the columns.
For example, my data looks like this:
ID var1 var2 var3 var4 var5
a 5 7 9 12 13
b 6 4 3 20 17
c 4 9 4 6 9
d 1 2 6 8 1
I want a column added summing the rows for specific columns:
ID var1 var2 var3 var4 var5 sums
a 5 7 9 12 13 46
b 6 4 3 20 17 50
c 4 9 4 6 9 32
d 1 2 6 8 1 18
I know it is possible to add columns together if you know the specific columns to add:
val newdf = df.withColumn("sumofcolumns", df("var1") + df("var2"))
But is it possible to pass a list of column names and add them together? This is based off of this answer, which is basically what I want but uses the Python API instead of Scala (Add column sum as new column in PySpark dataframe). I think something like this would work:
//Select columns to sum
val columnstosum = ("var1", "var2","var3","var4","var5")
// Create new column called sumofcolumns which is sum of all columns listed in columnstosum
val newdf = df.withColumn("sumofcolumns", df.select(columnstosum.head, columnstosum.tail: _*).sum)
This throws the error "value sum is not a member of org.apache.spark.sql.DataFrame". Is there a way to sum across columns?
Thanks in advance for your help.
Answered by Paweł Jurczenko
You should try the following:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val input = sc.parallelize(Seq(
  ("a", 5, 7, 9, 12, 13),
  ("b", 6, 4, 3, 20, 17),
  ("c", 4, 9, 4, 6, 9),
  ("d", 1, 2, 6, 8, 1)
)).toDF("ID", "var1", "var2", "var3", "var4", "var5")

// Turn each column name into a Column expression and fold them with + to build the sum.
val columnsToSum = List(col("var1"), col("var2"), col("var3"), col("var4"), col("var5"))
val output = input.withColumn("sums", columnsToSum.reduce(_ + _))
output.show()
Then the result is:
+---+----+----+----+----+----+----+
| ID|var1|var2|var3|var4|var5|sums|
+---+----+----+----+----+----+----+
| a| 5| 7| 9| 12| 13| 46|
| b| 6| 4| 3| 20| 17| 50|
| c| 4| 9| 4| 6| 9| 32|
| d| 1| 2| 6| 8| 1| 18|
+---+----+----+----+----+----+----+
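The SQLContext setup above reflects Spark 1.x. On Spark 2.x and later the same reduce-over-columns pattern can be written against SparkSession; the following is a minimal sketch, assuming a local session (the master and app name are only placeholders):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Placeholder local session for illustration; adjust master/appName to your environment.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rowsums-sketch")
  .getOrCreate()
import spark.implicits._

val input = Seq(
  ("a", 5, 7, 9, 12, 13),
  ("b", 6, 4, 3, 20, 17),
  ("c", 4, 9, 4, 6, 9),
  ("d", 1, 2, 6, 8, 1)
).toDF("ID", "var1", "var2", "var3", "var4", "var5")

// Same idea: map the names to Column expressions and fold them with +.
val columnsToSum = List("var1", "var2", "var3", "var4", "var5").map(col)
input.withColumn("sums", columnsToSum.reduce(_ + _)).show()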
Answered by zero323
Plain and simple:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, col}
def sum_(cols: Column*) = cols.foldLeft(lit(0))(_ + _)
val columnstosum = Seq("var1", "var2", "var3", "var4", "var5").map(col _)
df.select(sum_(columnstosum: _*))
with Python equivalent:
from functools import reduce
from operator import add
from pyspark.sql.functions import lit, col

def sum_(*cols):
    return reduce(add, cols, lit(0))

columnstosum = [col(x) for x in ["var1", "var2", "var3", "var4", "var5"]]
df.select("*", sum_(*columnstosum))
Both will default to NA if there is a missing value in the row. You can use DataFrameNaFunctions.fill or the coalesce function to avoid that.
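As a concrete illustration of that last point, here is a minimal Scala sketch of both options, assuming a DataFrame df whose var1..var5 columns may contain nulls (the column names simply mirror the example above):
import org.apache.spark.sql.functions.{coalesce, col, lit}

val cols = Seq("var1", "var2", "var3", "var4", "var5")

// Option 1: replace nulls with 0 up front, then sum as before.
val filled = df.na.fill(0, cols)
val summedAfterFill = filled.withColumn("sums", cols.map(col).reduce(_ + _))

// Option 2: wrap each column in coalesce so a null contributes 0 to the sum.
val safeSum = cols.map(c => coalesce(col(c), lit(0))).reduce(_ + _)
val summedWithCoalesce = df.withColumn("sums", safeSum)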
Answered by Abu Shoeb
I assume you have a dataframe df. Then you can sum up all cols except your ID col. This is helpful when you have many cols and you don't want to manually mention the names of all columns like everyone mentioned above. This post has the same answer.
import org.apache.spark.sql.functions.col

val sumAll = df.columns.collect { case x if x != "ID" => col(x) }.reduce(_ + _)
df.withColumn("sum", sumAll)
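A related variation, in case the frame has several non-numeric columns rather than a single ID, is to pick the columns to sum by data type instead of by name. This is only a sketch of the idea, assuming you want every numeric field:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.NumericType

// Keep only the numeric columns, then fold them with + as above.
val numericCols = df.schema.fields.collect {
  case f if f.dataType.isInstanceOf[NumericType] => col(f.name)
}
df.withColumn("sum", numericCols.reduce(_ + _))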
Answered by Aerianis
Here's an elegant solution using Python:
NewDF = OldDF.withColumn('sums', sum(OldDF[col] for col in OldDF.columns[1:]))
Hopefully this will influence something similar in Spark... anyone?
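For reference, a rough Scala counterpart of that one-liner, assuming the ID sits in the first column as in the example data (oldDF/newDF just mirror the names in the Python snippet and are only illustrative):
import org.apache.spark.sql.functions.col

// Drop the first column (ID) and sum the remaining ones.
val newDF = oldDF.withColumn("sums", oldDF.columns.drop(1).map(col).reduce(_ + _))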

