
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/36360540/


Spark - Sum of row values

Tags: scala, apache-spark

Asked by karoma

I have the following DataFrame:


January | February | March
-----------------------------
  10    |    10    |  10
  20    |    20    |  20
  50    |    50    |  50

I'm trying to add a column to this which is the sum of the values of each row.


January | February | March  | TOTAL
----------------------------------
  10    |    10    |   10   |  30
  20    |    20    |   20   |  60
  50    |    50    |   50   |  150

As far as I can see, all the built-in aggregate functions seem to be for calculating values in single columns. How do I go about using values across columns on a per-row basis (using Scala)?


I've gotten as far as


val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...

Answered by David Griffin

You were very close with this:


val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...

Instead, try this:


val newDf = df.select(colsToSum.map(col).reduce((c1, c2) => c1 + c2) as "sum")

I think this is the best of the answers, because it is as fast as the answer with the hard-coded SQL query, and as convenient as the one that uses the UDF. It's the best of both worlds -- and I didn't even add a full line of code!

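For completeness, a minimal runnable sketch of this answer, assuming `colsToSum` holds the month column names as in the question (the import is needed outside the spark-shell):

import org.apache.spark.sql.functions.col

val colsToSum = Seq("January", "February", "March")

// Fold the columns into a single expression: January + February + March
val newDf = df.select(colsToSum.map(col).reduce((c1, c2) => c1 + c2) as "sum")
newDf.show()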

Answered by Alberto Bonsanto

Alternatively, using Hugo's approach and example, you can create a UDF that receives any number of columns and sums them all.


from functools import reduce
from pyspark.sql.functions import udf

def superSum(*cols):
    return reduce(lambda a, b: a + b, cols)

add = udf(superSum)

# Pass every column of the DataFrame to the UDF
df.withColumn('total', add(*[df[x] for x in df.columns])).show()


+-------+--------+-----+-----+
|January|February|March|total|
+-------+--------+-----+-----+
|     10|      10|   10|   30|
|     20|      20|   20|   60|
+-------+--------+-----+-----+
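Since the question asks for Scala, a rough Scala equivalent of this UDF approach might look like the following sketch (not from the original answer; `sumUdf` is my own name, and the columns are packed into an array because a Scala UDF takes a fixed argument list):

import org.apache.spark.sql.functions.{array, col, udf}

// Sum the packed column values inside the UDF
val sumUdf = udf((xs: Seq[Int]) => xs.sum)

// array() collects all columns into one ArrayType column
df.withColumn("total", sumUdf(array(df.columns.map(col): _*))).show()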

Answered by Hugo Reyes

This code is in Python, but it can be easily translated:


# First we create an RDD in order to create a DataFrame:
rdd = sc.parallelize([(10, 10,10), (20, 20,20)])
df = rdd.toDF(['January', 'February', 'March'])
df.show()

# Here, we create a new column called 'TOTAL' which has results
# from add operation of columns df.January, df.February and df.March

df.withColumn('TOTAL', df.January + df.February + df.March).show()

Output:


+-------+--------+-----+
|January|February|March|
+-------+--------+-----+
|     10|      10|   10|
|     20|      20|   20|
+-------+--------+-----+

+-------+--------+-----+-----+
|January|February|March|TOTAL|
+-------+--------+-----+-----+
|     10|      10|   10|   30|
|     20|      20|   20|   60|
+-------+--------+-----+-----+

You can also create a User Defined Function if you want; here is a link to the Spark documentation: UserDefinedFunction (udf)

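As a quick check of the "easily translated" claim, a direct Scala version of the hard-coded sum above might look like this (a sketch, assuming a spark-shell session where `sc` and `toDF` are available):

val rdd = sc.parallelize(Seq((10, 10, 10), (20, 20, 20)))
val df = rdd.toDF("January", "February", "March")

// Same idea as the Python version: add the three columns directly
df.withColumn("TOTAL", df("January") + df("February") + df("March")).show()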

Answered by Paweł Kaczorowski

Working Scala example with dynamic column selection:


import sqlContext.implicits._
import org.apache.spark.sql.functions.col

val rdd = sc.parallelize(Seq((10, 10, 10), (20, 20, 20)))
val df = rdd.toDF("January", "February", "March")
df.show()

+-------+--------+-----+
|January|February|March|
+-------+--------+-----+
|     10|      10|   10|
|     20|      20|   20|
+-------+--------+-----+

val sumDF = df.withColumn("TOTAL", df.columns.map(c => col(c)).reduce((c1, c2) => c1 + c2))
sumDF.show()

+-------+--------+-----+-----+
|January|February|March|TOTAL|
+-------+--------+-----+-----+
|     10|      10|   10|   30|
|     20|      20|   20|   60|
+-------+--------+-----+-----+

Answered by Himaprasoon

You can use expr() for this. In Scala:


import org.apache.spark.sql.functions.expr

df.withColumn("TOTAL", expr("January+February+March"))
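If the column list isn't fixed, the same idea extends naturally by building the expression string from df.columns (my own extension of this answer, not part of the original):

// Builds "January+February+March" from whatever columns the DataFrame has
df.withColumn("TOTAL", expr(df.columns.mkString("+"))).show()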