Scala: How to sum the values of one column of a DataFrame in Spark/Scala
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/37032025/
How to sum the values of one column of a dataframe in spark/scala
Asked by Ectoras
I have a DataFrame that I read from a CSV file with many columns like: timestamp, steps, heartrate, etc.
I want to sum the values of each column, for instance the total number of steps in the "steps" column.
As far as I can see, I want to use these kinds of functions: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can't understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Am I using the function sum wrongly? Do I need to use the function map first? And if yes, how?
A simple example would be very helpful! I started writing Scala recently.
Accepted answer by Alberto Bonsanto
If you want to sum all values of one column, it's more efficient to use the DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._

// Build a one-column DataFrame, then sum it by reducing over the underlying RDD.
val df = sc.parallelize(Array(10, 2, 3, 4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_ + _)
//res1 Int = 19
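For comparison, the same reduction can be written with the typed Dataset API instead of casting each Row by hand (a minimal sketch, not from the original answer; it assumes the implicits imported above are in scope):
// Convert the single column to a Dataset[Int] and reduce it.
val total: Int = df.select("steps").as[Int].reduce(_ + _)
// total: Int = 19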
Answer by Daniel de Paula
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you can get all the aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
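For example, with a small made-up DataFrame (hypothetical column names; this sketch assumes spark.implicits._ and org.apache.spark.sql.functions._ are in scope), the individual sums can be read back from the returned Row:
// sum over an integer column returns a Long, sum over a double column returns a Double.
val df2 = Seq((1, 10.0), (2, 20.0), (3, 30.0)).toDF("col1", "col2")
val totals = df2.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2")).first
val sumCol1 = totals.getLong(0)    // 6
val sumCol2 = totals.getDouble(1)  // 60.0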
Edit 2:
For dynamically applying the aggregations, the following options are available (an end-to-end sketch follows the list):
- Applying to all numeric columns at once:
df.groupBy().sum()
- Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
- Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
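Putting the last option together end to end (a minimal sketch; the data and column names are invented for illustration, and spark.implicits._ plus org.apache.spark.sql.functions._ are assumed to be in scope):
// Build a small DataFrame with two numeric columns.
val data = Seq((1, 2.0), (3, 4.0), (5, 6.0)).toDF("col1", "col2")

// Dynamically build one sum expression per column, cast to double and aliased.
val columnsToSum = List("col1", "col2")
val sumExprs = columnsToSum.map(c => sum(c).cast("double").as("sum_" + c))

// Aggregate without grouping keys, i.e. over the whole DataFrame.
data.groupBy().agg(sumExprs.head, sumExprs.tail: _*).show()
// +--------+--------+
// |sum_col1|sum_col2|
// +--------+--------+
// |     9.0|    12.0|
// +--------+--------+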
Answer by shankarj67
Simply apply the aggregation function sum on your column:
df.groupBy().sum('steps').show()
Follow the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Check out this link as well: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Answer by Marcos
Not sure this was around when this question was asked, but:
df.describe("columnName").show()
gives mean, count, stddev stats on a column. I think it returns stats for all columns if you call describe() with no arguments.
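A rough sketch of both forms (assuming a DataFrame df with a numeric "steps" column):
// Summary statistics (count, mean, stddev, min, max) for one column...
df.describe("steps").show()
// ...or for all columns at once.
df.describe().show()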
Answer by Omkar
Using a Spark SQL query, just in case it helps anyone!
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._ // needed for toDF and for the Dataset encoder used below

// Build a one-column DataFrame and register it as a temp view so it can be queried with SQL.
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps")

// Run the aggregation in SQL and pull the single result value back to the driver.
val sum = spark.sql("select sum(steps) as stepsSum from steps")
  .map(row => row.getAs[Long]("stepsSum"))
  .collect()(0)
println("steps sum = " + sum) //prints 28

