Scala: How to sum the values of one column of a DataFrame in Spark/Scala
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/37032025/
How to sum the values of one column of a dataframe in spark/scala
Asked by Ectoras
I have a DataFrame that I read from a CSV file with many columns like: timestamp, steps, heartrate, etc.
I want to sum the values of each column, for instance the total number of steps in the "steps" column.
As far as I can see, I want to use these kinds of functions: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can't understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Am I using the function sum wrongly? Do I need to use the function map first? And if yes, how?
A simple example would be very helpful! I started writing Scala recently.
Accepted answer by Alberto Bonsanto
If you want to sum all values of one column, it's more efficient to use the DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._

// Build a one-column DataFrame, then sum it by reducing over the underlying RDD.
val df = sc.parallelize(Array(10, 2, 3, 4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_ + _)
//res1 Int = 19
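For comparison, the same reduction can be written with the typed Dataset API instead of casting each Row by hand (a minimal sketch, not from the original answer; it assumes the implicits imported above are in scope):
// Convert the single column to a Dataset[Int] and reduce it.
val total: Int = df.select("steps").as[Int].reduce(_ + _)
// total: Int = 19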
Answer by Daniel de Paula
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you can get all the aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
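For example, with a small made-up DataFrame (hypothetical column names; this sketch assumes spark.implicits._ and org.apache.spark.sql.functions._ are in scope), the individual sums can be read back from the returned Row:
// sum over an integer column returns a Long, sum over a double column returns a Double.
val df2 = Seq((1, 10.0), (2, 20.0), (3, 30.0)).toDF("col1", "col2")
val totals = df2.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2")).first
val sumCol1 = totals.getLong(0)    // 6
val sumCol2 = totals.getDouble(1)  // 60.0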
Edit 2:
For dynamically applying the aggregations, the following options are available (an end-to-end sketch follows the list):
- Applying to all numeric columns at once:
df.groupBy().sum()
- Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
- Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
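Putting the last option together end to end (a minimal sketch; the data and column names are invented for illustration, and spark.implicits._ plus org.apache.spark.sql.functions._ are assumed to be in scope):
// Build a small DataFrame with two numeric columns.
val data = Seq((1, 2.0), (3, 4.0), (5, 6.0)).toDF("col1", "col2")

// Dynamically build one sum expression per column, cast to double and aliased.
val columnsToSum = List("col1", "col2")
val sumExprs = columnsToSum.map(c => sum(c).cast("double").as("sum_" + c))

// Aggregate without grouping keys, i.e. over the whole DataFrame.
data.groupBy().agg(sumExprs.head, sumExprs.tail: _*).show()
// +--------+--------+
// |sum_col1|sum_col2|
// +--------+--------+
// |     9.0|    12.0|
// +--------+--------+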
Answer by shankarj67
Simply apply the aggregation function sum on your column:
df.groupBy().sum('steps').show()
Follow the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Check out this link as well: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Answer by Marcos
Not sure this was around when this question was asked, but:
df.describe("columnName").show()
gives mean, count, stddev stats on a column. I think it returns stats for all columns if you call describe() with no arguments.
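A rough sketch of both forms (assuming a DataFrame df with a numeric "steps" column):
// Summary statistics (count, mean, stddev, min, max) for one column...
df.describe("steps").show()
// ...or for all columns at once.
df.describe().show()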
Answer by Omkar
Using a Spark SQL query, just in case it helps anyone!
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._ // needed for toDF and for the Dataset encoder used below

// Build a one-column DataFrame and register it as a temp view so it can be queried with SQL.
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps")

// Run the aggregation in SQL and pull the single result value back to the driver.
val sum = spark.sql("select sum(steps) as stepsSum from steps")
  .map(row => row.getAs[Long]("stepsSum"))
  .collect()(0)
println("steps sum = " + sum) //prints 28

