Get min and max from a specific column of a Scala Spark DataFrame
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/43232363/
get min and max from a specific column scala spark dataframe
Asked by Laure D
I would like to access the min and max of a specific column in my DataFrame, but I don't have the column's header, just its number. How should I do this using Scala?
Maybe something like this:
import scala.util.Random.nextInt

val q = nextInt(ncol) // we pick a random value for a column number
val col = df(q)       // doesn't compile: df(...) expects a column name, not an index
val minimum = col.min()
Sorry if this sounds like a silly question, but I couldn't find any info on SO about it :/
Answered by Justin Pihony
How about getting the column name from the metadata:
import org.apache.spark.sql.functions.{min, max}

val selectedColumnName = df.columns(q) // pull the (q + 1)th column from the columns array
df.agg(min(selectedColumnName), max(selectedColumnName))
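A self-contained sketch of this approach (the local SparkSession setup and the sample data are illustrative assumptions, not part of the original answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{min, max}

val spark = SparkSession.builder().master("local[*]").appName("minmax").getOrCreate()
import spark.implicits._

val df = Seq((2.0, 2.1), (1.2, 1.4)).toDF("A", "B")
val q = 1 // illustrative column index; in the question it comes from nextInt(ncol)

// Look up the column name by index, then aggregate on that name
val selectedColumnName = df.columns(q)
df.agg(min(selectedColumnName), max(selectedColumnName)).show()
// expected values: min(B) = 1.4, max(B) = 2.1
```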
Answered by Tautvydas
You can use pattern matching while assigning variable:
import org.apache.spark.sql.functions.{min, max}
import org.apache.spark.sql.Row
val Row(minValue: Double, maxValue: Double) = df.agg(min(q), max(q)).head
Where q is either a Column or a column name (String). This assumes your data type is Double.
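A runnable sketch of this pattern-matching extraction (the SparkSession setup and sample data are illustrative assumptions; the match throws a MatchError at runtime if the column type is not Double):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{min, max}
import org.apache.spark.sql.Row

val spark = SparkSession.builder().master("local[*]").appName("rowmatch").getOrCreate()
import spark.implicits._

val df = Seq((2.0, 2.1), (1.2, 1.4)).toDF("A", "B")
val q = "A" // a column name; a Column such as df("A") also works

// Destructure the single aggregate Row directly into typed variables
val Row(minValue: Double, maxValue: Double) = df.agg(min(q), max(q)).head
// expected values: minValue = 1.2, maxValue = 2.0
```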
Answered by Psidom
You can use the column number to extract the column name first (by indexing df.columns), then aggregate using the column name:
import org.apache.spark.sql.functions.{min, max}

val df = Seq((2.0, 2.1), (1.2, 1.4)).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: double, B: double]
df.agg(max(df(df.columns(1))), min(df(df.columns(1)))).show
+------+------+
|max(B)|min(B)|
+------+------+
|   2.1|   1.4|
+------+------+
Answered by stackoverflowuser2010
Here is a direct way to get the min and max from a dataframe with column names:
val df = Seq((1, 2), (3, 4), (5, 6)).toDF("A", "B")
df.show()
/*
+---+---+
|  A|  B|
+---+---+
|  1|  2|
|  3|  4|
|  5|  6|
+---+---+
*/
df.agg(min("A"), max("A")).show()
/*
+------+------+
|min(A)|max(A)|
+------+------+
|     1|     5|
+------+------+
*/
If you want to get the min and max values as separate variables, you can convert the result of agg() above into a Row and use Row.getInt(index) to get the column values of the Row.
val min_max = df.agg(min("A"), max("A")).head()
// min_max: org.apache.spark.sql.Row = [1,5]
val col_min = min_max.getInt(0)
// col_min: Int = 1
val col_max = min_max.getInt(1)
// col_max: Int = 5
Answered by Aman Sehgal
Using the Spark functions min and max, you can find the min or max value of any column in a DataFrame.
import org.apache.spark.sql.functions.{min, max}
val df = Seq((5, 2), (10, 1)).toDF("A", "B")
df.agg(max($"A"), min($"B")).show()
/*
+------+------+
|max(A)|min(B)|
+------+------+
|    10|     1|
+------+------+
*/
Answered by Priyanshu Singh
Hope this will help
import org.apache.spark.sql.functions.{min, max, sum}

val sales = sc.parallelize(List(
   ("West",  "Apple",  2.0, 10),
   ("West",  "Apple",  3.0, 15),
   ("West",  "Orange", 5.0, 15),
   ("South", "Orange", 3.0, 9),
   ("South", "Orange", 6.0, 18),
   ("East",  "Milk",   5.0, 5)))
val salesDf = sales.toDF("store", "product", "amount", "quantity")
salesDf.createOrReplaceTempView("sales") // registerTempTable is deprecated since Spark 2.0
val result = spark.sql("SELECT store, product, SUM(amount), MIN(amount), MAX(amount), SUM(quantity) FROM sales GROUP BY store, product")
// OR
salesDf.groupBy("store", "product").agg(min("amount"), max("amount"), sum("amount"), sum("quantity")).show
// output:
+-----+-------+-----------+-----------+-----------+-------------+
|store|product|min(amount)|max(amount)|sum(amount)|sum(quantity)|
+-----+-------+-----------+-----------+-----------+-------------+
|South| Orange|        3.0|        6.0|        9.0|           27|
| West| Orange|        5.0|        5.0|        5.0|           15|
| East|   Milk|        5.0|        5.0|        5.0|            5|
| West|  Apple|        2.0|        3.0|        5.0|           25|
+-----+-------+-----------+-----------+-----------+-------------+
Answered by ForeverLearner
In Java, we have to explicitly reference org.apache.spark.sql.functions, which has implementations of min and max:
import org.apache.spark.sql.functions;
// ...
datasetFreq.agg(functions.min("Frequency"), functions.max("Frequency")).show();