Original source: http://stackoverflow.com/questions/33393815/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Count instances of combination of columns in spark dataframe using scala
Asked by Dean
I have a Spark DataFrame in Scala called df with two columns, a and b. Column a contains letters and column b contains numbers, as shown below.
a b
----------
g 0
f 0
g 0
f 1
I can get the distinct rows using
val dfDistinct=df.select("a","b").distinct
which gives the following:
a b
----------
g 0
f 0
f 1
I want to add another column with the number of times each distinct combination occurs in the first DataFrame, so I'd end up with
a b count
----------
g 0 2
f 0 1
f 1 1
I don't mind if that modifies the original command or I have a separate operation on dfDistinct giving another data frame.
Any advice greatly appreciated and I apologise for the trivial nature of this question but I'm not the most experienced with this kind of operation in scala or spark.
Thanks
Dean
Answered by zero323
You can simply aggregate and count:
df.groupBy($"a", $"b").count
or a little bit more verbose:
import org.apache.spark.sql.functions.{count, lit}
df.groupBy($"a", $"b").agg(count(lit(1)).alias("cnt"))
Both are equivalent to a raw SQL aggregation:
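// Spark 1.x API; registerTempTable is deprecated since Spark 2.0,
// where df.createOrReplaceTempView("df") and spark.sql(...) are used instead.
df.registerTempTable("df")
sqlContext.sql("SELECT a, b, COUNT(1) AS cnt FROM df GROUP BY a, b")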
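For readers on Spark 2.0 or later, the three variants above can be tried end to end with a sketch like the following. It assumes a local SparkSession and rebuilds the sample data from the question; the object name CountCombinations and the helper countCombinations are hypothetical names chosen for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, lit}

// Hypothetical wrapper object, used only for this illustration.
object CountCombinations {

  // Long-form result: one row per distinct (a, b) pair plus a "count" column.
  def countCombinations(df: DataFrame): DataFrame =
    df.groupBy("a", "b").count()

  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.0+ running locally.
    val spark = SparkSession.builder()
      .appName("count-combinations")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables toDF and the $"col" syntax

    // Sample data taken from the question.
    val df = Seq(("g", 0), ("f", 0), ("g", 0), ("f", 1)).toDF("a", "b")

    // 1. Short form.
    countCombinations(df).show()

    // 2. Verbose form with an explicit alias for the count column.
    df.groupBy($"a", $"b").agg(count(lit(1)).alias("cnt")).show()

    // 3. Raw SQL, using the Spark 2.x temp view API.
    df.createOrReplaceTempView("df")
    spark.sql("SELECT a, b, COUNT(1) AS cnt FROM df GROUP BY a, b").show()

    spark.stop()
  }
}
```

The order of the rows in each result is not guaranteed; add an orderBy if a stable order matters.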
Answered by oluies
Also see Cross Tabulation
// toDF needs the SQL implicits in scope (spark-shell imports them automatically).
val g = "g"
val f = "f"
val df = Seq(
  (g, "0"),
  (f, "0"),
  (g, "0"),
  (f, "1")
).toDF("a", "b")
val res = df.stat.crosstab("a","b")
res.show
+---+---+---+
|a_b|  0|  1|
+---+---+---+
|  g|  2|  0|
|  f|  1|  1|
+---+---+---+
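Note that the crosstab result is in wide form (one column per value of b, with zeros for combinations that never occur), whereas the question asks for long form (one row per combination that does occur). The difference can be sketched with plain Scala collections, without Spark. The object and function names below are hypothetical and only mimic the shape of the two results; this is not how Spark implements either operation.

```scala
// Hypothetical helper object, plain Scala only (no Spark dependency).
object CrosstabSketch {

  // Long form: one entry per (a, b) combination that actually occurs.
  def pairCounts(rows: Seq[(String, String)]): Map[(String, String), Int] =
    rows.groupBy(identity).map { case (pair, hits) => pair -> hits.size }

  // Wide form: one row per value of a, one cell per value of b,
  // with 0 filled in for combinations that never occur.
  def crosstab(rows: Seq[(String, String)]): Map[String, Map[String, Int]] = {
    val counts  = pairCounts(rows)
    val bValues = rows.map(_._2).distinct
    rows.map(_._1).distinct.map { a =>
      a -> bValues.map(b => b -> counts.getOrElse((a, b), 0)).toMap
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Same sample data as in the answer above.
    val rows = Seq(("g", "0"), ("f", "0"), ("g", "0"), ("f", "1"))
    println(pairCounts(rows)) // long form, as requested in the question
    println(crosstab(rows))   // wide form, as returned by df.stat.crosstab
  }
}
```

If the long form is what is needed, the groupBy/count approach is the more direct fit; crosstab is convenient when a contingency table is the goal.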

