
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/33393815/


Count instances of combination of columns in spark dataframe using scala

scala | apache-spark | dataframe

Asked by Dean

I have a Spark data frame in Scala called df with two columns, say a and b. Column a contains letters and column b contains numbers, as shown below.


   a   b
----------
   g   0
   f   0
   g   0
   f   1

I can get the distinct rows using


val dfDistinct = df.select("a", "b").distinct

which gives the following:


   a  b
----------
   g   0
   f   0
   f   1

I want to add another column with the number of times each of these distinct combinations occurs in the first dataframe, so I'd end up with


   a   b  count
----------------
   g   0    2
   f   0    1
   f   1    1

I don't mind if that modifies the original command or I have a separate operation on dfDistinct giving another data frame.


Any advice is greatly appreciated, and I apologise for the trivial nature of this question, but I'm not the most experienced with this kind of operation in Scala or Spark.


Thanks


Dean


Answered by zero323

You can simply aggregate and count:


df.groupBy($"a", $"b").count

or a little bit more verbose:


import org.apache.spark.sql.functions.{count, lit}

df.groupBy($"a", $"b").agg(count(lit(1)).alias("cnt"))

Both are equivalent to a raw SQL aggregation:


df.registerTempTable("df") // in Spark 2.x+, use df.createOrReplaceTempView("df")

sqlContext.sql("SELECT a, b, COUNT(1) AS cnt FROM df GROUP BY a, b")
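To make the aggregation concrete, here is a minimal sketch using plain Scala collections (not Spark) of what the groupBy/count computes: identical (a, b) pairs are grouped together and each group's size becomes the count. The variable names are illustrative, not from the answer.

```scala
// Group identical (a, b) pairs and count each group's size,
// mirroring df.groupBy($"a", $"b").count on collections.
val rows = Seq(("g", 0), ("f", 0), ("g", 0), ("f", 1))
val counts = rows
  .groupBy(identity)                                   // one group per distinct (a, b)
  .map { case ((a, b), group) => (a, b, group.size) }  // one result row per group
  .toSet
// counts == Set(("g", 0, 2), ("f", 0, 1), ("f", 1, 1))
```

Spark performs the same logical grouping, but distributed across partitions with a partial-aggregation step, so no full shuffle of raw rows is needed.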

Answered by oluies

Also see Cross Tabulation


// toDF requires the SQLContext implicits (imported automatically in spark-shell):
// import sqlContext.implicits._
val g = "g"
val f = "f"
val df = Seq(
  (g, "0"),
  (f, "0"),
  (g, "0"),
  (f, "1")
).toDF("a", "b")
val res = df.stat.crosstab("a", "b")
res.show

+---+---+---+
|a_b|  0|  1|
+---+---+---+
|  g|  2|  0|
|  f|  1|  1|
+---+---+---+
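Conceptually, crosstab builds a contingency table: for each value of column a, it counts the occurrences of each value of column b, pivoting b's values into column headers. A plain-Scala sketch (not Spark; names are illustrative) of that computation:

```scala
// For each value of a, count occurrences of each value of b:
// a nested groupBy, like the pivoted table crosstab produces.
val rows = Seq(("g", "0"), ("f", "0"), ("g", "0"), ("f", "1"))
val crosstab: Map[String, Map[String, Int]] =
  rows.groupBy(_._1).map { case (a, group) =>
    a -> group.groupBy(_._2).map { case (b, cell) => b -> cell.size }
  }
// crosstab("g") == Map("0" -> 2)            // no ("g", "1") rows exist
// crosstab("f") == Map("0" -> 1, "1" -> 1)
```

One difference from this sketch: Spark's crosstab fills absent combinations with an explicit 0 (as in the `g`/`1` cell above), whereas the nested map here simply omits them.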