
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/33393815/


Count instances of combination of columns in spark dataframe using scala

scala | apache-spark | dataframe

Asked by Dean

I have a Spark data frame in Scala called df with two columns, say a and b. Column a contains letters and column b contains numbers, as shown below.


   a   b
----------
   g   0
   f   0
   g   0
   f   1

I can get the distinct rows using


val dfDistinct = df.select("a", "b").distinct

which gives the following:


   a  b
----------
   g   0
   f   0
   f   1

I want to add another column with the number of times each of these distinct combinations occurs in the first dataframe, so I'd end up with


   a   b  count
----------------
   g   0    2
   f   0    1
   f   1    1

I don't mind if that modifies the original command or I have a separate operation on dfDistinct giving another data frame.


Any advice is greatly appreciated, and I apologise for the trivial nature of this question, but I'm not the most experienced with this kind of operation in Scala or Spark.


Thanks


Dean


Answered by zero323

You can simply aggregate and count:


df.groupBy($"a", $"b").count

or a little bit more verbose:


import org.apache.spark.sql.functions.{count, lit}

df.groupBy($"a", $"b").agg(count(lit(1)).alias("cnt"))

Both are equivalent to a raw SQL aggregation:


df.registerTempTable("df") // in Spark 2.x+, use df.createOrReplaceTempView("df")

sqlContext.sql("SELECT a, b, COUNT(1) AS cnt FROM df GROUP BY a, b")
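To make the aggregation concrete, here is a minimal sketch using plain Scala collections (not Spark) of what the groupBy/count computes: identical (a, b) pairs are grouped together and each group's size becomes the count. The variable names are illustrative, not from the answer.

```scala
// Group identical (a, b) pairs and count each group's size,
// mirroring df.groupBy($"a", $"b").count on collections.
val rows = Seq(("g", 0), ("f", 0), ("g", 0), ("f", 1))
val counts = rows
  .groupBy(identity)                                   // one group per distinct (a, b)
  .map { case ((a, b), group) => (a, b, group.size) }  // one result row per group
  .toSet
// counts == Set(("g", 0, 2), ("f", 0, 1), ("f", 1, 1))
```

Spark performs the same logical grouping, but distributed across partitions with a partial-aggregation step, so no full shuffle of raw rows is needed.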

Answered by oluies

Also see Cross Tabulation


// toDF requires the SQLContext implicits (imported automatically in spark-shell):
// import sqlContext.implicits._
val g = "g"
val f = "f"
val df = Seq(
  (g, "0"),
  (f, "0"),
  (g, "0"),
  (f, "1")
).toDF("a", "b")
val res = df.stat.crosstab("a", "b")
res.show

+---+---+---+
|a_b|  0|  1|
+---+---+---+
|  g|  2|  0|
|  f|  1|  1|
+---+---+---+
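Conceptually, crosstab builds a contingency table: for each value of column a, it counts the occurrences of each value of column b, pivoting b's values into column headers. A plain-Scala sketch (not Spark; names are illustrative) of that computation:

```scala
// For each value of a, count occurrences of each value of b:
// a nested groupBy, like the pivoted table crosstab produces.
val rows = Seq(("g", "0"), ("f", "0"), ("g", "0"), ("f", "1"))
val crosstab: Map[String, Map[String, Int]] =
  rows.groupBy(_._1).map { case (a, group) =>
    a -> group.groupBy(_._2).map { case (b, cell) => b -> cell.size }
  }
// crosstab("g") == Map("0" -> 2)            // no ("g", "1") rows exist
// crosstab("f") == Map("0" -> 1, "1" -> 1)
```

One difference from this sketch: Spark's crosstab fills absent combinations with an explicit 0 (as in the `g`/`1` cell above), whereas the nested map here simply omits them.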