dataframe: how to groupBy/count then filter on count in Scala

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow

Original URL: http://stackoverflow.com/questions/32119936/
Asked by user3646671
Spark 1.4.1
I ran into a situation where grouping a DataFrame, then counting and filtering on the 'count' column, raises the exception below:
import sqlContext.implicits._
import org.apache.spark.sql._

case class Paf(x: Int)
val myData = Seq(Paf(2), Paf(1), Paf(2))
val df = sc.parallelize(myData, 2).toDF()
Then grouping and filtering:
df.groupBy("x").count()
.filter("count >= 2")
.show()
Throws an exception:
java.lang.RuntimeException: [1.7] failure: ``('' expected but `>=' found count >= 2
Solution:
Renaming the column makes the problem vanish (I suspect because there is then no conflict with the SQL 'count' function):
df.groupBy("x").count()
.withColumnRenamed("count", "n")
.filter("n >= 2")
.show()
So, is that behavior to be expected, a bug, or is there a canonical way to work around it?
thanks, alex
Answered by Herman
When you pass a string to the filter function, the string is interpreted as SQL. count is a SQL keyword, and using count as a column name confuses the parser. This is a small bug (you can file a JIRA ticket if you want to).
You can easily avoid this by using a column expression instead of a String:
df.groupBy("x").count()
.filter($"count" >= 2)
.show()
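
If the $ interpolator is not in scope (it comes from sqlContext.implicits._), the col helper from org.apache.spark.sql.functions works the same way; a minimal sketch:

import org.apache.spark.sql.functions.col

// col("count") builds a Column expression, so the string is never parsed as SQL
df.groupBy("x").count()
  .filter(col("count") >= 2)
  .show()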
Answered by zero323
So, is that behavior to be expected, a bug
Truth be told, I am not sure. It looks like the parser is interpreting count not as a column name but as a function, and expects parentheses to follow. It looks like a bug, or at least a serious limitation of the parser.
is there a canonical way to work around it?
Some options have already been mentioned by Herman and mattinbits, so here is a more SQL-ish approach from me:
import org.apache.spark.sql.functions.count

df.groupBy("x").agg(count("*").alias("cnt")).where($"cnt" >= 2)
Answered by mattinbits
I think a solution is to put count in backticks:
.filter("`count` >= 2")

