dataframe: how to groupBy/count then filter on count in Scala

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow

Original URL: http://stackoverflow.com/questions/32119936/
Asked by user3646671
Spark 1.4.1
I ran into a situation where grouping a DataFrame, then counting and filtering on the 'count' column, raises the exception below:
import sqlContext.implicits._
import org.apache.spark.sql._

case class Paf(x: Int)
val myData = Seq(Paf(2), Paf(1), Paf(2))
val df = sc.parallelize(myData, 2).toDF()
Then grouping and filtering:
df.groupBy("x").count()
.filter("count >= 2")
.show()
Throws an exception:
java.lang.RuntimeException: [1.7] failure: ``('' expected but `>=' found count >= 2
Solution:
Renaming the column makes the problem vanish (I suspect because there is then no conflict with the SQL 'count' function):
df.groupBy("x").count()
.withColumnRenamed("count", "n")
.filter("n >= 2")
.show()
So, is that behavior to be expected, a bug, or is there a canonical way to work around it?
thanks, alex
Answered by Herman
When you pass a string to the filter function, the string is interpreted as SQL. count is a SQL keyword, and using count as a column name confuses the parser. This is a small bug (you can file a JIRA ticket if you want to).
You can easily avoid this by using a column expression instead of a String:
df.groupBy("x").count()
.filter($"count" >= 2)
.show()
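
If the $ interpolator is not in scope (it comes from sqlContext.implicits._), the col helper from org.apache.spark.sql.functions works the same way; a minimal sketch:

import org.apache.spark.sql.functions.col

// col("count") builds a Column expression, so the string is never parsed as SQL
df.groupBy("x").count()
  .filter(col("count") >= 2)
  .show()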
Answered by zero323
So, is that behavior to be expected, a bug
Truth be told, I am not sure. It looks like the parser is interpreting count not as a column name but as a function, and expects parentheses to follow. It looks like a bug, or at least a serious limitation of the parser.
is there a canonical way to work around it?
Some options have already been mentioned by Herman and mattinbits, so here is a more SQL-ish approach from me:
import org.apache.spark.sql.functions.count

df.groupBy("x").agg(count("*").alias("cnt")).where($"cnt" >= 2)
Answered by mattinbits
I think a solution is to put count in backticks:
.filter("`count` >= 2")

