scala - select multiple elements with group by in spark.sql
Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/41421675/
select multiple elements with group by in spark.sql
Asked by rahul
Is there any way to do a group by on a table in Spark SQL that selects multiple elements? The code I am using:
val df = spark.read.json("//path")
df.createOrReplaceTempView("GETBYID")
Now I do the group by like this:
val sqlDF = spark.sql(
  "SELECT count(customerId) FROM GETBYID group by customerId");
But when I try:
val sqlDF = spark.sql(
  "SELECT count(customerId),customerId,userId FROM GETBYID group by customerId");
Spark gives an error:
org.apache.spark.sql.AnalysisException: expression 'getbyid.userId' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
Is there any possible way to do that?
Answered by Mariusz
Yes, it's possible, and the error message you attached describes all the possibilities. You can either add the userId to the group by:
val sqlDF = spark.sql("SELECT count(customerId),customerId,userId FROM GETBYID group by customerId, userId");
or use first():
val sqlDF = spark.sql("SELECT count(customerId),customerId,first(userId) FROM GETBYID group by customerId");
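For reference, the same aggregation can also be written with the DataFrame API instead of SQL; a minimal sketch, assuming the df loaded from the JSON above (apiDF is just an illustrative name):

import org.apache.spark.sql.functions.{count, first}

// Group by customerId, count the rows per group, and keep an arbitrary userId,
// mirroring the first() variant of the SQL query above
val apiDF = df.groupBy("customerId")
  .agg(count("customerId").as("count"), first("userId").as("userId"))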
Answered by Farah
And if you want to keep all the occurrences of userId, you can do this:
spark.sql("SELECT count(customerId), customerId, collect_list(userId) FROM GETBYID group by customerId")
By using collect_list.
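If duplicate userId values are not needed, collect_set can be used in place of collect_list to keep only distinct values; a minimal sketch against the same GETBYID view (distinctDF is just an illustrative name):

// collect_set keeps only the distinct userId values for each customer
val distinctDF = spark.sql(
  "SELECT count(customerId), customerId, collect_set(userId) FROM GETBYID group by customerId")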

