Scala: select multiple elements with group by in spark.sql

Note: this page reproduces a popular StackOverflow question and its answers, licensed under CC BY-SA 4.0. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/41421675/

Time: 2020-10-22 08:58:55  Source: igfitidea

select multiple elements with group by in spark.sql

Tags: scala, apache-spark, apache-spark-sql, bigdata

Asked by rahul

Is there any way to group a table in Spark SQL and still select multiple columns? The code I am using:

val df = spark.read.json("//path")
df.createOrReplaceTempView("GETBYID")

Now I group by like this:

val sqlDF = spark.sql(
  "SELECT count(customerId) FROM GETBYID group by customerId");

But when I try:

val sqlDF = spark.sql(
  "SELECT count(customerId),customerId,userId FROM GETBYID group by customerId");

Spark gives an error:

org.apache.spark.sql.AnalysisException: expression 'getbyid.userId' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;


Is there any way to do that?

Answered by Mariusz

Yes, it's possible, and the error message you attached describes all the possibilities. You can either add userId to the group by:

val sqlDF = spark.sql("SELECT count(customerId),customerId,userId FROM GETBYID group by customerId, userId");

or use first():


val sqlDF = spark.sql("SELECT count(customerId),customerId,first(userId) FROM GETBYID group by customerId");
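The two options return differently shaped results, which a small run makes visible. The following is a minimal sketch, assuming a local SparkSession and three made-up rows in place of the JSON file from the question; the column aliases (cnt, anyUserId) and the sample values are illustrative only. It is written as top-level statements, so it can be pasted into spark-shell.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("group-by-options-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the JSON data in the question.
Seq(("c1", "u1"), ("c1", "u2"), ("c2", "u3"))
  .toDF("customerId", "userId")
  .createOrReplaceTempView("GETBYID")

// Option 1: grouping by both columns gives one row per distinct
// (customerId, userId) pair, so the count is per pair.
val byPair = spark.sql(
  "SELECT count(customerId) AS cnt, customerId, userId " +
    "FROM GETBYID GROUP BY customerId, userId")

// Option 2: grouping by customerId only gives one row per customer;
// first(userId) keeps a single userId from each group.
val byCustomer = spark.sql(
  "SELECT count(customerId) AS cnt, customerId, first(userId) AS anyUserId " +
    "FROM GETBYID GROUP BY customerId")

byPair.show()
byCustomer.show()
```

With this sample data, the first query yields three rows and the second yields two. Which userId first() returns for a group is not guaranteed, so it only fits the case where any value will do, as the error message itself says.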

Answered by Farah

And if you want to keep all the occurrences of userId, you can do this:

spark.sql("SELECT count(customerId), customerId, collect_list(userId) FROM GETBYID group by customerId")

This uses collect_list, which gathers every userId in each group into an array (duplicates included).

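The same aggregation can also be written with the DataFrame API instead of a SQL string. This is a sketch under assumptions, not part of the original answers: the sample rows and the aliases (cnt, userIds) are made up, and a local SparkSession is created so the snippet can be pasted into spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, count}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows; the real data comes from the JSON file in the question.
val df = Seq(("c1", "u1"), ("c1", "u2"), ("c2", "u3"))
  .toDF("customerId", "userId")

// One row per customerId, with the row count and an array of all userId values.
val grouped = df
  .groupBy("customerId")
  .agg(
    count(col("customerId")).as("cnt"),
    collect_list(col("userId")).as("userIds"))

grouped.show(false)
```

The order of elements inside the collected array is not guaranteed, so code that depends on it should sort the array explicitly.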