SQL: unable to select top 10 records per group in Spark SQL
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/44923353/
Unable to select top 10 records per group in Spark SQL
Asked by Teju Priya
Hi, I am new to Spark SQL. I have a data frame like this:
+-----+----------+-------+-----+------+----+
|tagid| timestamp|listner|orgid|org2id|RSSI|
+-----+----------+-------+-----+------+----+
|    4|1496745912|    362|    4|     3|0.60|
|    4|1496745924|   1901|    4|     3|0.60|
|    4|1496746030|   1901|    4|     3|0.60|
|    4|1496746110|    718|    4|     3|0.30|
|    2|1496746128|    718|    4|     3|0.60|
|    2|1496746188|   1901|    4|     3|0.10|
+-----+----------+-------+-----+------+----+
I want to select, for each listner, the top 10 timestamp values in Spark SQL.
I tried the following queries, without success:
val avg = sqlContext.sql("select top 10 * from avg_table") // throws a parse error: Spark SQL has no "top" keyword
val avg = sqlContext.sql("select rssi, timestamp, tagid from avg_table order by timestamp desc limit 10") // runs, but returns only 10 records for the whole table, not 10 per listner
For each listner I need to take the top 10 timestamp values. Any help will be appreciated.
Answered by Gordon Linoff
Doesn't this work?
select rssi, timestamp, tagid
from avg_table
order by timestamp desc
limit 10;
EDIT:

Oh, I get it. You want row_number():
select rssi, timestamp, tagid
from (select a.*,
row_number() over (partition by listner order by timestamp desc) as seqnum
from avg_table
) a
where seqnum <= 10
order by a.timestamp desc;
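The window-function approach can be tried outside Spark as well. Below is a minimal sketch using Python's built-in sqlite3 module as a stand-in (SQLite 3.25+ supports the same row_number() over (partition by ...) syntax); the table and column names mirror the question, and the cutoff is lowered from 10 to 2 so the six sample rows are enough to see the per-listner grouping:

```python
import sqlite3

# Illustrative sketch: the accepted row_number() query, run against the
# question's sample data in in-memory SQLite (not Spark).
con = sqlite3.connect(":memory:")
con.executescript("""
create table avg_table (tagid int, timestamp int, listner int, rssi real);
insert into avg_table values
  (4, 1496745912,  362, 0.60),
  (4, 1496745924, 1901, 0.60),
  (4, 1496746030, 1901, 0.60),
  (4, 1496746110,  718, 0.30),
  (2, 1496746128,  718, 0.60),
  (2, 1496746188, 1901, 0.10);
""")

rows = con.execute("""
    select listner, timestamp, rssi
    from (select a.*,
                 row_number() over (partition by listner
                                    order by timestamp desc) as seqnum
          from avg_table a)
    where seqnum <= 2            -- 10 in the original question
    order by listner, timestamp desc
""").fetchall()

for r in rows:
    print(r)   # at most 2 rows per listner, newest timestamps first
```

Listner 362 contributes only one row, while 718 and 1901 each contribute their two newest timestamps.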
Answered by bajran
Here we can use dense_rank() as well:
select *
from (select *,
             dense_rank() over (partition by listner order by timestamp desc) as rank
      from avg_table
     )
where rank <= 10;
The difference between dense_rank() and row_number() is that dense_rank() assigns the same rank to rows whose values tie on the ordering column (within the partition), whereas row_number() assigns a unique row number to every row, even when the values tie.
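The difference is easy to see on tied values. A small sketch, again using Python's sqlite3 as an illustrative stand-in for Spark SQL, with a hypothetical table t that contains a duplicated timestamp:

```python
import sqlite3

# Two rows share timestamp 100: dense_rank() gives both rank 1,
# while row_number() still numbers them 1 and 2.
con = sqlite3.connect(":memory:")
con.execute("create table t (listner int, timestamp int)")
con.executemany("insert into t values (?, ?)",
                [(1, 100), (1, 100), (1, 90)])

rows = con.execute("""
    select timestamp,
           dense_rank() over (partition by listner order by timestamp desc) as dr,
           row_number() over (partition by listner order by timestamp desc) as rn
    from t
""").fetchall()

for ts, dr, rn in rows:
    print(ts, dr, rn)
```

With ties, `rank <= 10` under dense_rank() can therefore return more than 10 rows per listner, while `seqnum <= 10` under row_number() returns exactly 10 at most.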
Thanks
Answered by Yehor Krivokon
Use "limit" in your query (limit 10 in your case).
EXAMPLE: sqlContext.sql("SELECT text FROM yourTable LIMIT 10")
Or you can select everything from your table into a DataFrame or Dataset (or an RDD, but then you need to call rdd.toDS() or toDF() first) and then just call the show(10) method.
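Note that LIMIT caps the total number of rows in the result, applied after ordering, so this answers "top 10 overall" rather than "top 10 per listner". A small sqlite3 sketch of that behavior (illustrative only; names follow the question, with 3 listners of 5 rows each and the cutoff lowered to 4):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table avg_table (listner int, timestamp int)")
con.executemany("insert into avg_table values (?, ?)",
                [(l, ts) for l in (362, 718, 1901) for ts in range(5)])

# LIMIT applies to the whole ordered result set: we get the 4 largest
# timestamps overall, not 4 per listner.
rows = con.execute(
    "select listner, timestamp from avg_table order by timestamp desc limit 4"
).fetchall()
print(rows)   # 4 rows total, regardless of how many listners exist
```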