SQL HIVE 中的 LIMIT 子句真的是随机的吗?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/23802115/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Is LIMIT clause in HIVE really random?
提问by visakh
The documentationof HIVE
notes that LIMIT
clause returns rows chosen at random
. I have been running a SELECT
table on a table with more than 800,000
records with LIMIT 1
, but it always return me the same record.
该文件中HIVE
指出,LIMIT
条款returns rows chosen at random
。我一直SELECT
在一个包含多个800,000
记录的表上运行一个表LIMIT 1
,但它总是返回相同的记录。
I'm using the Shark
distribution, and I am wondering whether this has got anything to do with this not expected behavior? Any thoughts would be appreciated.
我正在使用该Shark
发行版,我想知道这是否与这种非预期行为有关?任何想法将不胜感激。
Thanks, Visakh
谢谢,维萨赫
采纳答案by user3036342
Even though the documentation states it returns rows at random, it's not actually true.
尽管文档说明它随机返回行,但实际上并非如此。
It returns "chosen rows at random" as it appears in the database without any where/order by clause. This means that it's not really random (or randomly chosen) as you would think, just that the order the rows are returned in can't be determined.
它返回“随机选择的行”,因为它出现在数据库中,没有任何 where/order by 子句。这意味着它并不像您想象的那样真正随机(或随机选择),只是无法确定返回行的顺序。
As soon as you slap a order by x DESC limit 5
on there, it returns the last 5 rows of whatever you're selecting from.
一旦你order by x DESC limit 5
在那里打了一个耳光,它就会返回你选择的任何东西的最后 5 行。
To get rows returned at random, you would need to use something like: order by rand() LIMIT 1
要随机返回行,您需要使用以下内容: order by rand() LIMIT 1
However it can have a speed impact if your indexes aren't setup properly. Usually I do a min/max to get the ID's on the table, and then do a random number between them, then select those records (in your case, would be just 1 record), which tends to be faster than having the database do the work, especially on a large dataset
但是,如果您的索引设置不正确,它可能会对速度产生影响。通常我做一个最小/最大来获取表上的 ID,然后在它们之间做一个随机数,然后选择这些记录(在你的情况下,只有 1 条记录),这往往比让数据库更快工作,尤其是在大型数据集上
回答by Keith
To be safe you want to use
为了安全起见,你想使用
select * from table
distribute by rand()
sort by rand()
limit 10000;
从表中选择 *
通过 rand() 分配
按 rand() 排序
限制 10000;
回答by Rick
The documentation may have been updated since this question was originally posted in 2014, but as of December,2017, the documentation now reads, "The following query returns 5 arbitrary customers".
自此问题最初于 2014 年发布以来,文档可能已更新,但截至 2017 年 12 月,文档现在显示为“以下查询返回 5 个任意客户”。
In this case, "arbitrary" means method of selecting either is not deterministic or may not be worth the trouble to document. In other words, you shouldn't count on it as a reliable method for getting specific subset of records (e.g., for sampling). You should only use the Limit clause without an Order By clause if you are looking for expediency and want to get a small result set as quickly as possible (e.g., for QA purposes). Otherwise, use one of Order By, Cluster By, or Distribute By/Sort By as appropriate.
在这种情况下,“任意”意味着选择的方法不是确定性的,或者可能不值得麻烦记录。换句话说,您不应该指望它是获取特定记录子集(例如,用于采样)的可靠方法。如果您正在寻找权宜之计并希望尽快获得一个小的结果集(例如,为了 QA 目的),您应该只使用 Limit 子句而不使用 Order By 子句。否则,根据需要使用 Order By、Cluster By 或 Distribute By/Sort By 之一。