SQL HIVE 中的 LIMIT 子句真的是随机的吗?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/23802115/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 01:55:49  来源:igfitidea点击:

Is LIMIT clause in HIVE really random?

sqlhivehiveqlshark-sql

提问by visakh

The documentationof HIVEnotes that LIMITclause returns rows chosen at random. I have been running a SELECTtable on a table with more than 800,000records with LIMIT 1, but it always return me the same record.

文件HIVE指出,LIMIT条款returns rows chosen at random。我一直SELECT在一个包含多个800,000记录的表上运行一个表LIMIT 1,但它总是返回相同的记录。

I'm using the Sharkdistribution, and I am wondering whether this has got anything to do with this not expected behavior? Any thoughts would be appreciated.

我正在使用该Shark发行版,我想知道这是否与这种非预期行为有关?任何想法将不胜感激。

Thanks, Visakh

谢谢,维萨赫

采纳答案by user3036342

Even though the documentation states it returns rows at random, it's not actually true.

尽管文档说明它随机返回行,但实际上并非如此。

It returns "chosen rows at random" as it appears in the database without any where/order by clause. This means that it's not really random (or randomly chosen) as you would think, just that the order the rows are returned in can't be determined.

它返回“随机选择的行”,因为它出现在数据库中,没有任何 where/order by 子句。这意味着它并不像您想象的那样真正随机(或随机选择),只是无法确定返回行的顺序。

As soon as you slap a order by x DESC limit 5on there, it returns the last 5 rows of whatever you're selecting from.

一旦你order by x DESC limit 5在那里打了一个耳光,它就会返回你选择的任何东西的最后 5 行。

To get rows returned at random, you would need to use something like: order by rand() LIMIT 1

要随机返回行,您需要使用以下内容: order by rand() LIMIT 1

However it can have a speed impact if your indexes aren't setup properly. Usually I do a min/max to get the ID's on the table, and then do a random number between them, then select those records (in your case, would be just 1 record), which tends to be faster than having the database do the work, especially on a large dataset

但是,如果您的索引设置不正确,它可能会对速度产生影响。通常我做一个最小/最大来获取表上的 ID,然后在它们之间做一个随机数,然后选择这些记录(在你的情况下,只有 1 条记录),这往往比让数据库更快工作,尤其是在大型数据集上

回答by Keith

To be safe you want to use

为了安全起见,你想使用

select * from table

distribute by rand()

sort by rand()

limit 10000;

从表中选择 *

通过 rand() 分配

按 rand() 排序

限制 10000;

回答by Rick

The documentation may have been updated since this question was originally posted in 2014, but as of December,2017, the documentation now reads, "The following query returns 5 arbitrary customers".

自此问题最初于 2014 年发布以来,文档可能已更新,但截至 2017 年 12 月,文档现在显示为“以下查询返回 5 个任意客户”。

In this case, "arbitrary" means method of selecting either is not deterministic or may not be worth the trouble to document. In other words, you shouldn't count on it as a reliable method for getting specific subset of records (e.g., for sampling). You should only use the Limit clause without an Order By clause if you are looking for expediency and want to get a small result set as quickly as possible (e.g., for QA purposes). Otherwise, use one of Order By, Cluster By, or Distribute By/Sort By as appropriate.

在这种情况下,“任意”意味着选择的方法不是确定性的,或者可能不值得麻烦记录。换句话说,您不应该指望它是获取特定记录子集(例如,用于采样)的可靠方法。如果您正在寻找权宜之计并希望尽快获得一个小的结果集(例如,为了 QA 目的),您应该只使用 Limit 子句而不使用 Order By 子句。否则,根据需要使用 Order By、Cluster By 或 Distribute By/Sort By 之一。