SQL HIVE 中的 LIMIT 子句真的是随机的吗？

Question

提问by visakh

The documentationof HIVEnotes that LIMITclause returns rows chosen at random. I have been running a SELECTtable on a table with more than 800,000records with LIMIT 1, but it always return me the same record.

该文件中HIVE指出，LIMIT条款returns rows chosen at random。我一直SELECT在一个包含多个800,000记录的表上运行一个表LIMIT 1，但它总是返回相同的记录。

I'm using the Sharkdistribution, and I am wondering whether this has got anything to do with this not expected behavior? Any thoughts would be appreciated.

我正在使用该Shark发行版，我想知道这是否与这种非预期行为有关？任何想法将不胜感激。

Thanks, Visakh

谢谢，维萨赫

Answer 1

采纳答案by user3036342

Even though the documentation states it returns rows at random, it's not actually true.

尽管文档说明它随机返回行，但实际上并非如此。

It returns "chosen rows at random" as it appears in the database without any where/order by clause. This means that it's not really random (or randomly chosen) as you would think, just that the order the rows are returned in can't be determined.

它返回“随机选择的行”，因为它出现在数据库中，没有任何 where/order by 子句。这意味着它并不像您想象的那样真正随机（或随机选择），只是无法确定返回行的顺序。

As soon as you slap a order by x DESC limit 5on there, it returns the last 5 rows of whatever you're selecting from.

一旦你order by x DESC limit 5在那里打了一个耳光，它就会返回你选择的任何东西的最后 5 行。

To get rows returned at random, you would need to use something like: order by rand() LIMIT 1

要随机返回行，您需要使用以下内容： order by rand() LIMIT 1

However it can have a speed impact if your indexes aren't setup properly. Usually I do a min/max to get the ID's on the table, and then do a random number between them, then select those records (in your case, would be just 1 record), which tends to be faster than having the database do the work, especially on a large dataset

但是，如果您的索引设置不正确，它可能会对速度产生影响。通常我做一个最小/最大来获取表上的 ID，然后在它们之间做一个随机数，然后选择这些记录（在你的情况下，只有 1 条记录），这往往比让数据库更快工作，尤其是在大型数据集上

Answer 2

回答by Keith

To be safe you want to use

为了安全起见，你想使用

select * from table
distribute by rand()
sort by rand()
limit 10000;

从表中选择 *
通过 rand() 分配
按 rand() 排序
限制 10000；

Answer 3

回答by Rick

The documentation may have been updated since this question was originally posted in 2014, but as of December,2017, the documentation now reads, "The following query returns 5 arbitrary customers".

自此问题最初于 2014 年发布以来，文档可能已更新，但截至 2017 年 12 月，文档现在显示为“以下查询返回 5 个任意客户”。

In this case, "arbitrary" means method of selecting either is not deterministic or may not be worth the trouble to document. In other words, you shouldn't count on it as a reliable method for getting specific subset of records (e.g., for sampling). You should only use the Limit clause without an Order By clause if you are looking for expediency and want to get a small result set as quickly as possible (e.g., for QA purposes). Otherwise, use one of Order By, Cluster By, or Distribute By/Sort By as appropriate.

在这种情况下，“任意”意味着选择的方法不是确定性的，或者可能不值得麻烦记录。换句话说，您不应该指望它是获取特定记录子集（例如，用于采样）的可靠方法。如果您正在寻找权宜之计并希望尽快获得一个小的结果集（例如，为了 QA 目的），您应该只使用 Limit 子句而不使用 Order By 子句。否则，根据需要使用 Order By、Cluster By 或 Distribute By/Sort By 之一。

SQL HIVE 中的 LIMIT 子句真的是随机的吗？

提问by visakh

采纳答案by user3036342

回答by Keith

回答by Rick

相关推荐

最近更新

标签

SQL HIVE 中的 LIMIT 子句真的是随机的吗？

提问by visakh

采纳答案by user3036342

回答by Keith

回答by Rick

相关推荐

用于查找无效电子邮件地址的 Sql 脚本

SQL PostgreSQL“IF”语法错误

通过 Oracle SQL 查询将行中的列的逗号分隔值拆分

在 T-SQL 中连接到远程服务器 (SQL server 2008)

相关推荐

最近更新

标签