How does MySQL's ORDER BY RAND() work?

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/2663710/

Time: 2020-08-31 15:49:03  Source: igfitidea

Tags: mysql, select, random

Asked by Eugene

I've been doing some research and testing on how to do fast random selection in MySQL. In the process I've faced some unexpected results and now I am not fully sure I know how ORDER BY RAND() really works.

I always thought that when you do ORDER BY RAND() on a table, MySQL adds a new column filled with random values, sorts the data by that column, and then you take, e.g., the top row, which got there randomly. I've done lots of googling and testing and finally found that the query Jay offers in his blog is indeed the fastest solution:

SELECT * FROM Table T JOIN (SELECT CEIL(MAX(ID)*RAND()) AS ID FROM Table) AS x ON T.ID >= x.ID LIMIT 1;

While a common ORDER BY RAND() takes 30-40 seconds on my test table, his query does the work in 0.1 seconds. He explains how it works in his blog, so I'll skip that and move on to the odd thing.
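
For readers who don't have the blog post at hand, here is an annotated sketch of the same idea. It is not Jay's exact text: the table name `users` is hypothetical, and the explicit `ORDER BY T.id` is an addition that makes "first row at or after the threshold" independent of the chosen access path.

```sql
-- Hypothetical table `users` with an integer PRIMARY KEY `id`.
SELECT T.*
FROM users AS T
JOIN (
    -- The aggregate yields a single row: one random threshold
    -- between 0 and MAX(id).
    SELECT CEIL(MAX(id) * RAND()) AS id FROM users
) AS x ON T.id >= x.id
-- Take the first existing row at or after the threshold.
ORDER BY T.id
LIMIT 1;
```

Because only one random number is generated and the lookup can use the primary key, no full sort is needed. The trade-off is that the result is uniform only when the ids have no gaps: a row that follows a gap is picked more often than its neighbours.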

My table is a common table with a PRIMARY KEY id and other non-indexed stuff like username, age, etc. Here's the thing I am struggling to explain:

SELECT * FROM table ORDER BY RAND() LIMIT 1; /*30-40 seconds*/
SELECT id FROM table ORDER BY RAND() LIMIT 1; /*0.25 seconds*/
SELECT id, username FROM table ORDER BY RAND() LIMIT 1; /*90 seconds*/

I was sort of expecting to see approximately the same time for all three queries since I am always sorting on a single column. But for some reason this didn't happen. Please let me know if you have any ideas about this. I have a project where I need to do a fast ORDER BY RAND(), and personally I would prefer to use

SELECT id FROM table ORDER BY RAND() LIMIT 1;
SELECT * FROM table WHERE id=ID_FROM_PREVIOUS_QUERY LIMIT 1;

which, yes, is slower than Jay's method, however it is smaller and easier to understand. My queries are rather big ones with several JOINs and a WHERE clause, and while Jay's method still works, the query grows really big and complex because I need to repeat all the JOINs and the WHERE clause in the joined subquery (called x in his query).
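
As a side note, the two statements above can be folded into a single one, so the application does not have to pass the id between queries. This is only a sketch: `users` is a hypothetical table name, and the derived-table alias `picked` is made up for illustration.

```sql
-- Hypothetical table `users` with PRIMARY KEY `id`.
SELECT t.*
FROM users AS t
JOIN (
    -- Sort only the ids randomly and keep one of them.
    SELECT id FROM users ORDER BY RAND() LIMIT 1
) AS picked ON t.id = picked.id;
```

The inner query still sorts every id, so this is not expected to beat Jay's method; it only keeps the full rows out of the sort.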

Thanks for your time!

Accepted answer by Tor Valamo

While there's no such thing as a "fast order by rand()", there is a workaround for your specific task.

For getting any single random row, you can do what this German blogger does: http://www.roberthartung.de/mysql-order-by-rand-a-case-study-of-alternatives/ (I couldn't see a hotlink URL. If anyone sees one, feel free to edit the link.)

The text is in German, but the SQL code is a bit down the page and in big white boxes, so it's not hard to see.

Basically, what he does is write a procedure that does the job of getting a valid row. It generates a random number between 0 and max_id and tries fetching a row; if that row doesn't exist, it keeps going until it hits one that does. He allows for fetching x random rows by storing them in a temp table, so you can probably rewrite the procedure to be a bit faster when fetching only one row.
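
A single-row version of such a procedure might look roughly like the sketch below. This is not the blogger's code: the table name `users`, the procedure name `get_random_row`, and the `v_` variables are all hypothetical, and it assumes positive integer ids starting at 1.

```sql
-- DELIMITER is a mysql client command, needed so the procedure body
-- can contain semicolons.
DELIMITER //

CREATE PROCEDURE get_random_row()
BEGIN
    DECLARE v_max_id BIGINT;
    DECLARE v_candidate BIGINT;
    DECLARE v_hits INT DEFAULT 0;

    SELECT MAX(id) INTO v_max_id FROM users;

    -- An empty table gives NULL; skip the loop in that case.
    IF v_max_id IS NOT NULL THEN
        -- Keep guessing ids until one of them actually exists.
        WHILE v_hits = 0 DO
            SET v_candidate = FLOOR(1 + RAND() * v_max_id);
            SELECT COUNT(*) INTO v_hits FROM users WHERE id = v_candidate;
        END WHILE;

        SELECT * FROM users WHERE id = v_candidate;
    END IF;
END //

DELIMITER ;

CALL get_random_row();
```

Each guess is a primary-key lookup, so the cost depends on how many guesses miss, not on the table size.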

The downside of this is that if you delete A LOT of rows, and there are huge gaps, the chances are big that it will miss tons of times, making it ineffective.

Update: Different execution times

SELECT * FROM table ORDER BY RAND() LIMIT 1; /*30-40 seconds*/
SELECT id FROM table ORDER BY RAND() LIMIT 1; /*0.25 seconds*/
SELECT id, username FROM table ORDER BY RAND() LIMIT 1; /*90 seconds*/

I was sort of expecting to see approximately the same time for all three queries since I am always sorting on a single column. But for some reason this didn't happen. Please let me know if you have any ideas about this.

It may have to do with indexing. id is indexed and quick to access, whereas adding username to the result means it needs to read that from each row and put it in the memory table. With the * it also has to read everything into memory, but it doesn't need to jump around the data file, meaning there's no time lost seeking.

This makes a difference only if there are variable length columns (varchar/text), which means it has to check the length, then skip that length, as opposed to just skipping a set length (or 0) between each row.

Answer by Andrey Frolov

It may have to do with indexing. id is indexed and quick to access, whereas adding username to the result means it needs to read that from each row and put it in the memory table. With the * it also has to read everything into memory, but it doesn't need to jump around the data file, meaning there's no time lost seeking. This makes a difference only if there are variable length columns, which means it has to check the length, then skip that length, as opposed to just skipping a set length (or 0) between each row.

Practice is better than all theories! Why not just check the plans? :)

mysql> explain select name from avatar order by RAND() limit 1;
+----+-------------+--------+-------+---------------+-----------------+---------+------+-------+----------------------------------------------+
| id | select_type | table  | type  | possible_keys | key             | key_len | ref  | rows  | Extra                                        |
+----+-------------+--------+-------+---------------+-----------------+---------+------+-------+----------------------------------------------+
|  1 | SIMPLE      | avatar | index | NULL          | IDX_AVATAR_NAME | 302     | NULL | 30062 | Using index; Using temporary; Using filesort |
+----+-------------+--------+-------+---------------+-----------------+---------+------+-------+----------------------------------------------+
1 row in set (0.00 sec)

mysql> explain select * from avatar order by RAND() limit 1;
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows  | Extra                           |
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
|  1 | SIMPLE      | avatar | ALL  | NULL          | NULL | NULL    | NULL | 30062 | Using temporary; Using filesort |
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
1 row in set (0.00 sec)

mysql> explain select name, experience from avatar order by RAND() limit 1;
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows  | Extra                           |
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
|  1 | SIMPLE      | avatar | ALL  | NULL          | NULL | NULL    | NULL | 30064 | Using temporary; Using filesort |
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+

Answer by jmoz

Why don't you add an index on (id, username) to the table and see if that forces MySQL to use the index rather than just a filesort and temp table?

Answer by newtover

I can tell you why SELECT id FROM ... is much faster than the other two, but I am not sure why SELECT id, username is 2-3 times slower than SELECT *.

When you have an index (the primary key in your case) and the result includes only columns from that index, the MySQL optimizer is able to use the data from the index alone and does not even look into the table itself. The more expensive each row is, the greater the effect you will observe, since you replace filesystem IO operations with pure in-memory operations. If you add an additional index on (id, username), you will get similar performance in the third case as well.
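
A minimal way to try this out, assuming a hypothetical table `users` and a made-up index name `idx_id_username`:

```sql
-- Hypothetical table and index names; adjust to your schema.
ALTER TABLE users ADD INDEX idx_id_username (id, username);

-- Check which access path the optimizer now picks.
EXPLAIN SELECT id, username FROM users ORDER BY RAND() LIMIT 1;
```

If the optimizer chooses the new index, the Extra column should include "Using index", as in the first plan in Andrey Frolov's answer. "Using temporary; Using filesort" will most likely still appear, because the random sort itself does not go away. Note also that if username needs a prefix length in the index (for example a TEXT column), the index cannot cover the query.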