Note: this page is a bilingual (Chinese/English) translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4514697/

Date: 2020-08-31 18:05:30  Source: igfitidea

MySQL "IN" operator performance on (large?) number of values

Tags: mysql, sql, performance, operators

Asked by Michael van Rooijen

I have been experimenting with Redis and MongoDB lately, and it seems there are often cases where you would store an array of IDs in either MongoDB or Redis. I'll stick with Redis for this question, since I am asking about the MySQL IN operator.

I was wondering how performant it is to list a large number (300-3000) of IDs inside the IN operator, which would look something like this:

SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 3000)

Imagine something as simple as a products and a categories table, which you might normally JOIN together to get the products from a certain category. In the example above, a given category key in Redis (category:4:product_ids) returns all the product IDs for the category with id 4, and I place them inside the IN operator of the SELECT query above.

How performant is this?

Is this an "it depends" situation? Or is there a concrete "this is (un)acceptable", "fast", or "slow" answer? Should I add a LIMIT 25, or doesn't that help?

SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 3000)
LIMIT 25

Or should I trim the array of product IDs returned by Redis down to 25 and add only those 25 IDs to the query, rather than passing all 3000 and LIMIT-ing the result to 25 inside the query?

SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 25)
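
To make the "trim first" option concrete, here is a minimal sketch. It is written in Python with the standard-library sqlite3 module standing in for MySQL so it runs without a server; the fetch_products helper and the sample data are made up for illustration, and only the products table and its columns come from the question.

```python
import sqlite3

def fetch_products(conn, product_ids, limit=25):
    """Trim the ID list in the application, then query only those IDs."""
    ids = list(product_ids)[:limit]
    if not ids:
        return []  # "IN ()" is not valid SQL, so skip the query entirely
    # One "?" placeholder per ID keeps the values out of the SQL text.
    placeholders = ", ".join(["?"] * len(ids))
    sql = (
        "SELECT id, name, price FROM products "
        f"WHERE id IN ({placeholders}) ORDER BY id"
    )
    return conn.execute(sql, ids).fetchall()

# Hypothetical sample data: 3000 products with IDs 1..3000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(i, f"product {i}", i * 1.5) for i in range(1, 3001)],
)

rows = fetch_products(conn, range(1, 3001), limit=25)
print(len(rows), rows[0][0], rows[-1][0])
```

Note that the two approaches are not quite the same thing: trimming the list returns at most 25 rows and fewer if some of those IDs no longer exist, while LIMIT 25 over the full list returns 25 rows whenever at least 25 of the IDs match.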

Any suggestions/feedback is much appreciated!

Accepted answer by Jonathan Leffler

Generally speaking, if the IN list gets too large (for some ill-defined value of 'too large' that is usually in the region of 100 or smaller), it becomes more efficient to use a join, creating a temporary table if need be to hold the numbers.

If the numbers are a dense set (no gaps - which the sample data suggests), then you can do even better with WHERE id BETWEEN 300 AND 3000.

However, presumably there are gaps in the set, at which point it may be better to go with the list of valid values after all, unless the gaps are relatively few in number, in which case you could use:

WHERE id BETWEEN 300 AND 3000 AND id NOT BETWEEN 742 AND 836

Or whatever the gaps are.
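
If the IDs come from the application, the ranges can be worked out there. The following is only a sketch in Python: to_ranges and range_predicate are hypothetical helper names, and it emits an OR of BETWEEN runs rather than the BETWEEN ... AND NOT BETWEEN form above, which selects the same rows.

```python
def to_ranges(ids):
    """Collapse integers into a list of inclusive (start, end) runs."""
    ranges = []
    for n in sorted(set(ids)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], n)  # extend the current run
        else:
            ranges.append((n, n))            # start a new run
    return ranges

def range_predicate(ids, column="id"):
    """Build a WHERE fragment with "?" placeholders plus its parameters.

    `column` is assumed to be a trusted identifier, not user input.
    """
    parts, params = [], []
    for start, end in to_ranges(ids):
        if start == end:
            parts.append(f"{column} = ?")
            params.append(start)
        else:
            parts.append(f"{column} BETWEEN ? AND ?")
            params.extend((start, end))
    if not parts:
        return "(1 = 0)", []  # empty ID list: match nothing
    return "(" + " OR ".join(parts) + ")", params

# The gap from the example above: 300..3000 without 742..836.
ids = [n for n in range(300, 3001) if not 742 <= n <= 836]
clause, params = range_predicate(ids)
print(clause, params)
```

This only pays off when the set really is mostly dense; with many small gaps the predicate ends up no shorter than the plain IN list.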

Answer by jbaylina

I have been doing some tests, and as David Fells says in his answer, it is quite well optimized. As a reference, I created an InnoDB table with 1,000,000 rows and ran a SELECT using the "IN" operator with about 500,000 randomly chosen IDs; it takes only 2.5 seconds on my Mac. Selecting half of the rows by ID parity (id % 2) takes 0.5 seconds.

The only problem I had is that I had to increase the max_allowed_packet parameter in the my.cnf file. If you don't, a mysterious "MySQL server has gone away" error is generated.
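
To get a feel for why the packet limit matters, you can estimate the size of the statement text before sending it. This is a rough sketch in Python, not part of my test: in_query_size and fits_in_packet are hypothetical helpers, the 1 MiB limit is just an example value (check your server's actual max_allowed_packet), and the estimate ignores protocol overhead.

```python
def in_query_size(ids, prefix="SELECT count(*) FROM `testtable` WHERE id IN "):
    """Return the byte length of the full statement text for an IN list."""
    in_list = "(" + ",".join(str(i) for i in ids) + ")"
    return len((prefix + in_list).encode("utf-8"))

def fits_in_packet(ids, max_allowed_packet):
    """True if the statement text alone is within the given limit (bytes)."""
    return in_query_size(ids) <= max_allowed_packet

# About 500,000 IDs, as in the test: every even ID below 1,000,000.
ids = range(0, 1_000_000, 2)
example_limit = 1024 * 1024  # assumed example value, not a stated default

print(in_query_size(ids), fits_in_packet(ids, example_limit))
```

With six-digit IDs each value costs about seven bytes including the comma, so half a million of them come to a few megabytes of SQL text.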

Here is the PHP code that I used to run the test:

$NROWS =1000000;
$SELECTED = 50;
$NROWSINSERT =15000;

$dsn="mysql:host=localhost;port=8889;dbname=testschema";
$pdo = new PDO($dsn, "root", "root");
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec("DROP TABLE IF EXISTS `testtable`");
$pdo->exec("CREATE  TABLE `testtable` (
        `id` INT NOT NULL ,
        `text` VARCHAR(45) NULL ,
        PRIMARY KEY (`id`) )");

$before = microtime(true);

$Values='';
$SelValues='(';
$c=0;
for ($i=0; $i<$NROWS; $i++) {
    $r = rand(0,99);
    if ($c>0) $Values .= ",";
    $Values .= "( $i , 'This is value $i and r= $r')";
    if ($r<$SELECTED) {
        if ($SelValues!="(") $SelValues .= ",";
        $SelValues .= $i;
    }
    $c++;

    if (($c==100)||(($i==$NROWS-1)&&($c>0))) {
        $pdo->exec("INSERT INTO `testtable` VALUES $Values");
        $Values = "";
        $c=0;
    }
}
$SelValues .=')';
echo "<br>";


$after = microtime(true);
echo "Insert execution time =" . ($after-$before) . "s<br>";

$before = microtime(true);  
$sql = "SELECT count(*) FROM `testtable` WHERE id IN $SelValues";
$result = $pdo->prepare($sql);  
$after = microtime(true);
echo "Prepare execution time =" . ($after-$before) . "s<br>";

$before = microtime(true);

$result->execute();
$c = $result->fetchColumn();

$after = microtime(true);
echo "Random selection = $c Time execution time =" . ($after-$before) . "s<br>";



$before = microtime(true);

$sql = "SELECT count(*) FROM `testtable` WHERE id %2 = 1";
$result = $pdo->prepare($sql);
$result->execute();
$c = $result->fetchColumn();

$after = microtime(true);
echo "Pairs = $c Exdcution time=" . ($after-$before) . "s<br>";

And the results:

Insert execution time =35.2927210331s
Prepare execution time =0.0161771774292s
Random selection = 499102 Time execution time =2.40285992622s
Pairs = 500000 Exdcution time=0.465420007706s

Answer by Vladimir Jotov

You can create a temporary table, put any number of IDs into it, and run a nested query. Example:

CREATE [TEMPORARY] TABLE tmp_IDs (`ID` INT NOT NULL,PRIMARY KEY (`ID`));

and select:

SELECT id, name, price
FROM products
WHERE id IN (SELECT ID FROM tmp_IDs);
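
Here is a runnable sketch of the whole round trip: fill the temporary table, then select through it. It uses Python with the standard-library sqlite3 module as a stand-in for MySQL, and the sample products and the wanted ID list are made up. It also runs the JOIN form for comparison, since both return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(i, f"product {i}", i * 1.5) for i in range(1, 5001)],
)

wanted = range(1, 5001, 3)  # hypothetical ID list from the application

# The primary key on the temporary table gives the lookup an index.
conn.execute("CREATE TEMPORARY TABLE tmp_IDs (ID INTEGER NOT NULL PRIMARY KEY)")
conn.executemany("INSERT INTO tmp_IDs (ID) VALUES (?)", ((i,) for i in wanted))

via_subquery = conn.execute(
    "SELECT id, name, price FROM products "
    "WHERE id IN (SELECT ID FROM tmp_IDs) ORDER BY id"
).fetchall()

via_join = conn.execute(
    "SELECT p.id, p.name, p.price FROM products AS p "
    "JOIN tmp_IDs AS t ON t.ID = p.id ORDER BY p.id"
).fetchall()

print(len(via_subquery), via_subquery == via_join)
```

Which of the two forms is faster depends on the engine and version, so it is worth checking with EXPLAIN on your own data.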

Answer by David Fells

IN is fine, and well optimized. Make sure you use it on an indexed field and you're fine.

It's functionally equivalent to:

(x = 1 OR x = 2 OR x = 3 ... OR x = 99)

As far as the DB engine is concerned.
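
You can check the equivalence of the results with a small sketch like this one. It uses Python with the standard-library sqlite3 module rather than MySQL, and the table t and its contents are made up; it only shows that both forms select the same rows, not how any particular engine plans them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)", ((i,) for i in range(1, 201)))

values = list(range(1, 100))  # 1..99, as in the expression above

# x IN (?, ?, ...)
in_sql = "SELECT x FROM t WHERE x IN ({}) ORDER BY x".format(
    ", ".join(["?"] * len(values))
)
# (x = ? OR x = ? OR ...)
or_sql = "SELECT x FROM t WHERE ({}) ORDER BY x".format(
    " OR ".join(["x = ?"] * len(values))
)

in_rows = conn.execute(in_sql, values).fetchall()
or_rows = conn.execute(or_sql, values).fetchall()
print(len(in_rows), in_rows == or_rows)
```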

Answer by yoyodunno

Using IN with a large parameter set on a large table will in fact be slow.

In the case that I solved recently I had two WHERE clauses, one with 2,50 parameters and the other with 3,500 parameters, querying a table of 40 million records.

My query took 5 minutes using the standard WHERE IN. By instead using a subquery for the IN statement (putting the parameters in their own indexed table), I got the query down to TWO seconds.

In my experience, this worked for both MySQL and Oracle.

Answer by Jarekczek

When you provide many values for the IN operator, it must first sort them to remove duplicates. At least, I suspect so. So it is not a good idea to provide too many values, as sorting takes N log N time.

In my experience, slicing the set of values into smaller subsets and combining the results of all the queries in the application gives the best performance. I admit that I gathered this experience on a different database (Pervasive), but the same may apply to other engines. My count of values per set was 500-1000; using more or fewer was significantly slower.
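
The slicing approach could look roughly like this. It is a sketch in Python with the standard-library sqlite3 module standing in for the real database; chunked and fetch_in_chunks are hypothetical helpers, the products table is borrowed from the question, and the chunk size of 500 is simply the low end of the range mentioned above.

```python
import sqlite3

def chunked(seq, size):
    """Yield consecutive slices of `seq` with at most `size` items each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

def fetch_in_chunks(conn, ids, chunk_size=500):
    """Run one IN query per chunk and combine the rows in the application."""
    unique_ids = list(dict.fromkeys(ids))  # drop duplicates, keep order
    rows = []
    for chunk in chunked(unique_ids, chunk_size):
        placeholders = ", ".join(["?"] * len(chunk))
        sql = f"SELECT id, name, price FROM products WHERE id IN ({placeholders})"
        rows.extend(conn.execute(sql, chunk))
    return rows

# Hypothetical sample data: 5000 products, of which 3000 are requested.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(i, f"product {i}", i * 1.5) for i in range(1, 5001)],
)

rows = fetch_in_chunks(conn, range(1, 3001), chunk_size=500)
print(len(rows))
```

The IDs are de-duplicated up front because the same ID landing in two different chunks would otherwise return its row twice. The best chunk size is something to measure on your own engine and data.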