Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/332300/

Time: 2020-08-31 12:27:24  Source: igfitidea

Is there a REAL performance difference between INT and VARCHAR primary keys?

Tags: mysql, performance, primary-key, innodb, myisam

Asked by Jake McGraw

Is there a measurable performance difference between using INT vs. VARCHAR as a primary key in MySQL? I'd like to use VARCHAR as the primary key for reference lists (think US States, Country Codes) and a coworker won't budge on the INT AUTO_INCREMENT as a primary key for all tables.

My argument, as detailed here, is that the performance difference between INT and VARCHAR is negligible: every INT foreign key reference requires a JOIN to make sense of the reference, whereas a VARCHAR key presents the information directly.
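
To make the trade-off concrete, here is a minimal sketch of the two designs. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table and column names (state, customer_int, customer_str) are hypothetical. It only illustrates the shape of the queries, not their speed.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Surrogate-key design: the referencing row stores an opaque integer.
CREATE TABLE state (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE customer_int (name TEXT, state_id INTEGER REFERENCES state(id));

-- Natural-key design: the referencing row stores the readable value itself.
CREATE TABLE customer_str (name TEXT, state_name TEXT);

INSERT INTO state VALUES (5, 'California');
INSERT INTO customer_int VALUES ('Ann', 5);
INSERT INTO customer_str VALUES ('Ann', 'California');
""")

# With the integer key, a JOIN is needed to turn 5 into 'California'.
with_join = conn.execute(
    "SELECT c.name, s.name FROM customer_int c JOIN state s ON s.id = c.state_id"
).fetchall()

# With the string key, the row already carries the readable value.
without_join = conn.execute(
    "SELECT name, state_name FROM customer_str"
).fetchall()

print(with_join, without_join)
```

Both queries return the same row; the difference is only whether a second table has to be consulted.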

So, does anyone have experience with this particular use-case and the performance concerns associated with it?

Accepted answer by Bill Karwin

You make a good point that you can avoid some number of joined queries by using what's called a natural key instead of a surrogate key. Only you can assess if the benefit of this is significant in your application.

That is, you can measure the queries in your application that are the most important to be speedy, because they work with large volumes of data or they are executed very frequently. If these queries benefit from eliminating a join, and do not suffer by using a varchar primary key, then do it.

Don't use either strategy for all tables in your database. It's likely that in some cases, a natural key is better, but in other cases a surrogate key is better.

Other folks make a good point that it's rare in practice for a natural key to never change or have duplicates, so surrogate keys are usually worthwhile.

Answer by Steve McLeod

It's not about performance. It's about what makes a good primary key: unique and unchanging over time. You may think an entity such as a country code never changes over time and would be a good candidate for a primary key. But bitter experience shows that this is seldom so.

INT AUTO_INCREMENT meets the "unique and unchanging over time" condition. Hence the preference.

Answer by Jan Żankowski

I was a bit annoyed by the lack of benchmarks for this online, so I ran a test myself.

Note though that I don't do this on a regular basis, so please check my setup and steps for any factors that could have influenced the results unintentionally, and post your concerns in the comments.

The setup was as follows:

  • Intel® Core™ i7-7500U CPU @ 2.70GHz × 4
  • 15.6 GiB RAM, of which I ensured around 8 GB was free during the test.
  • 148.6 GB SSD drive, with plenty of free space.
  • Ubuntu 16.04 64-bit
  • MySQL Ver 14.14 Distrib 5.7.20, for Linux (x86_64)

The tables:

create table jan_int (data1 varchar(255), data2 int(10), myindex tinyint(4)) ENGINE=InnoDB;
create table jan_int_index (data1 varchar(255), data2 int(10), myindex tinyint(4), INDEX (myindex)) ENGINE=InnoDB;
create table jan_char (data1 varchar(255), data2 int(10), myindex char(6)) ENGINE=InnoDB;
create table jan_char_index (data1 varchar(255), data2 int(10), myindex char(6), INDEX (myindex)) ENGINE=InnoDB;
create table jan_varchar (data1 varchar(255), data2 int(10), myindex varchar(63)) ENGINE=InnoDB;
create table jan_varchar_index (data1 varchar(255), data2 int(10), myindex varchar(63), INDEX (myindex)) ENGINE=InnoDB;

Then, I filled 10 million rows in each table with a PHP script whose essence is like this:

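$pdo = get_pdo(); // helper not shown in the answer; returns a PDO connection

// Ten 6-character keys, cut or padded from US state names
$keys = [ 'alabam', 'massac', 'newyor', 'newham', 'delawa', 'califo', 'nevada', 'texas_', 'florid', 'ohio__' ];

for ($k = 0; $k < 10; $k++) {               // 10 batches of 1 million rows
    for ($j = 0; $j < 1000; $j++) {         // 1000 INSERT statements per batch
        $val = '';
        for ($i = 0; $i < 1000; $i++) {     // 1000 rows per INSERT statement
            $val .= '("' . generate_random_string() . '", ' . rand(0, 10000) . ', "' . ($keys[rand(0, 9)]) . '"),';
        }
        $val = rtrim($val, ',');            // drop the trailing comma
        $pdo->query('INSERT INTO jan_char VALUES ' . $val);
    }
    echo "\n" . ($k + 1) . ' million(s) rows inserted.';
}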

For the int tables, the bit ($keys[rand(0, 9)]) was replaced with just rand(0, 9), and for the varchar tables, I used full US state names, without cutting or extending them to 6 characters. generate_random_string() generates a 10-character random string.

Then I ran in MySQL:

  • SET SESSION query_cache_type=0;
  • For the jan_int table:
    • SELECT count(*) FROM jan_int WHERE myindex = 5;
    • SELECT BENCHMARK(1000000000, (SELECT count(*) FROM jan_int WHERE myindex = 5));
  • For the other tables, same as above, with myindex = 'califo' for the char tables and myindex = 'california' for the varchar tables.

Times of the BENCHMARK query on each table:

  • jan_int: 21.30 sec
  • jan_int_index: 18.79 sec
  • jan_char: 21.70 sec
  • jan_char_index: 18.85 sec
  • jan_varchar: 21.76 sec
  • jan_varchar_index: 18.86 sec

Regarding table & index sizes, here's the output of show table status from janperformancetest; (with a few columns not shown):

|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Name              | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Collation              |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| jan_int           | InnoDB |      10 | Dynamic    | 9739094 |             43 |   422510592 |               0 |            0 |   4194304 |           NULL | utf8mb4_unicode_520_ci |  
| jan_int_index     | InnoDB |      10 | Dynamic    | 9740329 |             43 |   420413440 |               0 |    132857856 |   7340032 |           NULL | utf8mb4_unicode_520_ci |   
| jan_char          | InnoDB |      10 | Dynamic    | 9726613 |             51 |   500170752 |               0 |            0 |   5242880 |           NULL | utf8mb4_unicode_520_ci |  
| jan_char_index    | InnoDB |      10 | Dynamic    | 9719059 |             52 |   513802240 |               0 |    202342400 |   5242880 |           NULL | utf8mb4_unicode_520_ci |  
| jan_varchar       | InnoDB |      10 | Dynamic    | 9722049 |             53 |   521142272 |               0 |            0 |   7340032 |           NULL | utf8mb4_unicode_520_ci |   
| jan_varchar_index | InnoDB |      10 | Dynamic    | 9738381 |             49 |   486539264 |               0 |    202375168 |   7340032 |           NULL | utf8mb4_unicode_520_ci | 
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

My conclusion is that there's no performance difference for this particular use case.
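
As a rough check on that conclusion, the reported timings and index sizes can be compared directly. This is a small sketch in Python that uses only the numbers from this answer; the dictionary and function names are illustrative, not part of the original test.

```python
# BENCHMARK timings in seconds, copied from the list above.
timings = {
    "jan_int": 21.30, "jan_int_index": 18.79,
    "jan_char": 21.70, "jan_char_index": 18.85,
    "jan_varchar": 21.76, "jan_varchar_index": 18.86,
}

# Index_length in bytes, copied from the SHOW TABLE STATUS output above.
index_bytes = {
    "jan_int_index": 132857856,
    "jan_char_index": 202342400,
    "jan_varchar_index": 202375168,
}

def pct_slower(candidate, baseline):
    """Relative slowdown of `candidate` versus `baseline`, in percent."""
    return (timings[candidate] - timings[baseline]) / timings[baseline] * 100

for name in ("jan_char", "jan_varchar"):
    print(f"{name}: {pct_slower(name, 'jan_int'):.2f}% slower than jan_int")
for name in ("jan_char_index", "jan_varchar_index"):
    print(f"{name}: {pct_slower(name, 'jan_int_index'):.2f}% slower than jan_int_index")

# How much larger the string-keyed secondary index is than the integer one.
size_ratio = index_bytes["jan_char_index"] / index_bytes["jan_int_index"]
print(f"char index is {size_ratio:.2f}x the size of the int index")
```

On these figures the string keys are about 2% slower without an index and under 0.5% slower with one, while their secondary index is roughly 1.5 times the size of the integer one.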

Answer by Charles Bretana

Depends on the length. If the varchar will be 20 characters and the int is 4 bytes, then if you use an int, your index will have FIVE times as many nodes per page of index space on disk. That means that traversing the index will require one fifth as many physical and/or logical reads.
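
The arithmetic behind that claim can be sketched as follows. The assumptions are mine, not the answer's: a 16 KiB page (the InnoDB default), single-byte characters, and no per-record overhead, child pointers, or fill factor, so the absolute numbers are optimistic. The function names are hypothetical.

```python
PAGE_SIZE = 16 * 1024  # bytes; InnoDB's default page size (assumption for this sketch)

def keys_per_page(key_bytes):
    """Upper bound on keys that fit in one page, ignoring all overhead."""
    return PAGE_SIZE // key_bytes

def index_levels(rows, key_bytes):
    """Number of B-tree levels needed to address `rows` keys at that fan-out."""
    fanout = keys_per_page(key_bytes)
    levels, capacity = 1, fanout
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels

for label, size in (("4-byte int", 4), ("20-byte varchar", 20)):
    print(label, keys_per_page(size), "keys/page,",
          index_levels(10_000_000, size), "levels for 10M rows")
```

Under these simplified assumptions the 4-byte key fits about five times as many entries per page. A full scan of the index therefore reads roughly five times fewer pages, while a single lookup on 10 million rows walks two levels instead of three, because depth grows with the logarithm of the row count.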

So, if performance is an issue, given the opportunity, always use an integral non-meaningful key (called a surrogate) for your tables, and for Foreign Keys that reference the rows in these tables...

At the same time, to guarantee data consistency, every table where it matters should also have a meaningful non-numeric alternate key (or unique index) to ensure that duplicate rows cannot be inserted (duplicates as judged by the meaningful table attributes).

For the specific use you are talking about (like state lookups), it really doesn't matter because the table is so small. In general, indexes on tables with fewer than a few thousand rows have no impact on performance.

Answer by Timothy Khouri

Absolutely not.

I have done several... several... performance checks between INT, VARCHAR, and CHAR.

A 10-million-record table with a PRIMARY KEY (unique and clustered) had the exact same speed and performance (and subtree cost) no matter which of the three I used.

That being said... use whatever is best for your application. Don't worry about the performance.

Answer by Joel Coehoorn

For short codes, there's probably no difference. This is especially true as the table holding these codes is likely to be very small (a couple thousand rows at most) and will not change often (when was the last time we added a new US state?).

For larger tables with a wider variation among the keys, this can be dangerous. Think about using the e-mail address or user name from a User table, for example. What happens when you have a few million users and some of those users have long names or e-mail addresses? Now any time you need to join this table using that key, it becomes much more expensive.

Answer by LeppyR64

As for Primary Key, whatever physically makes a row unique should be determined as the primary key.

For a reference as a foreign key, using an auto-incrementing integer as a surrogate is a nice idea for two main reasons.
- First, there's usually less overhead incurred in the join.
- Second, if you need to update the table that contains the unique varchar, then the update has to cascade down to all the child tables and update all of them as well as their indexes, whereas with the int surrogate, it only has to update the master table and its indexes.
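
The second point can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the state and address tables are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
-- Natural key: child rows store the varchar value itself.
CREATE TABLE state (code TEXT PRIMARY KEY);
CREATE TABLE address (
    id INTEGER PRIMARY KEY,
    state_code TEXT REFERENCES state(code) ON UPDATE CASCADE
);
INSERT INTO state VALUES ('CA');
INSERT INTO address (state_code) VALUES ('CA'), ('CA'), ('CA');
""")

# Changing the natural key in the parent table...
conn.execute("UPDATE state SET code = 'CAL' WHERE code = 'CA'")

# ...forces every referencing child row to be rewritten as well.
rewritten = conn.execute(
    "SELECT count(*) FROM address WHERE state_code = 'CAL'"
).fetchone()[0]
print(rewritten, "child rows were rewritten by the cascade")
```

With an integer surrogate, the same rename would touch one row in the parent table and leave the child rows alone.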

The drawback to using the surrogate is that you could possibly allow changing of the meaning of the surrogate:

ex.
id value
1 A
2 B
3 C

Update 3 to D
id value
1 A
2 B
3 D

Update 2 to C
id value
1 A
2 C
3 D

Update 3 to B
id value
1 A
2 C
3 B

It all depends on what you really need to worry about in your structure and what matters most.

Answer by Rick James

Common cases where a surrogate AUTO_INCREMENT hurts:

A common schema pattern is a many-to-many mapping:

CREATE TABLE map (
    id ... AUTO_INCREMENT,
    foo_id ...,
    bar_id ...,
    PRIMARY KEY(id),
    UNIQUE(foo_id, bar_id),
    INDEX(bar_id) );

Performance of this pattern is much better, especially when using InnoDB:

CREATE TABLE map (
    # No surrogate
    foo_id ...,
    bar_id ...,
    PRIMARY KEY(foo_id, bar_id),
    INDEX      (bar_id, foo_id) );

Why?

  • InnoDB secondary keys need an extra lookup; by moving the pair into the PK, that is avoided for one direction.
  • The secondary index is "covering", so it does not need the extra lookup.
  • This table is smaller because of getting rid of id and one index.
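
The size argument can be sketched with a back-of-the-envelope estimate. The column types are elided in the schemas above, so this assumes every column is a 4-byte INT and ignores all row and page overhead; the helper name is hypothetical. It relies on the fact that an InnoDB secondary index entry also stores the primary key columns.

```python
INT_BYTES = 4  # assumption: id, foo_id, and bar_id are all 4-byte INTs

def key_bytes_per_row(all_columns, primary_key, secondary_indexes):
    """Rough bytes stored per row across the clustered and secondary indexes."""
    total = len(all_columns) * INT_BYTES  # the clustered index holds the full row
    for index_columns in secondary_indexes:
        # Each secondary entry holds its own columns plus any missing PK columns.
        stored = list(index_columns) + [c for c in primary_key if c not in index_columns]
        total += len(stored) * INT_BYTES
    return total

with_surrogate = key_bytes_per_row(
    ["id", "foo_id", "bar_id"], ["id"], [["foo_id", "bar_id"], ["bar_id"]]
)
without_surrogate = key_bytes_per_row(
    ["foo_id", "bar_id"], ["foo_id", "bar_id"], [["bar_id", "foo_id"]]
)
print(with_surrogate, "bytes per row with the surrogate id")
print(without_surrogate, "bytes per row with the composite primary key")
```

Under these assumptions the second schema stores about half as many key bytes per mapping row, which is consistent with the "smaller" claim in the list above.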

Another case (country):

country_id INT ...
-- versus
country_code CHAR(2) CHARACTER SET ascii

All too often the novice normalizes country_code into a 4-byte INT instead of using a 'natural', nearly-unchanging 2-byte string. The natural key is faster, smaller, needs fewer JOINs, and is more readable.

Answer by Volksman

The question is about MySQL so I say there is a significant difference. If it was about Oracle (which stores numbers as string - yes, I couldn't believe it at first) then not much difference.

Storage in the table is not the issue but updating and referring to the index is. Queries involving looking up a record based on its primary key are frequent - you want them to occur as fast as possible because they happen so often.

The thing is, a CPU deals with 4-byte and 8-byte integers naturally, in silicon. It's REALLY fast for it to compare two integers - it happens in one or two clock cycles.

Now look at a string - it's made up of lots of characters (more than one byte per character these days). Comparing two strings for precedence can't be done in one or two cycles. Instead the strings' characters must be iterated until a difference is found. I'm sure there are tricks to make it faster in some databases but that's irrelevant here because an int comparison is done naturally and lightning fast in silicon by the CPU.
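
The character-by-character iteration described above can be sketched like this. It assumes a plain binary (byte-wise) comparison; a real MySQL collation such as the utf8mb4_unicode_520_ci seen elsewhere on this page compares collation weights and does more work per character. The function name is hypothetical.

```python
def compare_bytes(a: bytes, b: bytes):
    """Byte-wise ordering of two strings.

    Returns (result, examined), where result is -1, 0, or 1 and
    examined is the number of byte positions that had to be inspected.
    """
    examined = 0
    for x, y in zip(a, b):
        examined += 1
        if x != y:
            return (-1 if x < y else 1), examined
    # One string is a prefix of the other (or they are equal): the shorter sorts first.
    return (len(a) > len(b)) - (len(a) < len(b)), examined

# Keys that differ early are cheap to compare.
print(compare_bytes(b"california", b"colorado"))
# Equal keys, as in an equality lookup that matches, need every byte inspected.
print(compare_bytes(b"california", b"california"))
# An integer comparison, by contrast, is a single operation on a fixed-width value.
print(5 < 7)
```

The cost of a string comparison therefore depends on how long the shared prefix is, whereas the integer comparison costs the same regardless of the values.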

My general rule: every primary key should be an auto-incrementing INT, especially in OO apps using an ORM (Hibernate, Datanucleus, whatever) where there are lots of relationships between objects. Those relationships will usually be implemented as a simple FK, and the ability of the DB to resolve them fast is important to your app's responsiveness.

Answer by Herman J. Radtke III

At HauteLook, we changed many of our tables to use natural keys. We did experience a real-world increase in performance. As you mention, many of our queries now use fewer joins, which makes the queries more performant. We will even use a composite primary key if it makes sense. That being said, some tables are just easier to work with if they have a surrogate key.

Also, if you are letting people write interfaces to your database, a surrogate key can be helpful. The 3rd party can rely on the fact that the surrogate key will change only in very rare circumstances.
