MySQL: UUID performance in MySQL?
Original URL: http://stackoverflow.com/questions/2365132/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
UUID performance in MySQL?
Asked by Patrick Lightbody
We're considering using UUID values as primary keys for our MySQL database. The data being inserted is generated from dozens, hundreds, or even thousands of remote computers and being inserted at a rate of 100-40,000 inserts per second, and we'll never do any updates.
The database itself will typically get to around 50M records before we start to cull data, so not a massive database, but not tiny either. We're also planning to run on InnoDB, though we are open to changing that if there is a better engine for what we're doing.
We were ready to go with Java's Type 4 UUID, but in testing have been seeing some strange behavior. For one, we're storing as varchar(36) and I now realize we'd be better off using binary(16) - though how much better off I'm not sure.
The bigger question is: how badly does this random data screw up the index when we have 50M records? Would we be better off if we used, for example, a type-1 UUID where the leftmost bits were timestamped? Or maybe we should ditch UUIDs entirely and consider auto_increment primary keys?
I'm looking for general thoughts/tips on the performance of different types of UUIDs when they are stored as an index/primary key in MySQL. Thanks!
Accepted answer by Dancrumb
A UUID is a Universally Unique ID. It's the "universally" part that you should be considering here.
Do you really need the IDs to be universally unique? If so, then UUIDs may be your only choice.
I would strongly suggest that if you do use UUIDs, you store them as a number and not as a string. If you have 50M+ records, then the saving in storage space will improve your performance (although I couldn't say by how much).
If your IDs do not need to be universally unique, then I don't think that you can do much better than just using auto_increment, which guarantees that IDs will be unique within a table (since the value will increment each time).
Answer by Kat Lim Ruiz
At my job, we use UUID as PKs. What I can tell you from experience is DO NOT USE THEM as PKs (SQL Server by the way).
It's one of those things that is OK when you have fewer than 1,000 records, but when you have millions, it's the worst thing you can do. Why? Because UUIDs are not sequential, so every time a new record is inserted MSSQL needs to find the correct page to insert the record in, and then insert the record. The really ugly consequence of this is that the pages end up in different sizes and fragmented, so now we have to defragment periodically.
When you use an autoincrement, MSSQL will always go to the last page, and you end up with equally sized pages (in theory) so the performance to select those records is much better (also because the INSERTs will not block the table/page for so long).
However, the big advantage of using UUID as PKs is that if we have clusters of DBs, there will not be conflicts when merging.
I would recommend the following model:
1. PK INT Identity
2. Additional column automatically generated as UUID.
This way, the merge process is possible (UUID would be your REAL key, while the PK would just be something temporary that gives you good performance).
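A minimal sketch of that two-column model, for illustration only. It assumes an in-memory SQLite database as a stand-in for the real server so that it runs anywhere, and the table name, column names and the `insert_record` helper are all hypothetical. The integer key stays sequential and local, while the UUID column is what you would match on when merging.

```python
import sqlite3
import uuid

# SQLite in memory is a stand-in for the real server (an assumption made so
# the sketch is runnable). Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE records (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,  -- sequential local key
        uuid    BLOB NOT NULL UNIQUE,               -- 16-byte global identifier
        payload TEXT
    )
    """
)


def insert_record(payload: str) -> uuid.UUID:
    """Insert a row and return the UUID that identifies it across databases."""
    record_uuid = uuid.uuid4()
    conn.execute(
        "INSERT INTO records (uuid, payload) VALUES (?, ?)",
        (record_uuid.bytes, payload),
    )
    return record_uuid


first = insert_record("a")
second = insert_record("b")

# Local ids come out in insertion order; merge lookups go through the UUID.
rows = conn.execute("SELECT id FROM records ORDER BY id").fetchall()
found = conn.execute(
    "SELECT payload FROM records WHERE uuid = ?", (second.bytes,)
).fetchone()
```

The UNIQUE constraint gives the UUID column its own secondary index, so random values land there rather than in the primary key.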
NOTE: The best solution is to use NEWSEQUENTIALID (as I said in the comments), but for a legacy app with little time to refactor (and, even worse, no control over all inserts), that is not possible. As of 2017, though, I'd say the best solution here is NEWSEQUENTIALID or Guid.Comb with NHibernate.
Hope this helps
Answer by Kyle Rozendo
Something to take into consideration is that auto-increment values are generated one at a time and cannot be produced in parallel. The fight for using UUIDs eventually comes down to what you want to achieve versus what you potentially sacrifice.
On performance, briefly:
A UUID like the one above is 36 characters long, including dashes. If you store it as VARCHAR(36), you're going to decrease comparison performance dramatically. This is your primary key; you don't want it to be slow.
At the bit level, a UUID is 128 bits, which means it fits into 16 bytes. That is not very human readable, but it keeps storage low, and it is only 4 times larger than a 32-bit int, or 2 times larger than a 64-bit int. I would use VARBINARY(16). Theoretically, this can work without a lot of overhead.
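To make the size difference concrete, here is a small sketch using Python's standard `uuid` module (chosen only for illustration; the variable names are made up). It shows the 36-character text form next to the 16-byte binary form and confirms that the binary form converts back without loss.

```python
import uuid

u = uuid.uuid4()        # random (version 4) UUID
as_text = str(u)        # canonical text form, dashes included
as_bytes = u.bytes      # raw 128-bit value

print(len(as_text))     # 36
print(len(as_bytes))    # 16

# The binary form carries the same information and converts back losslessly.
round_trip = uuid.UUID(bytes=as_bytes)
```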
I recommend reading the following two posts:
I reckon between the two, they answer your question.
Answer by Glenn J. Schworak
I tend to avoid UUID simply because it is a pain to store and a pain to use as a primary key but there are advantages. The main one is they are UNIQUE.
I usually solve the problem and avoid UUID by using dual key fields.
COLLECTOR = UNIQUE ASSIGNED TO A MACHINE
ID = RECORD COLLECTED BY THE COLLECTOR (auto_inc field)
This offers me two things. Speed of auto-inc fields and uniqueness of data being stored in a central location after it is collected and grouped together. I also know while browsing the data where it was collected which is often quite important for my needs.
I have seen many cases while dealing with other data sets for clients where they have decided to use UUID but then still have a field for where the data was collected which really is a waste of effort. Simply using two (or more if needed) fields as your key really helps.
I have just seen too many performance hits using UUID. They feel like a cheat...
Answer by Bouke Versteegh
Instead of centrally generating unique keys for each insertion, how about allocating blocks of keys to individual servers? When they run out of keys, they can request a new block. That solves the problem of the overhead of connecting for each insert.
Keyserver maintains next available id
- Server 1 requests id block.
- Keyserver returns (1,1000)
Server 1 can insert 1000 records until it needs to request a new block
- Server 2 requests index block.
- Keyserver returns (1001,2000)
- etc...
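The exchange above can be sketched in a few lines. This is an in-process illustration under stated assumptions, not a networked service: the `KeyServer` and `Server` classes, their method names and the default block size of 1000 are all hypothetical, and a lock stands in for whatever transactional guarantee a real key server would need.

```python
import threading


class KeyServer:
    """Hands out contiguous, non-overlapping blocks of ids (hypothetical)."""

    def __init__(self, block_size: int = 1000):
        self._next_id = 1
        self._block_size = block_size
        self._lock = threading.Lock()

    def request_block(self):
        """Return an inclusive (start, end) range that no one else receives."""
        with self._lock:
            start = self._next_id
            end = start + self._block_size - 1
            self._next_id = end + 1
        return start, end


class Server:
    """Consumes ids from its current block and refills only when it is empty."""

    def __init__(self, keyserver: KeyServer):
        self._keyserver = keyserver
        self._current = 1
        self._last = 0  # empty range, so the first call requests a block

    def next_id(self) -> int:
        if self._current > self._last:
            self._current, self._last = self._keyserver.request_block()
        value = self._current
        self._current += 1
        return value


keyserver = KeyServer()
server1 = Server(keyserver)
server2 = Server(keyserver)

print(server1.next_id())  # 1
print(server2.next_id())  # 1001
```

Each server talks to the key server once per 1000 inserts instead of once per insert, which is the point of the scheme.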
You could come up with a more sophisticated version where a server could request the number of needed keys, or return unused blocks to the keyserver, which would then of course need to maintain a map of used/unused blocks.
Answer by StephenS
The short answer is that many databases have performance problems (in particular with high INSERT volumes) due to a conflict between their indexing method and UUIDs' deliberate entropy in the high-order bits. There are several common hacks:
- choose a different index type (e.g. nonclustered on MSSQL) that doesn't mind it
- munge the data to move the entropy to lower-order bits (e.g. reordering bytes of V1 UUIDs on MySQL)
- make the UUID a secondary key with an auto-increment int primary key
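As an illustration of the second hack, the sketch below rearranges the bytes of a version 1 UUID so that the most significant timestamp bits come first, which makes the binary values sort roughly in creation order. It is a sketch under stated assumptions rather than a recommended implementation: `reorder_v1`, `restore_v1` and `make_v1` are hypothetical names, and `make_v1` exists only to build UUIDs with known timestamps for the demonstration.

```python
import uuid


def reorder_v1(u: uuid.UUID) -> bytes:
    """Return 16 bytes with the timestamp fields most-significant first."""
    if u.version != 1:
        raise ValueError("expected a version 1 (time-based) UUID")
    b = u.bytes
    # Standard layout: time_low (4) | time_mid (2) | time_hi_and_version (2) | rest (8)
    return b[6:8] + b[4:6] + b[0:4] + b[8:16]


def restore_v1(b: bytes) -> uuid.UUID:
    """Undo reorder_v1 and rebuild the original UUID."""
    return uuid.UUID(bytes=b[4:8] + b[2:4] + b[0:2] + b[8:16])


def make_v1(timestamp: int, node: int = 1) -> uuid.UUID:
    """Hypothetical helper: build a version 1 UUID from a 60-bit timestamp."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, node))


# Two timestamps one tick apart, chosen so that time_low wraps around.
earlier = make_v1(0x1FFFFFFFF)
later = make_v1(0x200000000)

print(earlier.bytes < later.bytes)              # False: raw bytes sort out of order
print(reorder_v1(earlier) < reorder_v1(later))  # True: reordered bytes follow time
```

The rearrangement only helps for time-based UUIDs; a random version 4 UUID has no timestamp to move to the front.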
... but these are all hacks--and probably fragile ones at that.
The best answer, but unfortunately the slowest one, is to demand your vendor improve their product so it can deal with UUIDs as primary keys just like any other type. They shouldn't be forcing you to roll your own half-baked hack to make up for their failure to solve what has become a common use case and will only continue to grow.
Answer by Nikolai
I would assign each server a numeric ID in a transactional manner. Then, each record inserted will just autoincrement its own counter. Combination of ServerID and RecordID will be unique. ServerID field can be indexed and future select performance based on ServerID (if needed) may be much better.
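A rough sketch of the composite-key idea, assuming an in-memory SQLite database as a stand-in for the central server so that it runs anywhere. The `ServerCounter` class and the table and column names are hypothetical; the point is only that the pair (server_id, record_id) is unique even though each server counts from 1 on its own.

```python
import sqlite3

# SQLite in memory is a stand-in for the central database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE records (
        server_id INTEGER NOT NULL,
        record_id INTEGER NOT NULL,
        payload   TEXT,
        PRIMARY KEY (server_id, record_id)
    )
    """
)


class ServerCounter:
    """Hypothetical per-server counter; each server increments independently."""

    def __init__(self, server_id: int):
        self.server_id = server_id
        self._next = 1

    def next_key(self):
        key = (self.server_id, self._next)
        self._next += 1
        return key


# Two servers each insert two records without coordinating with each other.
for counter in (ServerCounter(1), ServerCounter(2)):
    for _ in range(2):
        server_id, record_id = counter.next_key()
        conn.execute(
            "INSERT INTO records VALUES (?, ?, ?)", (server_id, record_id, "data")
        )

keys = conn.execute(
    "SELECT server_id, record_id FROM records ORDER BY server_id, record_id"
).fetchall()
```

Because server_id is the leading column of the key, queries filtered by server can use the primary key index directly.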
Answer by MindStalker
What about a hand-crafted UID? Give each of the thousands of servers an ID and make the primary key a composite key of (autoincrement, MachineID)?
Answer by MindStalker
Since the primary key is generated decentralised, you don't have the option of using an auto_increment anyway.
If you don't have to hide the identity of the remote machines, use Type 1 UUIDs instead of Type 4 UUIDs. They are easier to generate and at least will not hurt the performance of the database.
The same goes for varchar (char, really) vs. binary: it can only help matters. Is it really important how much performance is improved?