PostgreSQL using UUID vs Text as primary key

Disclaimer: this is a translated copy of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/33836749/

Date: 2020-10-21 02:05:22 | Source: igfitidea

PostgreSQL using UUID vs Text as primary key

postgresql, primary-key, uuid

Asked by Scottie

Our current PostgreSQL database is using GUIDs as primary keys and storing them as Text fields.

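Roughly, a table in the current schema looks like this (the table and column names here are made up for illustration):

    CREATE TABLE items (
        id   text PRIMARY KEY,   -- holds GUID strings such as 'd2f0a6c4-9a3e-4c1b-8f0e-1234567890ab'
        name text
    );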

My initial reaction to this is that trying to perform any kind of minimal Cartesian join would be an indexing nightmare when trying to find all the matching records. However, perhaps my limited understanding of database indexing is wrong here.

I'm thinking that we should be using UUID instead, since it is stored as a binary representation of the GUID whereas Text is not, and the indexing you get on a Text column is minimal.

It would be a significant project to change these, and I'm wondering if it would be worth it?

Accepted answer by khampson

As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string was either the primary key in a table or part of a unique index.

What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory the comparisons for the index will take a bit longer, but I wouldn't advocate premature optimization if doing so would be painful.

In my experience, I have seen very good performance on a unique index using md5sums on a table with billions of rows. I have found that it tends to be other factors about a query that result in performance issues. For example, when you end up needing to query over a very large swath of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that's what the query planner chooses, and it can take much longer.

There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (e.g. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).

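As a rough sketch of that chunking idea (the table name and date ranges below are made up), one large scan can be split into smaller ranges and stitched back together with UNION ALL:

    SELECT * FROM events WHERE created_at >= '2020-01-01' AND created_at < '2020-02-01'
    UNION ALL
    SELECT * FROM events WHERE created_at >= '2020-02-01' AND created_at < '2020-03-01'
    UNION ALL
    SELECT * FROM events WHERE created_at >= '2020-03-01' AND created_at < '2020-04-01';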

Re: your concern about indexing of text, while I'm sure there are some cases where a dataset produces a key distribution such that it performs terribly, GUIDs, much like md5sums, SHA-1 hashes, etc., should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).

One of the big factors about how an index would perform is how many unique values there are. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since it basically is going to end up having a huge number of row collisions for any of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, is likely to have a huge number of values with no collision (in theory, by definition, since they are GUIDs).

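If you want to see what the planner thinks about a column's cardinality, the pg_stats view exposes its estimate of distinct values; a quick sketch (the table and column names are hypothetical):

    SELECT attname, n_distinct
    FROM   pg_stats
    WHERE  tablename = 'items'
    AND    attname IN ('is_active', 'id');
    -- a small positive n_distinct (e.g. 2 for a boolean) means few distinct values, so an index rarely helps;
    -- n_distinct = -1 means every row is distinct, as you'd expect for a GUID key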

Edit in response to comment from OP:

So are you saying that a UUID guid is the same thing as a Text guid as far as the indexing goes? Our entire table structure is using Text fields with a guid-like string, but I'm not sure Postgres recognizes it as a Guid. Just a string that happens to be unique.

Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.

You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.

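If that switch ever does become worthwhile, the core of it can be a single column-type change; a minimal sketch (assuming a hypothetical table items whose text column id holds well-formed GUID strings, and ignoring any foreign keys that would also need converting):

    ALTER TABLE items
        ALTER COLUMN id TYPE uuid
        USING id::uuid;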

A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes, but could vary significantly depending on the encoding used.

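You can verify those sizes directly with pg_column_size; a quick sketch (the GUID value is just an example):

    SELECT pg_column_size('123e4567-e89b-12d3-a456-426614174000'::uuid) AS uuid_bytes,  -- 16
           pg_column_size('123e4567-e89b-12d3-a456-426614174000'::text) AS text_bytes;  -- 37 (36 characters + 1-byte header)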

So, with that in mind, certainly indexes of text-based UUIDs will be larger since the values are larger, and comparing two strings versus two numerical values is in theory less efficient, but that is not something likely to make a huge difference in this case, at least not in usual cases.

I would not optimize up front when doing so would be a significant cost and is likely never to be needed. That bridge can be crossed if that time does come (although I would pursue other query optimizations first, as I mentioned above).

Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.

e.g. for a text column foo, you can do foo::uuid to cast it to a uuid.

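For example, to compare the value as a real uuid rather than as text (the table name items is hypothetical):

    SELECT *
    FROM   items
    WHERE  foo::uuid = '123e4567-e89b-12d3-a456-426614174000'::uuid;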

There's also a module available for generating uuids, uuid-ossp.

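A minimal sketch of using it (note that on PostgreSQL 13 and later the built-in gen_random_uuid() works without any extension):

    CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
    SELECT uuid_generate_v4();   -- random (version 4) UUID from the extension
    -- SELECT gen_random_uuid(); -- built-in alternative on PostgreSQL 13+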

Answer by Erwin Brandstetter

When dealing with UUID numbers, store them as data type uuid. Always. There is simply no good reason to even consider text as an alternative. Input and output is done via text representation by default anyway. The cast is very cheap.

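A minimal sketch of what that looks like in practice (the table and column names are only examples; gen_random_uuid() assumes PostgreSQL 13+ or the pgcrypto extension):

    CREATE TABLE account (
        account_id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        name       text NOT NULL
    );

    INSERT INTO account (name) VALUES ('alice') RETURNING account_id;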

The data type text requires more space in RAM and on disk, is slower to process and more error prone. @khampson's answer provides most of the rationale. Oddly, he doesn't seem to arrive at the same conclusion.

This has all been asked and answered and discussed before; related questions on dba.SE explain it in detail.

bigint?

Maybe you don't need UUIDs (GUIDs) at all. Consider bigint instead. It only occupies 8 bytes and is faster in every respect. Its range is often underestimated:

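For comparison, a bigint key might look like this (GENERATED ... AS IDENTITY assumes PostgreSQL 10+; on older versions bigserial does the same job; the names are illustrative):

    CREATE TABLE ticket (
        ticket_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        subject   text NOT NULL
    );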

-9223372036854775808 to +9223372036854775807

That's 9.2 millions of millions of millions of positive numbers. IOW, nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six-something billion.

If you burn 1 million IDs per second (which is an insanely high number) you can keep doing so for 292471 years. And then another 292471 years for negative numbers. "Tens or hundreds of millions" is not even close.

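The arithmetic is easy to check (using 365-day years):

    SELECT 9223372036854775807 / (1000000::numeric * 60 * 60 * 24 * 365) AS years;  -- ≈ 292471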

UUID is really just for distributed systems and other special cases.