PostgreSQL：在大型数据库上定义主键

Question

提问by KRTac

I am planing a database to store lots of text. (blog posts, news articles, etc.) The database needs to have the title, content (50k characters max), date, link and language fields. The same content can't occur on one link. Old content (older then 30 days, for example) will be deleted.

我正在规划一个数据库来存储大量文本。（博客文章、新闻文章等）数据库需要有标题、内容（最多 50k 个字符）、日期、链接和语言字段。相同的内容不能出现在一个链接上。旧内容（例如，早于 30 天）将被删除。

Now, the problem is the primary key. I could just set a automatically incrementing (SERIAL type) field and use it as a primary key. But, it seems stupid and a waste of disc space, because the field wouldn't serve any purpose but to be a primary key. (and the field could eventually run out, or not?) And there's always the other performance issue: the content of each new row inserted needs to be checked for duplicates. So the other solution for the primary key I've come up with would be to calculate a sha256 hash of content+link value and then put that in a new 'hash' column and use that as a primary key. Two birds with one stone. Of course, the problem with that are hash collisions. Is it a big threat?

现在，问题是主键。我可以设置一个自动递增（SERIAL 类型）字段并将其用作主键。但是，这似乎很愚蠢并且浪费了磁盘空间，因为该字段除了作为主键之外没有任何用途。（并且该字段最终可能会用完？）而且总是存在其他性能问题：需要检查插入的每个新行的内容是否重复。所以我想出的主键的另一个解决方案是计算内容+链接值的 sha256 散列，然后将其放入新的“散列”列中并将其用作主键。一石二鸟。当然，问题是哈希冲突。是不是很大的威胁？

I don't have any experience with PostgreSQL, and very little experience with DBMS's generally, so I would appreciate a second opinion before a create a database with the performance characteristics of a snail on the highway (horrible comparison).

我对 PostgreSQL 没有任何经验，对 DBMS 的经验通常也很少，所以我希望在创建具有高速公路上蜗牛性能特征的数据库之前获得第二意见（可怕的比较）。

Please help me out here if you have any experience with large databases. Is setting a 64 character string as a primary key field a good idea in my situation? (because I'm under the impression that generally this is avoided)

如果您对大型数据库有任何经验，请在这里帮助我。在我的情况下，将 64 个字符串设置为主键字段是个好主意吗？（因为我的印象是通常可以避免这种情况）

Answer 1

采纳答案by rfusca

Just did this exact test for a rather medium-large DB (200GB+), bigserial won by quite a large margin. It was faster to generate, faster to join, less code, smaller footprint. Because of the way postgres stores it, a bigint is negligible compared to a normal int. You'll run out of storage space from your content long before you ever have to worry about overflowing the bigint. Having done the computed hash vs bigint - surrogate bigint all the way.

刚刚对一个相当大的数据库（200GB+）做了这个精确的测试，bigserial 以相当大的优势获胜。生成速度更快，加入速度更快，代码更少，占用空间更小。由于 postgres 存储它的方式，与普通 int 相比，bigint 可以忽略不计。早在您不必担心 bigint 溢出之前，您的内容就会耗尽存储空间。完成计算散列与 bigint - 一直代理 bigint。

Answer 2

回答by Lars Haugseth

I would choose to use a surrogate key, ie. a key that isn't part of the business data of your application. The additional space requirements of an additional 64-bit integer when you're dealing with upto 50 kilobytes of text per record is negligible. You will actually be using lessspace as soon as you're starting using this key as a foreign key in other tables.

我会选择使用代理键，即。不属于应用程序业务数据一部分的密钥。当您处理每条记录多达 50 KB 的文本时，额外的 64 位整数的额外空间需求可以忽略不计。一旦您开始将此键用作其他表中的外键，您实际上将使用更少的空间。

Using a hash of the data stored in a record is a very bad candidate for a primary key, should the data on which the hash is based ever change. You will then have changed the primary key as well, resulting in updates all over the place if you have relations from other tables to this one.

如果散列所基于的数据发生变化，那么使用存储在记录中的数据的散列对于主键来说是非常糟糕的候选者。然后，您也将更改主键，如果您有其他表与此表的关系，则会导致整个地方的更新。

PS. A similar question has been asked and answered here before.

附注。一个类似的问题已经被问过并在这里回答过。

Here's another nice write-up about the topic: http://www.agiledata.org/essays/keys.html

这是关于该主题的另一篇不错的文章：http: //www.agiledata.org/essays/keys.html

Answer 3

回答by HLGEM

You'd have to have an awful lot of records before your primary key integer ran out.

在主键整数用完之前，您必须拥有大量记录。

The integer will be faster for joins than a 64 character string primary key would be. Also it is much easier for people writing queries to deal with.

连接的整数将比 64 个字符串的主键更快。此外，编写查询的人更容易处理。

If a collision is ever possible, you can't use the hash as your primary key. Primary keys must be guarnateed to be unique by definintion.

如果可能发生冲突，则不能使用散列作为主键。主键必须通过定义来保证是唯一的。

I've seen hundreds of production databases for different corporations and government entities and not one used a hash primary key. Think there might be a reason?

我见过数百个不同公司和政府实体的生产数据库，但没有一个使用哈希主键。想想可能有什么原因？

But, it seems stupid and a waste of disc space, because the field wouldn't serve any purpose but to be a primary key.

但是，这似乎很愚蠢并且浪费了磁盘空间，因为该字段除了作为主键之外没有任何用途。

Since a surrogate primary key should always be meaningless except as a primary key, I'm not sure what your objection would be.

由于代理主键除了作为主键之外应该始终没有意义，我不确定您的反对意见是什么。

Answer 4

回答by bmdhacks

Some suggestions:

一些建议：

The disk storage of a 64 bit primary-key integer is negligible no matter how much content you have.
You'll never collide SHA256, and using it as a unique id isn't a bad idea.

无论您拥有多少内容，64 位主键整数的磁盘存储都可以忽略不计。
您永远不会与 SHA256 发生冲突，并且将其用作唯一 ID 并不是一个坏主意。

One nice thing about the hash method is that you don't have a single sequence source to generate new primary keys. This can be useful if your database needs to be segmented in some manner (say geographical distribution) for future scaling, as you don't have to worry about collisions, or a single-point-of-failure that generates sequences.

hash 方法的一个好处是你没有一个单一的序列源来生成新的主键。如果您的数据库需要以某种方式（例如地理分布）进行分段以进行未来扩展，这将非常有用，因为您不必担心冲突或生成序列的单点故障。

From a coding perspective, having a single primary key can be vital for joining on extra tables of data you may add in the future. I highly recommend you use one. There are benefits to either of your proposed approaches, but the hash method might be the preferred one, just because autoincrement/sequence values can cause scalability issues sometimes.

从编码的角度来看，拥有一个主键对于加入您将来可能添加的额外数据表至关重要。我强烈建议你使用一个。您提出的任何一种方法都有好处，但散列方法可能是首选方法，只是因为自动增量/序列值有时会导致可伸缩性问题。

Answer 5

回答by Jeff Ferland

Hashes are bad ideas for primary keys. They make the inserts end up in random order in the table, and that gets very costly as things have to be reallocated (though Postgres doesn't really apply that the way others do). I suggest a sequential primary key which may be a fine-grained timestamp / timestamp with sequential number following, letting you kill two birds with a stone, and a second unique index that contains your hash codes. Keep in mind you want to keep your primary key as a smaller (64 bit or less) column.

哈希对于主键来说是个坏主意。它们使插入在表中以随机顺序结束，并且由于必须重新分配事情而变得非常昂贵（尽管 Postgres 并没有像其他人那样真正应用）。我建议一个顺序主键，它可能是一个细粒度的时间戳/时间戳，后面有序列号，让你用石头杀死两只鸟，以及包含你的哈希码的第二个唯一索引。请记住，您希望将主键保留为较小（64 位或更少）的列。

See the table at http://en.wikipedia.org/wiki/Birthday_attack#The_mathematicsso you can be confident you won't have a collision.

请参阅http://en.wikipedia.org/wiki/Birthday_attack#The_mathematics 上的表格，这样您就可以确信不会发生碰撞。

Don't forget to vacuum.

不要忘记抽真空。

Answer 6

回答by josef

I would use an ordinary 32bit integer as a primary key. I don't think you will exceed that number very soon :-) The whole Wikipedia has about 3,5 millions articles... If you wrote 1000 articles per day it would take almost 6000 years to reach the maximum of the integer type.

我会使用一个普通的 32 位整数作为主键。我认为你不会很快超过这个数字 :-) 整个维基百科大约有 350 万篇文章......如果你每天写 1000 篇文章，则需要将近 6000 年才能达到整数类型的最大值。

PostgreSQL：在大型数据库上定义主键

提问by KRTac

采纳答案by rfusca

回答by Lars Haugseth

回答by HLGEM

回答by bmdhacks

回答by Jeff Ferland

回答by josef

相关推荐

最近更新

标签

PostgreSQL：在大型数据库上定义主键

提问by KRTac

采纳答案by rfusca

回答by Lars Haugseth

回答by HLGEM

回答by bmdhacks

回答by Jeff Ferland

回答by josef

相关推荐

postgresql SQL 数据库表中的多态性？

postgresql 将 postgres 服务器扩展到多个服务器

postgresql 在 order by 子句中使用布尔表达式

将访问权限转换为 PostgreSQL？

相关推荐

最近更新

标签