postgresql 在 Postgres 中创建索引的最有效方法

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/18580448/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-11 00:23:02  来源:igfitidea点击:

Most Efficient Way to Create an Index in Postgres

performancepostgresqlindexing

提问by mvp

Is it more efficient to create an index after loading data is complete or before, or does it not matter?

在加载数据完成之后还是之前创建索引更有效,还是无关紧要?

For example, say I have 500 files to load into a Postgres 8.4 DB. Here are the two index creation scenarios I could use:

例如,假设我有 500 个文件要加载到 Postgres 8.4 DB 中。以下是我可以使用的两个索引创建场景:

  1. Create index when table is created, then load each file into table; or
  2. Create index after all files have been loaded into the table.
  1. 创建表时创建索引,然后将每个文件加载到表中;或者
  2. 在所有文件加载到表中后创建索引。

The table data itself is about 45 Gigabytes. The index is about 12 Gigabytes. I'm using a standard index. It is created like this:

表数据本身大约为 45 GB。该索引约为 12 GB。我正在使用标准索引。它是这样创建的:

CREATE INDEX idx_name ON table_name (column_name);

My data loading uses COPY FROM.

我的数据加载使用COPY FROM

Once all the files are loaded, no updates, deletes or additional loads will occur on the table (it's a day's worth of data that will not change). So I wanted to ask which scenario would be most efficient? Initial testing seems to indicate that loading all the files and then creating the index (scenario 2) is faster, but I have not done a scientific comparison of the two approaches.

一旦加载了所有文件,表上就不会发生更新、删除或额外加载(这是一天的数据不会改变)。所以我想问一下哪种情况最有效?初步测试似乎表明加载所有文件然后创建索引(场景 2)更快,但我没有对这两种方法进行科学比较。

回答by mvp

Your observation is correct - it is much more efficient to load data first and only then create index. Reason for this is that index updates during insert are expensive. If you create index after all data is there, it is much faster.

您的观察是正确的 - 首先加载数据然后再创建索引效率更高。这样做的原因是插入期间的索引更新很昂贵。如果在所有数据都存在后创建索引,速度会快得多。

It goes even further - if you need to import large amount of data into existing indexed table, it is often more efficient to drop existing index first, import the data, and then re-create index again.

更进一步 - 如果您需要将大量数据导入现有索引表,通常先删除现有索引,导入数据,然后再次重新创建索引,效率更高。

One downside of creating index after importing is that table must be locked, and that may take long time (it will not be locked in opposite scenario). But, in PostgreSQL 8.2 and later, you can use CREATE INDEX CONCURRENTLY, which does not lock table during indexing (with some caveats).

导入后创建索引的一个缺点是表必须被锁定,这可能需要很长时间(在相反的情况下不会被锁定)。但是,在 PostgreSQL 8.2 及更高版本中,您可以使用CREATE INDEX CONCURRENTLY,它不会在索引期间锁定表(有一些警告)。