SQL 数据库索引如何工作?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/1108/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-31 23:03:35  来源:igfitidea点击:

How does database indexing work?

sqldatabaseperformanceindexingdatabase-indexes

提问by Xenph Yan

Given that indexing is so important as your data set increases in size, can someone explain how indexing works at a database-agnostic level?

鉴于随着数据集大小的增加,索引非常重要,有人可以解释索引如何在与数据库无关的级别工作吗?

For information on queries to index a field, check out How do I index a database column.

有关索引字段的查询的信息,请查看如何索引数据库列

回答by Xenph Yan

Why is it needed?

为什么需要它?

When data is stored on disk-based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists; both contain a section for data, a pointer to the location of the next node (or block), and both need not be stored contiguously.

当数据存储在基于磁盘的存储设备上时,它被存储为数据块。这些块被整体访问,使它们成为原子磁盘访问操作。磁盘块的结构与链表非常相似;两者都包含一个数据部分,一个指向下一个节点(或块)位置的指针,并且两者都不需要连续存储。

Due to the fact that a number of records can only be sorted on one field, we can state that searching on a field that isn't sorted requires a Linear Search which requires N/2block accesses (on average), where Nis the number of blocks that the table spans. If that field is a non-key field (i.e. doesn't contain unique entries) then the entire tablespace must be searched at Nblock accesses.

由于许多记录只能在一个字段上排序,我们可以声明在未排序的字段上搜索需要线性搜索,这需要N/2块访问(平均),其中N的块数是桌子跨越。如果该字段是非关键字段(即不包含唯一条目),则必须在N块访问时搜索整个表空间。

Whereas with a sorted field, a Binary Search may be used, which has log2 Nblock accesses. Also since the data is sorted given a non-key field, the rest of the table doesn't need to be searched for duplicate values, once a higher value is found. Thus the performance increase is substantial.

而对于排序字段,可以使用二分搜索,它具有log2 N块访问。此外,由于数据是在给定非键字段的情况下排序的,因此一旦找到更高的值,就不需要搜索表的其余部分以查找重复值。因此,性能提升是显着的。

What is indexing?

什么是索引?

Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and a pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.

索引是一种对多个字段上的大量记录进行排序的方法。在表中的字段上创建索引会创建另一个数据结构,该结构保存字段值和指向它相关记录的指针。然后对该索引结构进行排序,允许对其执行二分搜索。

The downside to indexing is that these indices require additional space on the disk since the indices are stored together in a table using the MyISAM engine, this file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.

索引的缺点是这些索引需要额外的磁盘空间,因为这些索引使用 MyISAM 引擎一起存储在一个表中,如果对同一表中的许多字段进行索引,该文件可以快速达到底层文件系统的大小限制.

How does it work?

它是如何工作的?

Firstly, let's outline a sample database table schema;

首先,让我们概述一个示例数据库表架构;

Field name       Data type      Size on disk
id (Primary key) Unsigned INT   4 bytes
firstName        Char(50)       50 bytes
lastName         Char(50)       50 bytes
emailAddress     Char(100)      100 bytes

Note: char was used in place of varchar to allow for an accurate size on disk value. This sample database contains five million rows and is unindexed. The performance of several queries will now be analyzed. These are a query using the id(a sorted key field) and one using the firstName(a non-key unsorted field).

注意:使用 char 代替 varchar 以允许精确的磁盘值大小。此示例数据库包含 500 万行并且未编入索引。现在将分析几个查询的性能。这些是使用id(排序的键字段)的查询和使用firstName(非键的未排序字段)的查询。

Example 1- sorted vs unsorted fields

示例 1-排序与未排序的字段

Given our sample database of r = 5,000,000records of a fixed size giving a record length of R = 204bytes and they are stored in a table using the MyISAM engine which is using the default block size B = 1,024bytes. The blocking factor of the table would be bfr = (B/R) = 1024/204 = 5records per disk block. The total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000blocks.

给定我们r = 5,000,000的固定大小记录示例数据库,给出R = 204字节记录长度,它们使用 MyISAM 引擎存储在表中,该引擎使用默认块大小B = 1,024字节。表的bfr = (B/R) = 1024/204 = 5块因子将是每个磁盘块的记录。持有该表所需的块总数为N = (r/bfr) = 5000000/5 = 1,000,000块。

A linear search on the id field would require an average of N/2 = 500,000block accesses to find a value, given that the id field is a key field. But since the id field is also sorted, a binary search can be conducted requiring an average of log2 1000000 = 19.93 = 20block accesses. Instantly we can see this is a drastic improvement.

N/2 = 500,000鉴于 id 字段是一个关键字段,对 id 字段的线性搜索需要平均块访问才能找到一个值。但由于 id 字段也已排序,因此可以执行需要平均log2 1000000 = 19.93 = 20块访问的二分查找。我们立即可以看到这是一个巨大的改进。

Now the firstNamefield is neither sorted nor a key field, so a binary search is impossible, nor are the values unique, and thus the table will require searching to the end for an exact N = 1,000,000block accesses. It is this situation that indexing aims to correct.

现在firstName字段既不是排序的也不是键字段,所以二进制搜索是不可能的,值也不是唯一的,因此该表将需要搜索到最后以获取精确的N = 1,000,000块访问。索引旨在纠正这种情况。

Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks than the original table, which therefore requires fewer block accesses to iterate through. The schema for an index on the firstNamefield is outlined below;

鉴于一个索引记录只包含索引字段和一个指向原始记录的指针,它比它指向的多字段记录小是有道理的。因此索引本身比原始表需要更少的磁盘块,因此需要更少的块访问来迭代。下面概述了firstName字段上索引的架构;

Field name       Data type      Size on disk
firstName        Char(50)       50 bytes
(record pointer) Special        4 bytes

Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.

注意:MySQL 中的指针长度为 2、3、4 或 5 个字节,具体取决于表的大小。

Example 2- indexing

示例 2-索引

Given our sample database of r = 5,000,000records with an index record length of R = 54bytes and using the default block size B = 1,024bytes. The blocking factor of the index would be bfr = (B/R) = 1024/54 = 18records per disk block. The total number of blocks required to hold the index is N = (r/bfr) = 5000000/18 = 277,778blocks.

给定我们的示例r = 5,000,000记录数据库,其索引记录长度为R = 54字节并使用默认块大小B = 1,024字节。索引的阻塞因子是bfr = (B/R) = 1024/54 = 18每个磁盘块的记录。保存索引所需的块总数为N = (r/bfr) = 5000000/18 = 277,778块。

Now a search using the firstNamefield can utilize the index to increase performance. This allows for a binary search of the index with an average of log2 277778 = 18.08 = 19block accesses. To find the address of the actual record, which requires a further block access to read, bringing the total to 19 + 1 = 20block accesses, a far cry from the 1,000,000 block accesses required to find a firstNamematch in the non-indexed table.

现在使用firstName字段的搜索可以利用索引来提高性能。这允许使用平均log2 277778 = 18.08 = 19块访问对索引进行二分搜索。查找实际记录的地址,这需要进一步的块访问才能读取,使总19 + 1 = 20访问量达到块访问,这与在非索引表中找到firstName匹配所需的 1,000,000 次块访问相去甚远。

When should it be used?

应该什么时候使用?

Given that creating an index requires additional disk space (277,778 blocks extra from the above example, a ~28% increase), and that too many indices can cause issues arising from the file systems size limits, careful thought must be used to select the correct fields to index.

鉴于创建索引需要额外的磁盘空间(从上面的示例中额外增加了 277,778 个块,增加了约 28%),并且索引过多会导致文件系统大小限制引起的问题,因此必须仔细考虑以选择正确的要索引的字段。

Since indices are only used to speed up the searching for a matching field within the records, it stands to reason that indexing fields used only for output would be simply a waste of disk space and processing time when doing an insert or delete operation, and thus should be avoided. Also given the nature of a binary search, the cardinality or uniqueness of the data is important. Indexing on a field with a cardinality of 2 would split the data in half, whereas a cardinality of 1,000 would return approximately 1,000 records. With such a low cardinality the effectiveness is reduced to a linear sort, and the query optimizer will avoid using the index if the cardinality is less than 30% of the record number, effectively making the index a waste of space.

由于索引仅用于加快在记录中搜索匹配字段的速度,因此在执行插入或删除操作时,仅用于输出的索引字段只会浪费磁盘空间和处理时间,因此按理说应该避免。考虑到二进制搜索的性质,数据的基数或唯一性也很重要。对基数为 2 的字段进行索引会将数据分成两半,而基数为 1,000 将返回大约 1,000 条记录。如此低的基数,效率降低为线性排序,如果基数小于记录数的 30%,查询优化器将避免使用索引,有效地使索引浪费空间。

回答by 147.3k

Classic example "Index in Books"

经典示例“书籍索引”

Consider a "Book" of 1000 pages, divided by 10 Chapters, each section with 100 pages.

考虑一本 1000 页的“书”,分为 10 个章节,每个部分有 100 页。

Simple, huh?

很简单吧?

Now, imagine you want to find a particular Chapter that contains a word "Alchemist". Without an index page, you have no other option than scanning through the entire book/Chapters. i.e: 1000 pages.

现在,假设您要查找包含“炼金术士”一词的特定章节。如果没有索引页,除了扫描整本书/章节之外,您别无选择。即:1000 页。

This analogy is known as "Full Table Scan"in database world.

这个类比在数据库世界中被称为“全表扫描”

enter image description here

在此处输入图片说明

But with an index page, you know where to go! And more, to lookup any particular Chapter that matters, you just need to look over the index page, again and again, every time. After finding the matching index you can efficiently jump to that chapter by skipping the rest.

但是有了索引页,您就知道该去哪里了!而且,要查找任何重要的特定章节,您只需要一次又一次地查看索引页面。找到匹配的索引后,您可以通过跳过其余部分有效地跳到该章节。

But then, in addition to actual 1000 pages, you will need another ~10 pages to show the indices, so totally 1010 pages.

但是,除了实际的 1000 页之外,您还需要大约 10 页来显示索引,因此总共 1010 页。

Thus, the index is a separate section that stores values of indexed column + pointer to the indexed row in a sorted order for efficient look-ups.

因此,索引是一个单独的部分,它按排序顺序存储索引列的值 + 指向索引行的指针,以便高效查找。

Things are simple in schools, isn't it? :P

学校里的事情很简单,不是吗?:P

回答by Der U

The first time I read this it was very helpful to me. Thank you.

第一次读到这篇文章对我很有帮助。谢谢你。

Since then I gained some insight about the downside of creating indexes: if you write into a table (UPDATEor INSERT) with one index, you have actually two writing operations in the file system. One for the table data and another one for the index data (and the resorting of it (and - if clustered - the resorting of the table data)). If table and index are located on the same hard disk this costs more time. Thus a table without an index (a heap) , would allow for quicker write operations. (if you had two indexes you would end up with three write operations, and so on)

从那时起,我对创建索引的缺点有了一些了解:如果您使用一个索引写入表(UPDATEINSERT),则实际上在文件系统中有两个写入操作。一个用于表数据,另一个用于索引数据(以及对它的重新排序(以及 - 如果集群 - 对表数据的重新排序))。如果表和索引位于同一个硬盘上,这会花费更多时间。因此,没有索引(堆)的表将允许更快的写入操作。(如果你有两个索引,你最终会得到三个写操作,依此类推)

However, defining two different locations on two different hard disks for index data and table data can decrease/eliminate the problem of increased cost of time. This requires definition of additional file groups with according files on the desired hard disks and definition of table/index location as desired.

但是,在两个不同的硬盘上为索引数据和表数据定义两个不同的位置可以减少/消除时间成本增加的问题。这需要使用所需硬盘上的相应文件定义附加文件组,并根据需要定义表/索引位置。

Another problem with indexes is their fragmentation over time as data is inserted. REORGANIZEhelps, you must write routines to have it done.

索引的另一个问题是随着数据的插入,它们会随着时间的推移而碎片化。REORGANIZE有帮助,您必须编写例程才能完成。

In certain scenarios a heap is more helpful than a table with indexes,

在某些情况下,堆比带索引的表更有帮助,

e.g:- If you have lots of rivalling writes but only one nightly read outside business hours for reporting.

例如:- 如果你有很多竞争性的写作,但只有在工作时间以外每晚阅读一次以进行报告。

Also, a differentiation between clustered and non-clustered indexes is rather important.

此外,聚集索引和非聚集索引之间的区别相当重要。

Helped me:- What do Clustered and Non clustered index actually mean?

帮助了我:-聚集索引和非聚集索引实际上是什么意思?

回答by hcarreras

An index is just a data structure that makes the searching faster for a specific column in a database. This structure is usually a b-tree or a hash table but it can be any other logic structure.

索引只是一种数据结构,它可以更快地搜索数据库中的特定列。这种结构通常是一个 b 树或一个哈希表,但它可以是任何其他逻辑结构。

回答by Somnath Muluk

Now, let's say that we want to run a query to find all the details of any employees who are named ‘Abc'?

现在,假设我们要运行查询以查找名为“Abc”的任何员工的所有详细信息?

SELECT * FROM Employee 
WHERE Employee_Name = 'Abc'

What would happen without an index?

没有索引会发生什么?

Database software would literally have to look at every single row in the Employee table to see if the Employee_Name for that row is ‘Abc'. And, because we want every row with the name ‘Abc' inside it, we can not just stop looking once we find just one row with the name ‘Abc', because there could be other rows with the name Abc. So, every row up until the last row must be searched – which means thousands of rows in this scenario will have to be examined by the database to find the rows with the name ‘Abc'. This is what is called a full table scan

数据库软件实际上必须查看 Employee 表中的每一行,以查看该行的 Employee_Name 是否为“Abc”。而且,因为我们希望在其中包含名称为 'Abc' 的每一行,一旦我们找到名称为 'Abc' 的一行,我们不能就停止查找,因为可能还有其他名称为Abc 的行。因此,必须搜索直到最后一行的每一行——这意味着在这种情况下,数据库必须检查数千行才能找到名为“Abc”的行。这就是所谓的全表扫描

How a database index can help performance

数据库索引如何帮助提高性能

The whole point of having an index is to speed up search queries by essentially cutting down the number of records/rows in a table that need to be examined. An index is a data structure (most commonly a B- tree) that stores the values for a specific column in a table.

拥有索引的全部意义在于通过从本质上减少表中需要检查的记录/行数来加速搜索查询。索引是一种数据结构(最常见的是 B 树),用于存储表中特定列的值。

How does B-trees index work?

B 树索引是如何工作的?

The reason B- trees are the most popular data structure for indexes is due to the fact that they are time efficient – because look-ups, deletions, and insertions can all be done in logarithmic time. And, another major reason B- trees are more commonly used is because the data that is stored inside the B- tree can be sorted. The RDBMS typically determines which data structure is actually used for an index. But, in some scenarios with certain RDBMS's, you can actually specify which data structure you want your database to use when you create the index itself.

B-tree 之所以成为最流行的索引数据结构,是因为它们具有时间效率——因为查找、删除和插入都可以在对数时间内完成。而且,B-tree 更常用的另一个主要原因是因为可以对存储在 B-tree 中的数据进行排序。RDBMS 通常确定哪个数据结构实际用于索引。但是,在某些使用某些 RDBMS 的场景中,您实际上可以在创建索引时指定希望数据库使用的数据结构。

How does a hash table index work?

哈希表索引如何工作?

The reason hash indexes are used is because hash tables are extremely efficient when it comes to just looking up values. So, queries that compare for equality to a string can retrieve values very fast if they use a hash index.

使用散列索引的原因是因为散列表在查找值时非常有效。因此,如果查询与字符串的相等性比较,则它们可以非常快速地检索值,如果它们使用哈希索引。

For instance, the query we discussed earlier could benefit from a hash index created on the Employee_Name column. The way a hash index would work is that the column value will be the key into the hash table and the actual value mapped to that key would just be a pointer to the row data in the table. Since a hash table is basically an associative array, a typical entry would look something like “Abc => 0x28939″, where 0x28939 is a reference to the table row where Abc is stored in memory. Looking up a value like “Abc” in a hash table index and getting back a reference to the row in memory is obviously a lot faster than scanning the table to find all the rows with a value of “Abc” in the Employee_Name column.

例如,我们之前讨论的查询可以受益于在 Employee_Name 列上创建的哈希索引。哈希索引的工作方式是,列值将成为哈希表中的键,映射到该键的实际值只是指向表中行数据的指针。由于哈希表基本上是一个关联数组,因此典型的条目看起来像“Abc => 0x28939”,其中 0x28939 是对表行的引用,其中 Abc 存储在内存中。在哈希表索引中查找像“Abc”这样的值并返回对内存中行的引用显然比扫描表以查找 Employee_Name 列中具有“Abc”值的所有行快得多。

The disadvantages of a hash index

哈希索引的缺点

Hash tables are not sorted data structures, and there are many types of queries which hash indexes can not even help with. For instance, suppose you want to find out all of the employees who are less than 40 years old. How could you do that with a hash table index? Well, it's not possible because a hash table is only good for looking up key value pairs – which means queries that check for equality

哈希表不是排序的数据结构,并且有许多类型的查询,哈希索引甚至无法提供帮助。例如,假设您想找出所有 40 岁以下的员工。你怎么能用哈希表索引做到这一点?嗯,这是不可能的,因为哈希表仅适用于查找键值对——这意味着检查相等性的查询

What exactly is inside a database index?So, now you know that a database index is created on a column in a table, and that the index stores the values in that specific column. But, it is important to understand that a database index does not store the values in the other columns of the same table. For example, if we create an index on the Employee_Name column, this means that the Employee_Age and Employee_Address column values are not also stored in the index. If we did just store all the other columns in the index, then it would be just like creating another copy of the entire table – which would take up way too much space and would be very inefficient.

数据库索引里面到底是什么?所以,现在您知道数据库索引是在表中的列上创建的,并且该索引将值存储在该特定列中。但是,重要的是要了解数据库索引不会将值存储在同一表的其他列中。例如,如果我们在 Employee_Name 列上创建索引,这意味着 Employee_Age 和 Employee_Address 列值也不会存储在索引中。如果我们只将所有其他列存储在索引中,那么就像创建整个表的另一个副本一样——这会占用太多空间并且效率很低。

How does a database know when to use an index?When a query like “SELECT * FROM Employee WHERE Employee_Name = ‘Abc' ” is run, the database will check to see if there is an index on the column(s) being queried. Assuming the Employee_Name column does have an index created on it, the database will have to decide whether it actually makes sense to use the index to find the values being searched – because there are some scenarios where it is actually less efficient to use the database index, and more efficient just to scan the entire table.

数据库如何知道何时使用索引?当像“SELECT * FROM Employee WHERE Employee_Name = 'Abc'”这样的查询运行时,数据库将检查被查询的列是否有索引。假设 Employee_Name 列上确实创建了一个索引,数据库将不得不决定使用索引来查找正在搜索的值是否真的有意义——因为在某些情况下,使用数据库索引实际上效率较低,并且更有效地扫描整个表。

What is the cost of having a database index?

拥有数据库索引的成本是多少?

It takes up space – and the larger your table, the larger your index. Another performance hit with indexes is the fact that whenever you add, delete, or update rows in the corresponding table, the same operations will have to be done to your index. Remember that an index needs to contain the same up to the minute data as whatever is in the table column(s) that the index covers.

它占用空间——你的表越大,你的索引就越大。索引的另一个性能问题是,无论何时添加、删除或更新相应表中的行,都必须对索引执行相同的操作。请记住,索引需要包含与索引涵盖的表列中的任何内容相同的分钟数据。

As a general rule, an index should only be created on a table if the data in the indexed column will be queried frequently.

作为一般规则,只有当索引列中的数据会被频繁查询时,才应在表上创建索引。

See also

也可以看看

  1. What columns generally make good indexes?
  2. How do database indexes work
  1. 哪些列通常是好的索引?
  2. 数据库索引是如何工作的

回答by ProgrammerPanda

Simple Description!

简单说明!

The index is nothing but a data structure that stores the values for a specific columnin a table. An index is created on a column of a table.

索引只不过是一种数据结构,用于存储表中特定列的值。索引是在表的列上创建的。

Example: We have a database table called Userwith three columns – Name, Ageand Address. Assume that the Usertable has thousands of rows.

例如:我们有一个叫做数据库表User有三列- NameAgeAddress。假设该User表有数千行。

Now, let's say that we want to run a query to find all the details of any users who are named 'John'. If we run the following query:

现在,假设我们要运行查询以查找名为“John”的任何用户的所有详细信息。如果我们运行以下查询:

SELECT * FROM User 
WHERE Name = 'John'

The database software would literally have to look at every single row in the Usertable to see if the Namefor that row is ‘John'. This will take a long time.

数据库软件实际上必须查看表中的每一行,User以查看Name该行的名称是否为“John”。这将需要很长时间。

This is where indexhelps us: index is used to speed up search queries by essentially cutting down the number of records/rows in a table that needs to be examined.

index对我们有帮助:索引用于通过基本上减少表中需要检查的记录/行数来加速搜索查询

How to create an index:

创建索引的方法:

CREATE INDEX name_index
ON User (Name)

An indexconsists of column values(Eg: John) from one table, and those values are stored in a data structure.

Anindex一个表中列值(例如:John)组成,这些值存储在一个数据结构中

So now the database will use the index to find employees named John because the index will presumably be sorted alphabetically by the Users name. And, because it is sorted, it means searching for a name is a lot faster because all names starting with a “J” will be right next to each other in the index!

所以现在数据库将使用索引来查找名为 John 的员工,因为该索引可能会按用户姓名的字母顺序排序。而且,因为它是排序的,这意味着搜索名称要快得多,因为所有以“J”开头的名称将在索引中彼此相邻!

回答by Raza

Just a quick suggestion.. As indexing costs you additional writes and storage space, so if your application requires more insert/update operation, you might want to use tables without indexes, but if it requires more data retrieval operations, you should go for indexed table.

只是一个简单的建议.. 因为索引需要额外的写入和存储空间,所以如果你的应用程序需要更多的插入/更新操作,你可能想要使用没有索引的表,但如果它需要更多的数据检索操作,你应该去索引桌子。

回答by Alf Moh

Just think of Database Index as Index of a book.

只需将数据库索引视为一本书的索引。

If you have a book about dogs and you want to find an information about let's say, German Shepherds, you could of course flip through all the pages of the book and find what you are looking for - but this of course is time consuming and not very fast.

如果你有一本关于狗的书,并且你想找到关于德国牧羊犬的信息,你当然可以翻阅这本书的所有页面并找到你要找的东西——但这当然很耗时,而不是非常快。

Another option is that, you could just go to the Index section of the book and then find what you are looking for by using the Name of the entity you are looking ( in this instance, German Shepherds) and also looking at the page number to quickly find what you are looking for.

另一种选择是,您可以直接转到本书的“索引”部分,然后通过使用您正在查找的实体的名称(在本例中为德国牧羊犬)并查看页码来查找您要查找的内容快速找到您要找的东西。

In Database, the page number is referred to as a pointer which directs the database to the address on the disk where entity is located. Using the same German Shepherd analogy, we could have something like this (“German Shepherd”, 0x77129) where 0x77129is the address on the disk where the row data for German Shepherd is stored.

在数据库中,页码被称为指针,它将数据库指向实体所在磁盘上的地址。使用相同的德国牧羊犬类比,我们可以有这样的东西(“德国牧羊犬”,0x77129) 其中0x77129是磁盘上存储德国牧羊犬行数据的地址。

In short, an index is a data structure that stores the values for a specific column in a table so as to speed up query search.

简而言之,索引是一种数据结构,用于存储表中特定列的值,以加快查询搜索速度。