Declaration: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA terms and attribute it to the original authors (not the translator). Original: http://stackoverflow.com/questions/15859070/

Blocking factor in a DBMS

Tags: database, indexing

Asked by Sim

What is the blocking factor in a DBMS?

The material I looked at said it was the floored value of records per block (so floor(B/R)), where B is the block size and R is the record size. I was just wondering, can someone tell me the main reason it's used, and also whether it is actually FLOORED?

My understanding of FLOORED is 1.5 gets floored to 1.0, for anyone that is wondering.

Answered by Tarnay Kálmán

Yes, it means how many whole records fit into a block.

(A block is the smallest unit of data that the underlying storage system (HDD, SAN filesystem, etc.) is willing to deal in. Its size is traditionally 512 bytes for hard drives.)

It is floored because if 100 and a half records would fit, one still only stores 100 whole records per block.

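To make the flooring concrete, here is a tiny Python sketch; the block and record sizes below are made up purely for illustration:

    block_size = 5025   # bytes per block (made-up value, just for illustration)
    record_size = 50    # bytes per record (made-up value)

    # 5025 / 50 = 100.5, but only 100 whole records actually fit in a block,
    # so we take the floor (integer division):
    blocking_factor = block_size // record_size
    print(blocking_factor)   # 100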

The blocking factor is used heavily in many DBMS-related calculations.

For example:

The problem

We have 10,000,000 records. Each record is 80 bytes long. Each record contains a unique key (let's say a social security number). We want looking up someone by their social security number to be fast.

But what is fast?

We need something to measure performance by. The thing that takes the most time is fetching a block from the hard disk. You know, it is a mechanical device: it has to reposition its head and so on, so it is a really slow operation compared to how fast the CPU is, or how fast main memory (RAM) access is. Okay, let's say that we measure the performance of an operation by how many disk accesses it takes. We want to minimize the number of disk accesses. Now we know how to tell if something is slow or fast.

Many disk accesses -> bad

Very few disk accesses -> good

Calculating how many blocks our data needs

Let's say that on our imaginary hardware, each block is 5000 bytes. We want to calculate how many blocks we need. First, we need to know how many records fit into a single block:

Blocking factor = floored(Block size / Record size) = floored(5000/80) = floored(62.5) = 62 records/block

And we have 10,000,000 records, so we need ceiled(10000000/62) = ceiled(161290.32) = 161291 blocks to store all that data.

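As a quick sanity check, here is a small Python sketch of the same sizing calculation (the numbers are the ones from this example; the variable names are just for illustration):

    import math

    record_count = 10000000
    record_size = 80    # bytes per record
    block_size = 5000   # bytes per block, on our imaginary hardware

    blocking_factor = block_size // record_size               # floored(5000/80) = 62
    data_blocks = math.ceil(record_count / blocking_factor)   # ceiled(10000000/62) = 161291
    print(blocking_factor, data_blocks)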

Whoa, that's a lot of data. How do I look up someone fast?

If one were to read all the blocks to find a single record by the key (social security number), then that would take 161291 disk accesses. Not good.

We can do better. Let's build an index file. We will build a sparse index.

A sparse index in databases is a file with pairs of keys and pointers for every block in the data file. Every key in this file is associated with a particular pointer to the block in the sorted data file. In clustered indices with duplicate keys, the sparse index points to the lowest search key in each block.

Okay, so we will have a pointer and a key in our index file for each block. Let's say that on our imaginary hardware a pointer is 4 bytes long, and in our imaginary world a social security number (key) takes up 6 bytes.

So we are going to store one 10-byte key-pointer pair for each block in our index. How many of these pairs fit into a single block?

Blocking factor of the index file = floored(5000/10) = 500

... so this means that 500 key-pointer pairs fit into a single block. And we need to store 161291 of these, so the index file will take up ceiled(161291/500) = 323 blocks.

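The index sizing can be sketched the same way in Python (again using the hypothetical sizes from this example):

    import math

    block_size = 5000      # bytes per block, same imaginary hardware
    data_blocks = 161291   # data blocks computed in the previous step

    index_entry_size = 4 + 6                                        # pointer (4 B) + key (6 B)
    index_blocking_factor = block_size // index_entry_size          # floored(5000/10) = 500
    index_blocks = math.ceil(data_blocks / index_blocking_factor)   # ceiled(161291/500) = 323
    print(index_blocking_factor, index_blocks)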

The index file is ordered by key, so we can do binary search in it to find the pointer to the block which contains the record. Doing binary search in the index file costs at most ceiled(log2(323)) = 9 disk accesses. We also need +1 disk access to actually read the data block which the index record points to.

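And the lookup cost, measured in disk accesses, can be sketched as follows (continuing with the 323-block index from above):

    import math

    index_blocks = 323                                         # index blocks from the previous step
    binary_search_cost = math.ceil(math.log2(index_blocks))    # ceiled(log2(323)) = 9
    lookup_cost = binary_search_cost + 1                       # +1 to read the data block itself
    print(lookup_cost)                                         # 10 disk accesses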

Wow, we got our lookup to work in 10 disk accesses. That's pretty awesome. We could even do better. :)

Okay, so you can see that the blocking factor is heavily used, for example in this calculation.
