database 数据库如何在内部工作?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/172925/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-08 07:00:00  来源:igfitidea点击:

How do databases work internally?

databasereferenceinternals

提问by Bonnici

I've been working with databases for the last few years and I'd like to think that I've gotten fairly competent with using them. However I was reading recently about Joel's Law of Leaky Abstractionsand I realised that even though I can write a query to get pretty much anything I want out of a database, I have no idea how the database actually interprets the query. Does anyone know of any good articles or books that explain how databases work internally?

过去几年我一直在使用数据库,我想我已经相当有能力使用它们了。然而,我最近正在阅读有关乔尔抽象抽象定律的内容,我意识到即使我可以编写查询来从数据库中获取几乎任何我想要的东西,但我不知道数据库实际上是如何解释查询的。有没有人知道任何解释数据库内部工作原理的好文章或书籍?

Some specific things I'm interested in are:

我感兴趣的一些具体事情是:

  • What does a database actually do to find out what matches a select statement?
  • How does a database interpret a join differently to a query with several "where key1 = key2" statements?
  • How does the database store all its memory?
  • How are indexes stored?
  • 数据库实际上做了什么来找出与 select 语句匹配的内容?
  • 数据库如何以不同的方式解释连接与具有多个“where key1 = key2”语句的查询?
  • 数据库如何存储其所有内存?
  • 索引是如何存储的?

回答by Will Hartung

What does a database actually do to find out what matches a select statement?

数据库实际上做了什么来找出与 select 语句匹配的内容?

To be blunt, it's a matter of brute force. Simply, it reads through each candidate record in the database and matches the expression to the fields. So, if you have "select * from table where name = 'fred'", it literally runs through each record, grabs the "name" field, and compares it to 'fred'.

坦率地说,这是一个蛮力问题。简单地说,它读取数据库中的每个候选记录并将表达式与字段匹配。因此,如果您有“select * from table where name = 'fred'”,它会逐条遍历每条记录,获取“name”字段,并将其与 'fred' 进行比较。

Now, if the "table.name" field is indexed, then the database will (likely, but not necessarily) use the index first to locate the candidate records to apply the actual filter to.

现在,如果“table.name”字段被索引,那么数据库将(可能,但不一定)首先使用索引来定位要应用实际过滤器的候选记录。

This reduces the number of candidate records to apply the expression to, otherwise it will just do what we call a "table scan", i.e. read every row.

这减少了要应用表达式的候选记录的数量,否则它只会执行我们所谓的“表扫描”,即读取每一行。

But fundamentally, however it locates the candidate records is separate from how it applies the actual filter expression, and, obviously, there are some clever optimizations that can be done.

但从根本上说,无论它如何定位候选记录,都与它如何应用实际过滤器表达式是分开的,显然,可以进行一些巧妙的优化。

How does a database interpret a join differently to a query with several "where key1 = key2" statements?

数据库如何以不同的方式解释连接与具有多个“where key1 = key2”语句的查询?

Well, a join is used to make a new "pseudo table", upon which the filter is applied. So, you have the filter criteria and the join criteria. The join criteria is used to build this "pseudo table" and then the filter is applied against that. Now, when interpreting the join, it's again the same issue as the filter -- brute force comparisons and index reads to build the subset for the "pseudo table".

好吧,连接用于创建一个新的“伪表”,在其上应用过滤器。因此,您有过滤条件和连接条件。连接标准用于构建此“伪表”,然后对其应用过滤器。现在,在解释连接时,它又是与过滤器相同的问题——蛮力比较和索引读取来构建“伪表”的子集。

How does the database store all its memory?

数据库如何存储其所有内存?

One of the keys to good database is how it manages its I/O buffers. But it basically matches RAM blocks to disk blocks. With the modern virtual memory managers, a simpler database can almost rely on the VM as its memory buffer manager. The high end DB'S do all this themselves.

良好数据库的关键之一是它如何管理其 I/O 缓冲区。但它基本上将 RAM 块与磁盘块匹配。使用现代虚拟内存管理器,更简单的数据库几乎可以依赖 VM 作为其内存缓冲区管理器。高端数据库自己做这一切。

How are indexes stored?

索引是如何存储的?

B+Trees typically, you should look it up. It's a straight forward technique that has been around for years. It's benefit is shared with most any balanced tree: consistent access to the nodes, plus all the leaf nodes are linked so you can easily traverse from node to node in key order. So, with an index, the rows can be considered "sorted" for specific fields in the database, and the database can leverage that information to it benefit for optimizations. This is distinct from, say, using a hash table for an index, which only lets you get to a specific record quickly. In a B-Tree you can quickly get not just to a specific record, but to a point within a sorted list.

B+Trees 通常,您应该查找它。这是一种已经存在多年的直接技术。它的好处与大多数平衡树共享:对节点的一致访问,加上所有叶节点都是链接的,因此您可以轻松地按关键顺序从一个节点遍历到另一个节点。因此,通过索引,可以将行视为针对数据库中的特定字段“排序”,并且数据库可以利用该信息来为优化带来好处。这与使用哈希表作为索引不同,哈希表只能让您快速访问特定记录。在 B 树中,您不仅可以快速获取特定记录,还可以快速获取排序列表中的某个点。

The actual mechanics of storing and indexing rows in the database are really pretty straight forward and well understood. The game is managing buffers, and converting SQL in to efficient query paths to leverage these basic storage idioms.

在数据库中存储和索引行的实际机制非常简单易懂。游戏正在管理缓冲区,并将 SQL 转换为有效的查询路径以利用这些基本存储习惯用法。

Then, there's the whole multi-users, locking, logging, and transactions complexity on top of the storage idiom.

然后,在存储习惯用法之上还有整个多用户、锁定、日志记录和事务复杂性。

回答by Peter Parker

  • What does a database actually do to find out what matches a select statement?

    DBs are using indexes(see below)

  • How does a database interpret a join differently to a query with several "where key1 = key2" statements? Join Operations can be translated to binary tree operations by merging trees.

  • How does the database store all its memory?

    memorymapped filesfor faster access of their data

  • How are indexes stored?

    Internally DBs are working with B-Treesfor indexing.

  • 数据库实际上做了什么来找出与 select 语句匹配的内容?

    数据库正在使用索引(见下文)

  • 数据库如何以不同的方式解释连接与具有多个“where key1 = key2”语句的查询?连接操作可以通过合并树转换为二叉树操作。

  • 数据库如何存储其所有内存?

    内存映射文件以更快地访问其数据

  • 索引是如何存储的?

    内部数据库正在使用B 树进行索引。

This should be explained in greater details on wikipedia..

这应该在维基百科上更详细地解释..

http://en.wikipedia.org/wiki/B-tree

http://en.wikipedia.org/wiki/B-tree

http://en.wikipedia.org/wiki/Database

http://en.wikipedia.org/wiki/Database

回答by Turnkey

In addition to reading, it can be instructive to use the DB tools to examine the execution plan that the database uses on your queries. In addition to getting insight into how it is working, you can experiment with techniques to optimize the queries with a better feedback loop.

除了阅读之外,使用数据库工具检查数据库在查询中使用的执行计划也很有启发性。除了深入了解它的工作原理之外,您还可以尝试使用更好的反馈循环来优化查询的技术。

回答by dajobe

If you want to know more in detail, I'd recommend getting the sqlite sources and having a look at how it does it. It's complete, albeit not at the scale of the larger open source and commercial databases. If you want to know more in detail I recommend The Definitive Guide to SQLitewhich is not only a great explanation of sqlite, but also one of the most readable technical books I know. On the MySQL side, you could learn from MySQL Performance Blogas well as on the book front the O'Reilly High Performance MySQL(V2) of which the blog is one of the authors.

如果您想更详细地了解,我建议您获取 sqlite 源代码并查看它是如何工作的。它是完整的,尽管没有大型开源和商业数据库的规模。如果你想更详细地了解,我推荐SQLite 权威指南,它不仅是对 sqlite 的一个很好的解释,也是我所知道的最易读的技术书籍之一。在 MySQL 方面,您可以从MySQL Performance Blog以及本书前面的 O'Reilly High Performance MySQL(V2) 中学习,该博客是该博客的作者之一。

回答by SquareCog

Saif, excellent link. A bird's eye overview that manages to cover most topics, and provide details on specific vendor implementations.

赛义夫,很好的链接。鸟瞰概览,涵盖了大多数主题,并提供了有关特定供应商实施的详细信息。

I made three tries at writing an explanation, but this is really too big a topic. Check out the Hellerstein article (the one on the berkeley server that Saif linked to), and then ask about specifics.

我尝试写了三个解释,但这真的是一个太大的话题。查看 Hellerstein 文章(Saif 链接到的伯克利服务器上的文章),然后询问细节。

It's worth noting that only a subset of "known good ideas" is implemented in any given DBMS. For example, SQLite doesn't even do hash joins, it only does nested loops (ack!!). But then, it's an easily embeddable dbms, and it does its work very well, so there's something to be said for the lack of complexity.

值得注意的是,在任何给定的 DBMS 中都只实现了“已知的好主意”的一个子集。例如,SQLite 甚至不执行哈希连接,它只执行嵌套循环(确认!!)。但是,它是一个易于嵌入的 dbms,并且它的工作非常好,因此对于缺乏复杂性有一些话要说。

Learning about how a DBMS gathers statistics and how it uses them to construct query plans, as well as learning how to read the query plans in the first place, is an invaluable skill -- if you have to choose one "database internals" topic to learn, learn this. It will make a world of difference (and you will never accidentally write a Cartesian product again... ;-)).

了解 DBMS 如何收集统计信息以及它如何使用它们来构建查询计划,以及首先学习如何阅读查询计划,这是一项非常宝贵的技能——如果您必须选择一个“数据库内部”主题来学习,学习这个。它将使世界变得不同(并且您永远不会意外地再次编写笛卡尔积...... ;-))。