database 非规范化以什么方式提高数据库性能?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/2349270/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-08 07:37:01  来源:igfitidea点击:

In what way does denormalization improve database performance?

databaseperformancedatabase-designnormalizationdenormalization

提问by Roman

I heard a lot about denormalization which was made to improve performance of certain application. But I've never tried to do anything related.

我听说了很多关于为了提高某些应用程序的性能而进行的非规范化。但我从未尝试过做任何相关的事情。

So, I'm just curious, which places in normalized DB makes performance worse or in other words, what are denormalization principles?

所以,我只是好奇,规范化数据库中的哪些地方会使性能变差,或者换句话说,什么是非规范化原则?

How can I use this technique if I need to improve performance?

如果我需要提高性能,我该如何使用这种技术?

回答by Pascal MARTIN

Denormalization is generally used to either:

非规范化通常用于:

  • Avoid a certain number of queries
  • Remove some joins
  • 避免一定数量的查询
  • 删除一些连接

The basic idea of denormalization is that you'll add redundant data, or group some, to be able to get those data more easily -- at a smaller cost; which is better for performances.

非规范化的基本思想是,您将添加冗余数据或将一些数据分组,以便能够更轻松地获取这些数据——以更低的成本;这更适合表演。


A quick examples?


一个简单的例子?

  • Consider a "Posts" and a "Comments" table, for a blog
    • For each Post, you'll have several lines in the "Comment" table
    • This means that to display a list of posts with the associated number of comments, you'll have to:
      • Do one query to list the posts
      • Do one query per post to count how many comments it has (Yes, those can be merged into only one, to get the number for all posts at once)
      • Which means several queries.
  • Now, if you add a "number of comments" field into the Posts table:
    • You only need one query to list the posts
    • And no need to query the Comments table: the number of comments are already de-normalized to the Posts table.
    • And only one query that returns one more field is better than more queries.
  • 考虑一个博客的“帖子”和“评论”表
    • 对于每个帖子,您将在“评论”表中有几行
    • 这意味着要显示具有相关评论数量的帖子列表,您必须:
      • 做一个查询以列出帖子
      • 对每个帖子执行一次查询以计算它有多少评论(是的,这些评论只能合并为一个,以一次获取所有帖子的数量)
      • 这意味着几个查询。
  • 现在,如果您在 Posts 表中添加一个“评论数量”字段:
    • 您只需要一个查询即可列出帖子
    • 并且不需要查询 Comments 表:评论的数量已经非规范化到 Posts 表了。
    • 并且只有一个多返回一个字段的查询比更多的查询要好。

Now, there are some costs, yes:

现在,有一些成本,是的:

  • First, this costs some place on both disk and in memory, as you have some redundant informations:
    • The number of comments are stored in the Posts table
    • And you can also find those number counting on the Comments table
  • Second, each time someone adds/removes a comment, you have to:
    • Save/delete the comment, of course
    • But also, update the corresponding number in the Posts table.
    • But, if your blog has a lot more people reading than writing comments, this is probably not so bad.
  • 首先,这会在磁盘和内存中占用一些空间,因为您有一些冗余信息:
    • 评论数存储在 Posts 表中
    • 您还可以在“评论”表中找到这些数字
  • 其次,每次有人添加/删除评论时,您必须:
    • 保存/删除评论,当然
    • 而且,更新 Posts 表中的相应数字。
    • 但是,如果您的博客阅读的人数多于撰写评论的人数,这可能还不错。

回答by ewernli

Denormalization is a time-spacetrade-off. Normalized data takes less space, but may require join to construct the desired result set, hence more time. If it's denormalized, data are replicated in several places. It then takes more space, but the desired view of the data is readily available.

非规范化是一种时空权衡。规范化数据占用较少空间,但可能需要连接以构建所需的结果集,因此需要更多时间。如果它是非规范化的,数据会在几个地方复制。然后它会占用更多空间,但可以轻松获得所需的数据视图。

There are other time-space optimizations, such as

还有其他时空优化,例如

  • denormalized view
  • precomputed columns
  • 非规范化视图
  • 预计算列

As with any of such approach, this improves reading data(because they are readily available), but updating databecomes more costly (because you need to update the replicated or precomputed data).

与任何此类方法一样,这改进了读取数据(因为它们随时可用),但更新数据变得更加昂贵(因为您需要更新复制或预先计算的数据)。

回答by Walter Mitty

The word "denormalizing" leads to confusion of the design issues. Trying to get a high performance database by denormalizing is like trying to get to your destination by driving away from New York. It doesn't tell you which way to go.

“非规范化”一词会导致设计问题的混淆。试图通过非规范化获得高性能数据库就像试图驾车离开纽约到达目的地。它不会告诉你该走哪条路。

What you need is a good design discipline, one that produces a simple and sound design, even if that design sometimes conflicts with the rules of normalization.

你需要的是一个良好的设计纪律,一个能产生简单而合理的设计,即使这种设计有时会与规范化规则相冲突。

One such design discipline is star schema. In a star schema, a single fact table serves as the hub of a star of tables. The other tables are called dimension tables, and they are at the rim of the schema. The dimensions are connected to the fact table by relationships that look like the spokes of a wheel. Star schema is basically a way of projecting multidimensional design onto an SQL implementation.

一种这样的设计原则是星型模式。在星型模式中,单个事实表充当星型表的中心。其他表称为维度表,它们位于模式的边缘。维度通过看起来像轮辐的关系连接到事实表。星型模式基本上是一种将多维设计投影到 SQL 实现上的方法。

Closely related to star schema is snowflake schema, which is a little more complicated.

与星型模式密切相关的是雪花模式,它稍微复杂一些。

If you have a good star schema, you will be able to get a huge variety of combinations of your data with no more than a three way join, involving two dimensions and one fact table. Not only that, but many OLAP tools will be able to decipher your star design automatically, and give you point-and-click, drill down, and graphical analysis access to your data with no further programming.

如果您有一个良好的星型模式,您将能够通过不超过三向连接的方式获得各种数据组合,涉及两个维度和一个事实表。不仅如此,许多 OLAP 工具将能够自动解读您的星型设计,并让您无需进一步编程即可通过点击、向下钻取和图形分析访问您的数据。

Star schema design occasionally violates second and third normal forms, but it results in more speed and flexibility for reports and extracts. It's most often used in data warehouses, data marts, and reporting databases. You'll generally have much better results from star schema or some other retrieval oriented design, than from just haphazard "denormalization".

星型模式设计偶尔会违反第二和第三范式,但它会提高报告和摘录的速度和灵活性。它最常用于数据仓库、数据集市和报告数据库。星型模式或其他一些面向检索的设计通常会比偶然的“非规范化”获得更好的结果。

回答by HLGEM

The critical issues in denormalizing are:

非规范化的关键问题是:

  • Deciding what data to duplicate and why
  • Planning how to keep the data in synch
  • Refactoring the queries to use the denormalized fields.
  • 决定复制哪些数据以及为什么
  • 规划如何保持数据同步
  • 重构查询以使用非规范化字段。

One of the easiest types of denormalizing is to populate an identity field to tables to avoid a join. As identities should not ever change, this means the issue of keeping the data in sync rarely comes up. For instance, we populate our client id to several tables because we often need to query them by client and do not necessarily need, in the queries, any of the data in the tables that would be between the client table and the table we are querying if the data was totally normalized. You still have to do one join to get the client name, but that is better than joining to 6 parent tables to get the client name when that is the only piece of data you need from outside the table you are querying.

最简单的非规范化类型之一是将标识字段填充到表中以避免连接。由于身份永远不会改变,这意味着保持数据同步的问题很少出现。例如,我们将我们的客户 id 填充到几个表中,因为我们经常需要通过客户来查询它们,并且在查询中不一定需要在客户表和我们正在查询的表之间的表中的任何数据如果数据完全标准化。您仍然需要执行一次连接来获取客户端名称,但这比连接到 6 个父表来获取客户端名称要好,因为这是您需要从正在查询的表外部获取的唯一数据。

However, there would be no benefit to this unless we were often doing queries where data from the intervening tables was needed.

但是,除非我们经常在需要来自中间表的数据的地方进行查询,否则这样做没有任何好处。

Another common denormalization might be to add a name field to other tables. As names are inherently changeable, you need to ensure that the names stay in synch with triggers. But if this saves you from joining to 5 tables instead of 2, it can be worth the cost of the slightly longer insert or update.

另一种常见的非规范化可能是向其他表添加名称字段。由于名称本质上是可变的,因此您需要确保名称与触发器保持同步。但是,如果这样可以避免加入 5 个表而不是 2 个表,那么花费稍长插入或更新的成本是值得的。

回答by marc_s

If you have certain requirement, like reporting etc., it can help to denormalize your database in various ways:

如果您有某些要求,例如报告等,它可以帮助您以各种方式对数据库进行非规范化:

  • introduce certain data duplication to save yourself some JOINs (e.g. fill certain information into a table and be ok with duplicated data, so that all the data in that table and doesn't need to be found by joining another table)

  • you can pre-compute certain values and store them in a table column, insteda of computing them on the fly, everytime to query the database. Of course, those computed values might get "stale" over time and you might need to re-compute them at some point, but just reading out a fixed value is typically cheaper than computing something (e.g. counting child rows)

  • 引入某些数据重复以节省一些JOIN(例如将某些信息填充到一个表中并可以处理重复的数据,以便该表中的所有数据不需要通过加入另一个表来找到)

  • 您可以预先计算某些值并将它们存储在表列中,从而避免在每次查询数据库时动态计算它们。当然,这些计算出的值可能会随着时间的推移变得“陈旧”,您可能需要在某个时候重新计算它们,但是仅仅读出一个固定值通常比计算某些东西(例如计算子行)便宜

There are certainly more ways to denormalize a database schema to improve performance, but you just need to be aware that you do get yourself into a certain degree of trouble doing so. You need to carefully weigh the pros and cons - the performance benefits vs. the problems you get yourself into - when making those decisions.

当然有更多的方法可以对数据库模式进行非规范化以提高性能,但您只需要意识到这样做确实会给自己带来一定程度的麻烦。在做出这些决定时,您需要仔细权衡利弊 - 性能优势与您遇到的问题。

回答by S.Lott

Consider a database with a properly normalized parent-child relationship.

考虑一个具有正确规范化的父子关系的数据库。

Let's say the cardinality is an average of 2x1.

假设基数的平均值为 2x1。

You have two tables, Parent, with prows. Child with 2x prows.

您有两个表,Parent,有p行。具有 2x p行的子项。

The join operation means for pparent rows, 2x pchild rows must be read. The total number of rows read is p+ 2x p.

连接操作意味着对于p 个父行,必须读取2x p个子行。读取的总行数为p+ 2x p

Consider denormalizing this into a single table with only the child rows, 2x p. The number of rows read is 2x p.

考虑将其非规范化为一个只有子行 2x p 的表。读取的行数是 2x p

Fewer rows == less physical I/O == faster.

更少的行 == 更少的物理 I/O == 更快。

回答by RJCurrie

As per the last section of this article,

根据本文的最后一节,

https://technet.microsoft.com/en-us/library/aa224786%28v=sql.80%29.aspx

https://technet.microsoft.com/en-us/library/aa224786%28v=sql.80%29.aspx

one could use Virtual Denormalization, where you create Views with some denormalized data for running more simplistic SQL queries faster, while the underlying Tables remain normalized for faster add/update operations (so long as you can get away with updating the Views at regular intervals rather than in real-time). I'm just taking a class on Relational Databases myself but, from what I've been reading, this approach seems logical to me.

可以使用 Virtual Denormalization,您可以使用一些非规范化数据创建视图以更快地运行更简单的 SQL 查询,而底层表保持规范化以加快添加/更新操作(只要您可以定期更新视图而不是比实时)。我自己只是在上一门关于关系数据库的课程,但是,从我所读的内容来看,这种方法对我来说似乎是合乎逻辑的。

回答by Kirtish Srivastava

Benefits of de-normalization over normalization

去规范化相对于规范化的好处

Basically de-normalization is used for DBMS not for RDBMS. As we know that RDBMS works with normalization, which means no repeat data again and again. But still repeat some data when you use foreign key.

基本上去规范化用于 DBMS 而不是 RDBMS。正如我们所知,RDBMS 与规范化一起工作,这意味着不会一次又一次地重复数据。但是在使用外键的时候还是会重复一些数据。

When you use DBMS then there is a need to remove normalization. For this, there is a need for repetition. But still, it improves performance because there is no relation among the tables and each table has indivisible existence.

当您使用 DBMS 时,需要删除规范化。为此,需要重复。但是,它仍然提高了性能,因为表之间没有关系,每个表都不可分割地存在。