Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/173726/

When and why are database joins expensive?

Tags: database, performance, join, relational-database, denormalization

Asked by Rik

I'm doing some research into databases and I'm looking at some limitations of relational DBs.

I gather that joins of large tables are very expensive, but I'm not completely sure why. What does the DBMS need to do to execute a join operation, and where is the bottleneck?
How can denormalization help to overcome this expense? How do other optimization techniques (indexing, for example) help?

Personal experiences are welcome! If you're going to post links to resources, please avoid Wikipedia. I know where to find that already.

In relation to this, I'm wondering about the denormalized approaches used by cloud service databases like BigTable and SimpleDB. See this question.

Answered by Peter Wone

Denormalising to improve performance? It sounds convincing, but it doesn't hold water.

Chris Date, who in company with Dr Ted Codd was the original proponent of the relational data model, ran out of patience with misinformed arguments against normalisation and systematically demolished them using scientific method: he got large databases and tested these assertions.

I think he wrote it up in Relational Database Writings 1988-1991, but this book was later rolled into edition six of Introduction to Database Systems, which is the definitive text on database theory and design, in its eighth edition as I write and likely to remain in print for decades to come. Chris Date was an expert in this field when most of us were still running around barefoot.

He found that:

  • Some of them hold for special cases
  • All of them fail to pay off for general use
  • All of them are significantly worse for other special cases

It all comes back to mitigating the size of the working set. Joins involving properly selected keys with correctly set up indexes are cheap, not expensive, because they allow significant pruning of the result before the rows are materialised.

Materialising the result involves bulk disk reads which are the most expensive aspect of the exercise by an order of magnitude. Performing a join, by contrast, logically requires retrieval of only the keys. In practice, not even the key values are fetched: the key hash values are used for join comparisons, mitigating the cost of multi-column joins and radically reducing the cost of joins involving string comparisons. Not only will vastly more fit in cache, there's a lot less disk reading to do.
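
A minimal sketch of that pruning, with invented table and column names (not from the original answer): given indexes on the join keys, the engine can match and discard rows via the indexes alone, fetching only the few surviving rows from disk.

    SELECT c.name, o.total
    FROM   customers c
    JOIN   orders    o ON o.customer_id = c.id  -- matched via the index, before row fetch
    WHERE  c.country = 'NZ';                    -- restrictive predicate applied early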

Moreover, a good optimiser will choose the most restrictive condition and apply it before it performs a join, very effectively leveraging the high selectivity of joins on indexes with high cardinality.

Admittedly this type of optimisation can also be applied to denormalised databases, but the sort of people who want to denormalise a schema typically don't think about cardinality when (if) they set up indexes.

It is important to understand that table scans (examination of every row in a table in the course of producing a join) are rare in practice. A query optimiser will choose a table scan only when one or more of the following holds.

  • There are fewer than 200 rows in the relation (in this case a scan will be cheaper)
  • There are no suitable indexes on the join columns (if it's meaningful to join on these columns, then why aren't they indexed? Fix it - see the sketch after this list)
  • A type coercion is required before the columns can be compared (WTF?! Fix it or go home - see the end notes for an ADO.NET issue)
  • One of the arguments of the comparison is an expression (no index)
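
The fixes for the second and third cases are usually one statement each. A hedged sketch with invented names (the ALTER COLUMN syntax shown is SQL Server's; other engines differ):

    -- no suitable index on the join column: create one
    CREATE INDEX ix_orders_customer_id ON orders (customer_id);

    -- type coercion needed: make the column types match instead
    -- (e.g. if orders.customer_id was declared VARCHAR while customers.id is INT)
    ALTER TABLE orders ALTER COLUMN customer_id INT;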

Performing an operation is more expensive than not performing it. However, performing the wrong operation, being forced into pointless disk I/O and then discarding the dross prior to performing the join you really need, is much more expensive. Even when the "wrong" operation is precomputed and indexes have been sensibly applied, there remains significant penalty. Denormalising to precompute a join - notwithstanding the update anomalies entailed - is a commitment to a particular join. If you need a different join, that commitment is going to cost you big.

If anyone wants to remind me that it's a changing world, I think you'll find that bigger datasets on gruntier hardware just exaggerates the spread of Date's findings.

For all of you who work on billing systems or junk mail generators (shame on you) and are indignantly setting hand to keyboard to tell me that you know for a fact that denormalisation is faster, sorry but you're living in one of the special cases - specifically, the case where you process all of the data, in order. It's not a general case, and you are justified in your strategy.

You are not justified in falsely generalising it. See the end of the notes section for more information on appropriate use of denormalisation in data warehousing scenarios.

I'd also like to respond to

Joins are just cartesian products with some lipgloss

What a load of bollocks. Restrictions are applied as early as possible, most restrictive first. You've read the theory, but you haven't understood it. Joins are treated as "cartesian products to which predicates apply" only by the query optimiser. This is a symbolic representation (a normalisation, in fact) to facilitate symbolic decomposition so the optimiser can produce all the equivalent transformations and rank them by cost and selectivity so that it can select the best query plan.

The only way you will ever get the optimiser to produce a cartesian product is to fail to supply a predicate: SELECT * FROM A,B
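
To make the contrast concrete (tables A and B as in the snippet above; the a_id column is invented for illustration):

    SELECT * FROM A, B;                        -- no predicate: cartesian product, |A| x |B| rows
    SELECT * FROM A JOIN B ON B.a_id = A.id;   -- predicate supplied: only matching rows survive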



Notes



David Aldridge provides some important additional information.

There is indeed a variety of other strategies besides indexes and table scans, and a modern optimiser will cost them all before producing an execution plan.

A practical piece of advice: if it can be used as a foreign key then index it, so that an index strategy is available to the optimiser.

I used to be smarter than the MSSQL optimiser. That changed two versions ago. Now it generally teaches me. It is, in a very real sense, an expert system, codifying all the wisdom of many very clever people in a domain sufficiently closed that a rule-based system is effective.



"Bollocks" may have been tactless. I am asked to be less haughty and reminded that math doesn't lie. This is true, but not all of the implications of mathematical models should necessarily be taken literally. Square roots of negative numbers are very handy if you carefully avoid examining their absurdity (pun there) and make damn sure you cancel them all out before you try to interpret your equation.

The reason that I responded so savagely was that the statement as worded says that

Joins are cartesian products...

This may not be what was meant but it is what was written, and it's categorically untrue. A cartesian product is a relation. A join is a function. More specifically, a join is a relation-valued function. With an empty predicate it will produce a cartesian product, and checking that it does so is one correctness check for a database query engine, but nobody writes unconstrained joins in practice because they have no practical value outside a classroom.

I called this out because I don't want readers falling into the ancient trap of confusing the model with the thing modelled. A model is an approximation, deliberately simplified for convenient manipulation.



The cut-off for selection of a table-scan join strategy may vary between database engines. It is affected by a number of implementation decisions such as tree-node fill-factor, key-value size and subtleties of algorithm, but broadly speaking high-performance indexing has an execution time of k log n + c. The c term is a fixed overhead mostly made of setup time, and the shape of the curve means you don't get a payoff (compared to a linear search) until n is in the hundreds.
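
To see where that crossover lands, take some made-up constants (they are illustrative, not measured): say k = 10 comparisons per tree level and c = 100 units of setup, against a linear scan costing roughly n:

    index:  10 * log2(n) + 100        scan:  n
    n = 100:  10 * 6.6 + 100 ~= 166   vs 100   -- the scan is cheaper
    n = 500:  10 * 9.0 + 100 ~= 190   vs 500   -- the index wins

With these constants the break-even point falls at n of roughly 180 - "in the hundreds", as stated.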



Sometimes denormalisation is a good idea

Denormalisation is a commitment to a particular join strategy. As mentioned earlier, this interferes with other join strategies. But if you have buckets of disk space, predictable patterns of access, and a tendency to process much or all of it, then precomputing a join can be very worthwhile.

You can also figure out the access paths your operation typically uses and precompute all the joins for those access paths. This is the premise behind data warehouses, or at least it is when they're built by people who know why they're doing what they're doing, and not just for the sake of buzzword compliance.

A properly designed data warehouse is produced periodically by a bulk transformation out of a normalised transaction processing system. This separation of the operations and reporting databases has the very desirable effect of eliminating the clash between OLTP and OLAP (online transaction processing ie data entry, and online analytical processing ie reporting).

An important point here is that apart from the periodic updates, the data warehouse is read only. This renders moot the question of update anomalies.

Don't make the mistake of denormalising your OLTP database (the database on which data entry happens). It might be faster for billing runs but if you do that you will get update anomalies. Ever tried to get Reader's Digest to stop sending you stuff?
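
A sketch of the anomaly with invented tables: if the shipping address is copied onto every order row, a change of address has to touch all of them, and any row the update misses is silently wrong.

    -- denormalised: the address is repeated on every order row
    UPDATE orders    SET ship_address = '12 New St' WHERE customer_id = 42;

    -- normalised: one row changes, and every order join sees the new address
    UPDATE customers SET address      = '12 New St' WHERE id = 42;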

Disk space is cheap these days, so knock yourself out. But denormalising is only part of the story for data warehouses. Much bigger performance gains are derived from precomputed rolled-up values: monthly totals, that sort of thing. It's always about reducing the working set.
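
A hedged sketch of such a rollup, rebuilt during the periodic warehouse load (schema invented; CREATE TABLE ... AS is Oracle/PostgreSQL syntax - SQL Server spells it SELECT ... INTO):

    CREATE TABLE monthly_sales AS
    SELECT customer_id,
           EXTRACT(YEAR  FROM order_date) AS yr,
           EXTRACT(MONTH FROM order_date) AS mo,
           SUM(total)                     AS month_total
    FROM   orders
    GROUP  BY customer_id,
              EXTRACT(YEAR  FROM order_date),
              EXTRACT(MONTH FROM order_date);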



ADO.NET problem with type mismatches

Suppose you have a SQL Server table containing an indexed column of type varchar, and you use AddWithValue to pass a parameter constraining a query on this column. C# strings are Unicode, so the inferred parameter type will be NVARCHAR, which doesn't match VARCHAR.

VARCHAR to NVARCHAR is a widening conversion so it happens implicitly - but say goodbye to indexing, and good luck working out why.
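
The effect can be reproduced in T-SQL alone (table and column invented for illustration): the NVARCHAR parameter pushes the conversion onto the indexed VARCHAR column, which typically turns an index seek into a scan.

    DECLARE @p NVARCHAR(100) = N'rik@example.com';
    SELECT id FROM customers WHERE email = @p;   -- column converted to NVARCHAR: seek lost

    DECLARE @q VARCHAR(100) = 'rik@example.com';
    SELECT id FROM customers WHERE email = @q;   -- types match: index seek

On the ADO.NET side, the cure is to declare the parameter type explicitly (e.g. SqlDbType.VarChar) rather than letting AddWithValue infer NVARCHAR from the C# string.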



"Count the disk hits" (Rick James)

If everything is cached in RAM, JOINs are rather cheap. That is, normalization does not have much performance penalty.

If a "normalized" schema causes JOINsto hit the disk a lot, but the equivalent "denormalized" schema would not have to hit the disk, then denormalization wins a performance competition.

Comment from original author: Modern database engines are very good at organising access sequencing to minimise cache misses during join operations. The above, while true, might be misconstrued as implying that joins are necessarily problematically expensive on large data. That would lead to poor decision-making on the part of inexperienced developers.

Answered by David Aldridge

What most commenters fail to note is the wide range of join methodologies available in a complex RDBMS, and the denormalisers invariably gloss over the higher cost of maintaining denormalised data. Not every join is based on indexes, and databases have a lot of optimised algorithms and methodologies for joining that are intended to reduce join costs.

In any case, the cost of a join depends on its type and a few other factors. It needn't be expensive at all - some examples follow.

  • A hash join, in which bulk data is equijoined, is very cheap indeed, and the cost only becomes significant if the hash table cannot be cached in memory. No index required. Equi-partitioning between the joined data sets can be a great help.
  • The cost of a sort-merge join is driven by the cost of the sort rather than the merge -- an index-based access method can virtually eliminate the cost of the sort.
  • The cost of a nested loop join on an index is driven by the height of the b-tree index and the access of the table block itself. It's fast, but not suitable for bulk joins.
  • A nested loop join based on a cluster is much cheaper, with fewer logical IOs required per join row - if the joined tables are both in the same cluster then the join becomes very cheap through the colocation of joined rows (a plan-inspection sketch follows this list).
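
The easiest way to see which of these methods the optimiser actually picked is to ask for the plan. An Oracle-flavoured sketch with invented tables (other engines spell this EXPLAIN or SET SHOWPLAN_TEXT):

    EXPLAIN PLAN FOR
    SELECT o.total, c.name
    FROM   orders o JOIN customers c ON c.id = o.customer_id;

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
    -- the plan output names the method chosen: HASH JOIN, MERGE JOIN, or NESTED LOOPS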

Databases are designed to join, and they're very flexible in how they do it and generally very performant unless they get the join mechanism wrong.

Answered by Joel Coehoorn

I think the whole question is based on a false premise. Joins on large tables are not necessarily expensive. In fact, doing joins efficiently is one of the main reasons relational databases exist at all. Joins on large sets often are expensive, but very rarely do you want to join the entire contents of large table A with the entire contents of large table B. Instead, you write the query such that only the important rows of each table are used and the actual set kept by the join remains smaller.

Additionally, you have the efficiencies mentioned by Peter Wone, such that only the important parts of each record need be in memory until the final result set is materialized. Also, in large queries with many joins you typically want to start with the smaller table sets and work your way up to the large ones, so that the set kept in memory remains as small as possible as long as possible.

When done properly, joins are generally the best way to compare, combine, or filter on large amounts of data.

Answered by Mark Brackett

The bottleneck is pretty much always disk I/O, and even more specifically - random disk I/O (by comparison, sequential reads are fairly fast and can be cached with read ahead strategies).

Joins can increase random seeks - if you're jumping around reading small parts of a large table. But, query optimizers look for that and will turn it into a sequential table scan (discarding the unneeded rows) if it thinks that'd be better.

A single denormalized table has a similar problem - the rows are large, and so fewer fit on a single data page. If you need rows that are located far from one another (and the large row size pushes them further apart) then you'll have more random I/O. Again, a table scan may be forced to avoid this. But, this time, your table scan has to read more data because of the large row size. Add to that the fact that you're copying data from a single location to multiple locations, and the RDBMS has that much more to read (and cache).

With 2 tables, you also get 2 clustered indexes - and can generally index more (because of less insert/update overhead) which can get you drastically increased performance (mainly, again, because indexes are (relatively) small, quick to read off disk (or cheap to cache), and lessen the amount of table rows you need to read from disk).

About the only overhead with a join comes from figuring out the matching rows. Sql Server uses 3 different types of joins, mainly based on dataset sizes, to find matching rows. If the optimizer picks the wrong join type (due to inaccurate statistics, inadequate indexes, or just an optimizer bug or edge case) it can drastically affect query times.

  • A loop join is fairly cheap for (at least 1) small dataset.
  • A merge join requires a sort of both datasets first. If you join on an indexed column, though, then the index is already sorted and no further work needs to be done. Otherwise, there is some CPU and memory overhead in sorting.
  • The hash join requires both memory (to store the hashtable) and CPU (to build the hash). Again, this is fairly quick in relation to the disk I/O. However, if there's not enough RAM to store the hashtable, Sql Server will use tempdb to store parts of the hashtable and the found rows, and then process only parts of the hashtable at a time. As with all things disk, this is fairly slow. (A join-hint sketch follows this list.)
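
For experiments, SQL Server lets you force each strategy with join hints (tables invented; in production the optimizer's own choice is usually better):

    SELECT o.total, c.name
    FROM   customers c
    INNER HASH JOIN orders o ON o.customer_id = c.id;    -- force a hash join

    SELECT o.total, c.name
    FROM   customers c JOIN orders o ON o.customer_id = c.id
    OPTION (MERGE JOIN);                                 -- query-level hint: force a merge join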

In the optimal case, these cause no disk I/O - and so are negligible from a performance perspective.

All in all, at worst it should actually be faster to read the same amount of logical data from x joined tables than from a single denormalized table, because of the smaller disk reads. To read the same amount of physical data, there could be some slight overhead.

Since query time is usually dominated by I/O costs, and the size of your data does not change (minus some very minuscule row overhead) with denormalization, there's not a tremendous amount of benefit to be had by just merging tables together. The type of denormalization that tends to increase performance, IME, is caching calculated values instead of reading the 10,000 rows required to calculate them.

Answered by Ilya Kochetov

The order in which you're joining the tables is extremely important. If you have two sets of data, try to build the query so that the smaller set is used first, to reduce the amount of data the query has to work on.

For some databases it does not matter, for example MS SQL does know the proper join order most of the time. For some (like IBM Informix) the order makes all the difference.

Answered by MathGladiator

Deciding whether to denormalize or normalize is a fairly straightforward process when you consider the complexity class of the join. For instance, I tend to design my databases with normalization when the queries are O(k log n) where k is relative to the desired output magnitude.

An easy way to denormalize and optimize performance is to think about how changes to your normalized structure affect your denormalized structure. It can be problematic, however, as it may require transactional logic to work on a denormalized structure.

The debate for normalization and denormalization isn't going to end since the problems are vast. There are many problems where the natural solution requires both approaches.

As a general rule, I've always stored a normalized structure and denormalized caches that can be reconstructed. Eventually, these caches save my ass to solve the future normalization problems.
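
One concrete form of such a reconstructible denormalised cache is a materialized view. A sketch in PostgreSQL/Oracle-style syntax, with an invented schema:

    CREATE MATERIALIZED VIEW customer_order_summary AS
    SELECT c.id, c.name,
           COUNT(o.id)  AS order_count,
           SUM(o.total) AS lifetime_total
    FROM   customers c LEFT JOIN orders o ON o.customer_id = c.id
    GROUP  BY c.id, c.name;

    REFRESH MATERIALIZED VIEW customer_order_summary;  -- PostgreSQL-style rebuild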

Answered by questzen

Elaborating on what others have said:

Joins are just cartesian products with some lipgloss. {1,2,3,4} × {1,2,3} would give us 12 combinations (4 × 3; in general m × n). This computed set acts as a reference to which conditions are applied. The DBMS applies the conditions (like where both left and right are 2 or 3) to give us the matching row(s). Actually it is more optimised, but the problem is the same. Changes to the size of the sets increase the result size multiplicatively, and the memory and CPU cycles consumed are affected in the same terms.

When we denormalise, we avoid this computation altogether; think of having a colored sticky attached to every page of your book. You can infer the information without using a reference. The penalty we pay is that we are compromising the essence of the DBMS (optimal organisation of data).
