SQL: Why are joins bad when considering scalability?
Declaration: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license, link to the original question, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/2623852/
Why are joins bad when considering scalability?
Asked by Joel Coehoorn
Why are joins bad or 'slow'? I know I've heard this more than once. I found this quote:
The problem is joins are relatively slow, especially over very large data sets, and if they are slow your website is slow. It takes a long time to get all those separate bits of information off disk and put them all together again.
I always thought they were fast, especially when looking up a PK. Why are they 'slow'?
Answered by Joel Coehoorn
Scalability is all about pre-computing, spreading out, or paring down the repeated work to the bare essentials, in order to minimize resource use per work unit. To scale well, you don't do anything you don't need to in volume, and you make sure the things you actually do are done as efficiently as possible.
In that context, of course joining two separate data sources is relatively slow, at least compared to not joining them, because it's work you need to do live at the point where the user requests it.
But remember the alternative is no longer having two separate pieces of data at all; you have to put the two disparate data points in the same record. You can't combine two different pieces of data without a consequence somewhere, so make sure you understand the trade-off.
The good news is modern relational databases are good at joins. You shouldn't really think of joins as slow with a good database used well. The database provides a number of scalability-friendly ways to take raw joins and make them much faster (a couple of these are sketched after the list):
- Join on a surrogate key (autonumber/identity column) rather than a natural key. This means smaller (and therefore faster) comparisons during the join operation
- Indexes
- Materialized/indexed views (think of this as a pre-computed join or managed de-normalization)
- Computed columns. You can use this to hash or otherwise pre-compute the key columns of a join, such that what would be a complicated comparison for a join is now much smaller and potentially pre-indexed.
- Table partitions (helps with large data sets by spreading the load out to multiple disks, or limiting what might have been a table scan down to a partition scan)
- OLAP (pre-computes results of certain kinds of queries/joins. It's not quite true, but you can think of this as generic denormalization)
- Replication, Availability Groups, Log shipping, or other mechanisms to let multiple servers answer read queries for the same database, and thus scale your workload out among several servers.
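As a minimal sketch of the indexed-view idea, in SQL Server syntax and assuming hypothetical dbo.Customers and dbo.Orders tables (with an integer surrogate key and a non-nullable TotalDue column):

CREATE INDEX IX_Orders_CustomerID ON dbo.Orders (CustomerID); -- index the join key

CREATE VIEW dbo.CustomerOrderTotals
WITH SCHEMABINDING -- required before the view can be indexed
AS
SELECT c.CustomerID,
       COUNT_BIG(*) AS OrderCount,   -- COUNT_BIG(*) is required alongside GROUP BY
       SUM(o.TotalDue) AS TotalSpent -- assumes TotalDue is NOT NULL
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerID;

-- The unique clustered index is what materializes the join's results on disk:
CREATE UNIQUE CLUSTERED INDEX IX_CustomerOrderTotals ON dbo.CustomerOrderTotals (CustomerID);

Once that index exists, reads against the view return pre-joined rows; the cost moved to write time, which is exactly the trade-off described above.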
I would go as far as saying the main reason relational databases exist at all is to allow you to do joins efficiently*. It's certainly not just to store structured data (you could do that with flat file constructs like csv or xml). A few of the options I listed will even let you completely build your join in advance, so the results are already done before you issue the query, just as if you had denormalized the data (admittedly at the cost of slower write operations).
If you have a slow join, you're probably not using your database correctly.
De-normalization should be done only after these other techniques have failed. And the only way you can truly judge "failure" is to set meaningful performance goals and measure against those goals. If you haven't measured, it's too soon to even think about de-normalization.
* That is, exist as entities distinct from mere collections of tables. An additional reason for a real rdbms is safe concurrent access.
Answered by Tendayi Mawushe
Joins can be slower than avoiding them through de-normalisation, but if used correctly (joining on columns with appropriate indexes and so on) they are not inherently slow.
De-normalisation is one of many optimisation techniques you can consider if your well designed database schema exhibits performance problems.
Answered by Andrey
The article says that they are slow when compared to the absence of joins, which can be achieved with denormalization. So there is a trade-off between speed and normalization. Don't forget about premature optimization also :)
Answered by Ed Lucas
First of all, a relational database's raison d'etre (reason for being) is to be able to model relationships between entities. Joins are simply the mechanisms by which we traverse those relationships. They certainly do come at a nominal cost, but without joins, there really is no reason to have a relational database.
In the academic world we learn of things like the various normal forms (1st, 2nd, 3rd, Boyce-Codd, etc.), and we learn about different types of keys (primary, foreign, alternate, unique, etc.) and how these things fit together to design a database. And we learn the rudiments of SQL as well as manipulating both structure and data (DDL & DML).
In the corporate world, many of the academic constructs turn out to be substantially less viable than we had been led to believe. A perfect example is the notion of a primary key. Academically it is that attribute (or collection of attributes) that uniquely identifies one row in the table. So in many problem domains, the proper academic primary key is a composite of 3 or 4 attributes. However, almost everyone in the modern corporate world uses an auto-generated, sequential integer as a table's primary key. Why? Two reasons. The first is because it makes the model much cleaner when you're migrating FKs all over the place. The second, and most germane to this question, is that retrieving data through joins is faster and more efficient on a single integer than it is on 4 varchar columns (as already mentioned by a few folks).
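To make the comparison concrete, here is a sketch with hypothetical order-line tables (SQL Server-style IDENTITY standing in for the auto-generated integer):

-- Academic composite primary key: every FK and every join carries 3 columns.
CREATE TABLE order_line_natural (
    region_code  VARCHAR(10) NOT NULL,
    order_number VARCHAR(20) NOT NULL,
    line_number  INT NOT NULL,
    PRIMARY KEY (region_code, order_number, line_number)
);

-- Corporate-style surrogate key: joins compare one small integer instead.
CREATE TABLE order_line_surrogate (
    order_line_id INT IDENTITY(1,1) PRIMARY KEY,
    region_code   VARCHAR(10) NOT NULL,
    order_number  VARCHAR(20) NOT NULL,
    line_number   INT NOT NULL
);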
Let's dig a little deeper now into two specific subtypes of real world databases. The first type is a transactional database. This is the basis for many e-commerce or content management applications driving modern sites. With a transaction DB, you're optimizing heavily toward "transaction throughput". Most commerce or content apps have to balance query performance (from certain tables) with insert performance (in other tables), though each app will have its own unique business driven issues to solve.
The second type of real world database is a reporting database. These are used almost exclusively to aggregate business data and to generate meaningful business reports. They are typically shaped differently than the transaction databases where the data is generated and they are highly optimized for speed of bulk data loading (ETLs) and query performance with large or complex data sets.
In each case, the developer or DBA needs to carefully balance both the functionality and performance curves, and there are lots of performance enhancing tricks on both sides of the equation. In Oracle you can do what's called an "explain plan" so you can see specifically how a query gets parsed and executed. You're looking to maximize the DB's proper use of indexes. One really nasty no-no is to put a function in the where clause of a query. Whenever you do that, you guarantee that Oracle will not use any indexes on that particular column and you'll likely see a full or partial table scan in the explain plan. That's just one specific example of how a query could be written that ends up being slow, and it doesn't have anything to do with joins.
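For example, with a hypothetical employees table in Oracle:

-- The function hides the indexed column from the optimizer: expect a table scan.
SELECT * FROM employees WHERE UPPER(last_name) = 'SMITH';

-- Rewritten so the indexed column stands alone, the index becomes usable:
SELECT * FROM employees WHERE last_name = 'Smith';

-- Or create a function-based index that matches the original predicate:
CREATE INDEX emp_last_name_upper ON employees (UPPER(last_name));

-- And to check which plan you actually got:
EXPLAIN PLAN FOR SELECT * FROM employees WHERE UPPER(last_name) = 'SMITH';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);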
And while we're talking about table scans, they obviously impact the query speed proportionally to the size of the table. A full table scan of 100 rows isn't even noticeable. Run that same query on a table with 100 million rows, and you'll need to come back next week for the return.
Let's talk about normalization for a minute. This is another largely positive academic topic that can get over-stressed. Most of the time when we talk about normalization we really mean the elimination of duplicate data by putting it into its own table and migrating an FK. Folks usually skip over the whole dependence thing described by 2NF and 3NF. And yet in an extreme case, it's certainly possible to have a perfect BCNF database that's enormous and a complete beast to write code against because it's so normalized.
So where do we balance? There is no single best answer. All of the better answers tend to be some compromise between ease of structure maintenance, ease of data maintenance and ease of code creation/maintenance. In general, the less duplication of data, the better.
So why are joins sometimes slow? Sometimes it's bad relational design. Sometimes it's ineffective indexing. Sometimes it's a data volume issue. Sometimes it's a horribly written query.
Sorry for such a long-winded answer, but I felt compelled to provide a meatier context around my comments rather than just rattle off a 4-bullet response.
Answered by HLGEM
People with terabyte-sized databases still use joins; if they can get them to work performance-wise, then so can you.
There are many reasons not to denormalize. First, speed of select queries is not the only or even the main concern with databases. Integrity of the data is the first concern. If you denormalize, then you have to put techniques into place to keep the data denormalized as the parent data changes. So suppose you take to storing the client name in all tables instead of joining to the client table on the client_Id. Now when the name of the client changes (100% chance some of the names of clients will change over time), you need to update all the child records to reflect that change. If you do this with a cascade update and you have a million child records, how fast do you suppose that is going to be, and how many users are going to suffer locking issues and delays in their work while it happens? Further, most people who denormalize because "joins are slow" don't know enough about databases to properly make sure their data integrity is protected and often end up with databases that have unusable data because the integrity is so bad.
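A sketch of that maintenance cost, with hypothetical table names:

-- Denormalized: client_name was copied into every child table,
-- so one rename must rewrite every copy (holding locks the whole time):
UPDATE orders    SET client_name = 'New Name Inc.' WHERE client_id = 42;
UPDATE invoices  SET client_name = 'New Name Inc.' WHERE client_id = 42;
UPDATE shipments SET client_name = 'New Name Inc.' WHERE client_id = 42;

-- Normalized: the name lives in exactly one row; every join picks up the change:
UPDATE clients SET client_name = 'New Name Inc.' WHERE client_id = 42;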
Denormalization is a complex process that requires a thorough understanding of database performance and integrity if it is to be done correctly. Do not attempt to denormalize unless you have such expertise on staff.
Joins are quite fast enough if you do several things. First, use a surrogate key; an int join is almost always the fastest join. Second, always index the foreign key. Use derived tables or join conditions to create a smaller dataset to filter on. If you have a large, very complex database, then hire a professional database person with experience in partitioning and managing huge databases. There are plenty of techniques to improve performance without getting rid of joins.
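For instance, with hypothetical clients and orders tables:

-- Always index the foreign key you join on:
CREATE INDEX ix_orders_client_id ON orders (client_id);

-- Use a derived table to shrink the dataset before the join happens:
SELECT c.client_name, t.order_total
FROM clients c
JOIN (
    SELECT client_id, SUM(amount) AS order_total
    FROM orders
    WHERE order_date >= '2010-01-01' -- filter early, join less
    GROUP BY client_id
) t ON t.client_id = c.client_id;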
If you just need query capability, then yes, you can design a data warehouse, which can be denormalized and populated through an ETL tool (optimized for speed), not user data entry.
Answered by Paul Sasik
Joins are slow if
- the data is improperly indexed
- the results are poorly filtered (see the sketch after this list)
- the joining query is poorly written
- the data sets are very large and complex
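A sketch of the filtering point, with hypothetical tables:

-- Poorly filtered: the join materializes every row, most of it discarded later.
SELECT c.name, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.id;

-- Filtered: the predicate limits the rows the join has to touch.
SELECT c.name, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.order_date >= '2010-04-01';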
So, true, the bigger your data sets, the more processing you'll need for a query, but checking and working on the first three options above will often yield great results.
Your source gives denormalization as an option. This is fine only as long as you've exhausted better alternatives.
Answered by Quassnoi
The joins can be slow if large portions of records from each side need to be scanned.
Like this:
SELECT SUM(transaction)
FROM customers
JOIN accounts
ON account_customer = customer_id
Even if an index is defined on account_customer, all records from the latter still need to be scanned.
For a query like this, decent optimizers probably won't even consider the index access path, doing a HASH JOIN or a MERGE JOIN instead.
Note that for a query like this:
SELECT SUM(transaction)
FROM customers
JOIN accounts
ON account_customer = customer_id
WHERE customer_last_name = 'Stellphlug'
the join will most probably be fast: first, an index on customer_last_name will be used to filter all Stellphlugs (which are, of course, not very numerous), then an index scan on account_customer will be issued for each Stellphlug to find his transactions.
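The two indexes that make that access path available might be created like this (the index names are illustrative):

CREATE INDEX ix_customers_last_name ON customers (customer_last_name);
CREATE INDEX ix_accounts_customer ON accounts (account_customer);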
Despite the fact that there can be billions of records in accounts and customers, only a few will actually need to be scanned.
Answered by Marcus Adams
Joins are fast.
Joins should be considered standard practice with a properly normalized database schema. Joins allow you to join disparate groups of data in a meaningful way. Don't fear the join.
The caveat is that you must understand normalization, joining, and the proper use of indexes.
Beware premature optimization, as the number one failing of all development projects is meeting the deadline. Once you've completed the project, and you understand the trade offs, you can break the rules if you can justify it.
It's true that join performance degrades non-linearly as the size of the data set increases. Therefore, it doesn't scale as nicely as single table queries, but it still does scale.
It's also true that a bird flies faster without any wings, but only straight down.
Answered by Larry Lustig
Also from the article you cited:
What many mega-scale websites with billions of records, petabytes of data, many thousands of simultaneous users, and millions of queries a day are doing is using a sharding scheme, and some are even advocating denormalization as the best strategy for architecting the data tier.
and
And unless you are a really large website you probably don't need to worry about this level of complexity.
and
It's more error prone than having the database do all this work, but you are able to scale past what even the highest-end databases can handle.
The article is discussing mega-sites like Ebay. At that level of usage you are likely going to have to consider something other than plain vanilla relational database management. But in the "normal" course of business (applications with thousands of users and millions of records) those more expensive, more error prone approaches are overkill.
Answered by Brent Baisley
Joins do require extra processing since they have to look in more files and more indexes to "join" the data together. However, "very large data sets" is all relative. What is the definition of large? In the case of JOINs, I think it's a reference to a large result set, not the overall dataset.
Most databases can very quickly process a query that selects 5 records from a primary table and joins 5 records from a related table for each record (assuming the correct indexes are in place). These tables can have hundreds of millions of records each, or even billions.
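A sketch of that shape of query, assuming hypothetical parent and child tables indexed on their keys:

-- ~5 parent rows, ~5 child rows each: fast even if both tables are huge.
SELECT p.id, p.name, c.detail
FROM parent p
JOIN child c ON c.parent_id = p.id
WHERE p.id IN (1, 2, 3, 4, 5);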
Once your result set starts growing, things are going to slow down. Using the same example, if the primary table results in 100K records, then there will be 500K "joined" records that need to be found. Just pulling that much data out of the database will add delays.
Don't avoid JOINs, just know you may need to optimize/denormalize when datasets get "very large".