Original source: http://stackoverflow.com/questions/992988/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
What is sharding and why is it important?
Asked by ojblass
I think I understand sharding to be putting back your sliced up data (the shards) into an easy to deal with aggregate that makes sense in the context. Is this correct?
Update: I guess I am struggling here. In my opinion the application tier should have no business determining where data should be stored. At best it should be a shard client of some sort. Both responses answered the "what" but not the "why is it important" aspect. What implications does it have beyond the obvious performance gains? Are these gains sufficient to offset the MVC violation? Is sharding mostly important in very large scale applications or does it apply to smaller scale ones?
Accepted answer by MicSim
Sharding is just another name for "horizontal partitioning" of a database. You might want to search for that term to get a clearer picture.
From Wikipedia:
Horizontal partitioning is a design principle whereby rows of a database table are held separately, rather than splitting by columns (as for normalization). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location. The advantage is the number of rows in each table is reduced (this reduces index size, thus improves search performance). If the sharding is based on some real-world aspect of the data (e.g. European customers vs. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.
Some more information about sharding:
Firstly, each database server is identical, having the same table structure. Secondly, the data records are logically split up in a sharded database. Unlike the partitioned database, each complete data record exists in only one shard (unless there's mirroring for backup/redundancy) with all CRUD operations performed just in that database. You may not like the terminology used, but this does represent a different way of organizing a logical database into smaller parts.
Update: You won't break MVC. The work of determining the correct shard in which to store the data would be done transparently by your data access layer. There you would have to determine the correct shard based on the criteria you used to shard your database. (You have to manually split the database into different shards based on some concrete aspects of your application.) Then, when loading data from and storing data into the database, you have to take care to use the correct shard.
Maybe this example with Java code (it's about the Hibernate Shards project) makes it somewhat clearer how this would work in a real-world scenario.
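Independently of the Hibernate Shards example, the idea of a data access layer that picks the shard transparently can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ShardedUserStore`, the use of in-memory dictionaries as stand-ins for database servers, and the CRC32-modulo routing rule are all hypothetical and are not taken from the answer or from Hibernate Shards.

```python
import zlib


class ShardedUserStore:
    """Hypothetical data access layer: callers never see which shard is used."""

    def __init__(self, shard_count):
        # Each dict stands in for a separate database server (assumption).
        self._shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, user_id):
        # A stable hash, so the same key always maps to the same shard.
        index = zlib.crc32(user_id.encode("utf-8")) % len(self._shards)
        return self._shards[index]

    def save(self, user_id, record):
        self._shard_for(user_id)[user_id] = record

    def load(self, user_id):
        return self._shard_for(user_id).get(user_id)


store = ShardedUserStore(shard_count=4)
store.save("alice", {"email": "alice@example.com"})
print(store.load("alice"))
```

The calling code only uses `save` and `load`; the sharding criterion lives entirely inside the store, which is the sense in which the application tier does not decide where data goes.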
To address the "why sharding": It's mainly for very large scale applications with lots of data. First, it helps minimize response times for database queries. Second, you can host your data on several cheaper, "lower-end" machines instead of one big server, which might not suffice anymore.
Answer by bayer
If you have queries to a DBMS for which the locality is quite restricted (say, a user only fires selects with a 'where username = $my_username') it makes sense to put all the usernames starting with A-M on one server and all from N-Z on the other. By this you get near linear scaling for some queries.
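The A-M / N-Z split described above could look roughly like this. It is a minimal sketch: the function name `server_for` and the server labels are made up for illustration, and the handling of usernames that do not start with a letter is an assumption, since the answer does not cover that case.

```python
def server_for(username):
    """Pick a server by the first letter of the username (A-M vs. N-Z)."""
    first = username[:1].upper()
    if "A" <= first <= "M":
        return "server_a_m"  # hypothetical server label
    if "N" <= first <= "Z":
        return "server_n_z"  # hypothetical server label
    # Assumption: anything else (digits, empty names) is rejected here.
    raise ValueError(f"no shard rule for username {username!r}")


for name in ["alice", "Mallory", "nina", "zed"]:
    print(name, "->", server_for(name))
```

A query such as `where username = $my_username` then touches exactly one server, which is where the near-linear scaling comes from.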
Long story short: Sharding is basically the process of distributing tables across different servers in order to spread the load evenly among them.
Of course, it's so much more complicated in reality. :)
Answer by Himanshu Kansal
Sharding is horizontal (row-wise) database partitioning, as opposed to vertical (column-wise) partitioning, which is normalization. It separates very large databases into smaller, faster, and more easily managed parts called data shards. It is a mechanism for building distributed systems.
Why do we need distributed systems?
- Increased availability.
- Easier expansion.
- Economics: It costs less to create a network of smaller computers with the power of a single large computer.
You can read more here: Advantages of Distributed database
How does sharding help achieve a distributed system?
You can partition a search index into N partitions and load each partition on a separate server. If you query one server, you will get 1/Nth of the results. So, to get the complete result set, a typical distributed search system uses an aggregator that accumulates results from each server and combines them. The aggregator also distributes the query to each server. This aggregator program is called MapReduce in big data terminology. In other words, Distributed Systems = Sharding + MapReduce (although there are other things too).
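The aggregator described above can be sketched as a small scatter-gather routine: send the query to every partition, then merge the partial results. The sample documents, the substring match, and the names `PARTITIONS`, `search_partition`, and `aggregate_search` are all assumptions for illustration; threads stand in for separate servers, and this is not a MapReduce framework.

```python
from concurrent.futures import ThreadPoolExecutor

# Three in-memory "index partitions", each standing in for one server.
PARTITIONS = [
    ["sharding basics", "mvc patterns"],
    ["database sharding at scale", "index tuning"],
    ["sharded counters", "cloud storage"],
]


def search_partition(documents, term):
    # Each partition only knows about its own slice of the documents.
    return [doc for doc in documents if term in doc]


def aggregate_search(term):
    # Scatter: run the same query against every partition in parallel.
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        partial_results = list(
            pool.map(search_partition, PARTITIONS, [term] * len(PARTITIONS))
        )
    # Gather: combine the partial result lists into one sorted result set.
    merged = []
    for hits in partial_results:
        merged.extend(hits)
    return sorted(merged)


print(aggregate_search("shard"))
```

Querying a single partition would return only its share of the matches; the aggregator is what makes the partitioned index look like one index to the caller.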
Answer by earino
Is sharding mostly important in very large scale applications or does it apply to smaller scale ones?
Sharding is a concern if and only if your needs scale past what can be served by a single database server. It's a swell tool if you have shardable data and incredibly high scalability and performance requirements. I would guess that in the entire 12 years I've been a software professional, I've encountered one situation that could have benefited from sharding. It's an advanced technique with very limited applicability.
Besides, the future is probably going to be something fun and exciting like a massive object "cloud" that erases all potential performance limitations, right? :)
Answer by lampShaded
The term "sharding" was originally coined by Google engineers, and you can see it used pretty heavily when writing applications on Google App Engine. Since there are hard limits on the amount of resources your queries can use, and because queries themselves have strict limitations, sharding is not only encouraged but almost enforced by the architecture.
Another place sharding can be used is to reduce contention on data entities. When building scalable systems, it is especially important to watch out for pieces of data that are written often, because they are always the bottleneck. A good solution is to shard that specific entity and write to multiple copies, then read the total. An example of this is the "sharded counter" on GAE: http://code.google.com/appengine/articles/sharding_counters.html
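The "write to multiple copies, then read the total" idea can be illustrated with a small in-memory counter. This is a sketch of the general pattern only, not the App Engine datastore code from the linked article; the class name `ShardedCounter` and the default of 8 shards are assumptions.

```python
import random


class ShardedCounter:
    """Counter split into several slots so writers rarely hit the same one."""

    def __init__(self, shard_count=8):
        # Each list slot stands in for a separately stored counter entity.
        self._shards = [0] * shard_count

    def increment(self):
        # Write to one randomly chosen shard to spread write contention.
        index = random.randrange(len(self._shards))
        self._shards[index] += 1

    def total(self):
        # Read the total by summing every shard.
        return sum(self._shards)


counter = ShardedCounter()
for _ in range(1000):
    counter.increment()
print(counter.total())  # prints 1000
```

Writes become cheaper because they are spread over several entities, while reads become slightly more expensive because they must sum all of them.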
Answer by Krishna Rathi
Sharding does more than just horizontal partitioning. According to the Wikipedia article,
Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which partition a particular row will be found, without first needing to search the index, e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.
Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.
Also,
Splitting shards across multiple isolated instances requires more than simple horizontal partitioning. The hoped-for gains in efficiency would be lost, if querying the database required both instances to be queried, just to retrieve a simple dimension table. Beyond partitioning, sharding thus splits large partitionable tables across the servers, while smaller tables are replicated as complete units
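The point about splitting large tables while replicating small ones can be made concrete with two in-memory SQLite databases standing in for two shard servers. The table names, sample rows, and helper functions below are hypothetical; the sketch only shows that a join against a replicated dimension table stays local to one shard.

```python
import sqlite3

# Small dimension table: copied in full to every shard (assumed sample data).
COUNTRIES = [("DE", "Germany"), ("US", "United States")]


def make_shard(customers):
    """Create one shard holding all countries but only some customers."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT)")
    db.execute(
        "CREATE TABLE customer "
        "(id INTEGER PRIMARY KEY, name TEXT, country_code TEXT)"
    )
    db.executemany("INSERT INTO country VALUES (?, ?)", COUNTRIES)
    db.executemany("INSERT INTO customer VALUES (?, ?, ?)", customers)
    return db


def customers_with_country(db):
    # The join needs no second server because 'country' is replicated.
    rows = db.execute(
        "SELECT customer.name, country.name FROM customer "
        "JOIN country ON country.code = customer.country_code "
        "ORDER BY customer.id"
    )
    return rows.fetchall()


shard_europe = make_shard([(1, "Anna", "DE")])
shard_america = make_shard([(2, "Bob", "US")])
print(customers_with_country(shard_europe))
print(customers_with_country(shard_america))
```

If the `country` table lived on only one of the two instances, every such lookup would need both servers, which is the efficiency loss the quoted passage warns about.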
Answer by Hans Malherbe
In my opinion the application tier should have no business determining where data should be stored
This is a good rule but, like most things, not always correct.
When you do your architecture you start with responsibilities and collaborations. Once you determine your functional architecture, you have to balance the non-functional forces.
If one of these non-functional forces is massive scalability, you have to adapt your architecture to cater for this force even if it means that your data storage abstraction now leaks into your application tier.


