MySQL: Database Sharding vs. Partitioning
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/20771435/
Database sharding vs partitioning
Asked by Amit Sharma
I have been reading about scalable architectures recently. In that context, two words that keep showing up with regard to databases are sharding and partitioning. I looked up descriptions but still ended up confused.
Could the experts at stackoverflow help me get the basics right?
- What is the difference between sharding and partitioning?
- Is it true that 'all sharded databases are essentially partitioned (over different nodes), but all partitioned databases are not necessarily sharded'?
Accepted answer by Canis
Partitioning is more a generic term for dividing data across tables or databases. Sharding is one specific type of partitioning, part of what is called horizontal partitioning.
Here you replicate the schema across (typically) multiple instances or servers, using some kind of logic or identifier to know which instance or server holds the data. An identifier of this kind is often called a "Shard Key".
A common, key-less logic is to use the alphabet to divide the data: A-D is instance 1, E-G is instance 2, etc. Customer data is well suited for this, but the instances will end up unevenly sized if the partitioning does not take into account that some letters are more common than others.
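The alphabet split described above could be sketched roughly like this. The letter ranges beyond A-D and E-G, the instance names, and the function name are all made up for illustration:

```python
# Hypothetical letter ranges; only A-D and E-G come from the answer above.
SHARD_RANGES = [
    ("A", "D", "instance-1"),
    ("E", "G", "instance-2"),
    ("H", "Z", "instance-3"),
]


def shard_for_name(last_name: str) -> str:
    """Return the instance responsible for names starting with this letter."""
    first = last_name.strip()[:1].upper()
    for low, high, instance in SHARD_RANGES:
        if low <= first <= high:
            return instance
    raise ValueError(f"no shard covers names starting with {first!r}")
```

Counting how many real customer names land on each instance is a quick way to see the skew mentioned above, since initial letters are far from evenly distributed.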
Another common technique is to use a key-synchronization system or logic that ensures unique keys across the instances.
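One way to get unique keys without a central coordinator is to pack the shard number into the key itself. The sketch below is only an illustration of that idea: the bit widths (13 bits of shard id, 10 bits of sequence), the epoch, and the function names are assumptions, not something stated in this answer.

```python
import time
from typing import Optional

EPOCH_MS = 1_300_000_000_000  # hypothetical custom epoch, in milliseconds
SHARD_BITS = 13               # assumed width of the shard id field
SEQ_BITS = 10                 # assumed width of the per-shard sequence field


def make_id(shard_id: int, sequence: int, now_ms: Optional[int] = None) -> int:
    """Build an id as: elapsed milliseconds | shard id | sequence."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id out of range")
    elapsed = now_ms - EPOCH_MS
    seq = sequence % (1 << SEQ_BITS)  # the sequence wraps around
    return (elapsed << (SHARD_BITS + SEQ_BITS)) | (shard_id << SEQ_BITS) | seq


def shard_of(generated_id: int) -> int:
    """Recover the shard id that was packed into an id."""
    return (generated_id >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)
```

Two shards generating ids in the same millisecond still produce different values because the shard id differs, and the id alone tells you which shard to query.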
A well known example you can study is how Instagram solved their partitioning in the early days (see link below). They started out partitioned on very few servers, using Postgres to divide the data from the get-go. I believe it was several thousand logical shards on those few physical shards. Read their awesome writeup from 2012 here: Instagram Engineering - Sharding & IDs
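The "many logical shards on few physical servers" idea can be sketched as a two-step lookup. The shard count, server names, and hash choice here are assumptions for illustration, not details taken from the Instagram writeup:

```python
import zlib

N_LOGICAL = 4096                      # hypothetical number of logical shards
PHYSICAL = ["db-a", "db-b", "db-c"]   # hypothetical physical servers

# Each logical shard is assigned to a physical server up front.
logical_to_physical = {i: PHYSICAL[i % len(PHYSICAL)] for i in range(N_LOGICAL)}


def logical_shard(user_id: int) -> int:
    """Map a key to a logical shard with a stable hash."""
    return zlib.crc32(str(user_id).encode()) % N_LOGICAL


def server_for(user_id: int) -> str:
    """Resolve the physical server through the logical shard."""
    return logical_to_physical[logical_shard(user_id)]
```

The point of the indirection is that a logical shard can later be moved to a new server by changing one entry in `logical_to_physical`, without rehashing any keys.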
See here as well: http://www.quora.com/Whats-the-difference-between-sharding-and-partition
Answer by NoChance
Looks like this answers both your questions:
Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.
Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.
Source: Wiki-Shard.
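To make the quoted distinction concrete, here is a rough sketch of the 'CustomersEast' / 'CustomersWest' example. The zip-code rule and the server names are invented for illustration; the only point is that the table is chosen from the zip code alone, and that sharding additionally places the tables on different servers.

```python
# Hypothetical placement of the two tables when the database is sharded.
SHARDED_SERVERS = {"CustomersEast": "server-1", "CustomersWest": "server-2"}


def customers_table(zip_code: str) -> str:
    """Pick the table from the zip code alone, without consulting an index."""
    if not zip_code or not zip_code[0].isdigit():
        raise ValueError(f"invalid zip code: {zip_code!r}")
    # Assumed rule: a leading digit of 0-4 counts as "east", 5-9 as "west".
    return "CustomersEast" if zip_code[0] in "01234" else "CustomersWest"


def locate(zip_code: str, sharded: bool) -> tuple:
    """Return (server, table) for a customer row."""
    table = customers_table(zip_code)
    # Horizontal partitioning: both tables live on the same server.
    # Sharding: each table lives on its own server.
    server = SHARDED_SERVERS[table] if sharded else "server-1"
    return server, table
```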
Sharding is the process of storing data records across multiple machines and is MongoDB's approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations.
Source: MongoDB.
Answer by Jeach
I've been diving into this as well, and although I'm by far not the reference on the matter, there are a few key facts that I've gathered and points that I'd like to share:
A partition is a division of a logical database or its constituent elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons, as for load balancing.
https://en.wikipedia.org/wiki/Partition_(database)
Sharding is a type of partitioning, namely a form of Horizontal Partitioning (HP).
There is also Vertical Partitioning (VP), whereby you split a table into smaller distinct parts. Normalization also involves this splitting of columns across tables, but vertical partitioning goes beyond that and partitions columns even when already normalized.
https://en.wikipedia.org/wiki/Shard_(database_architecture)
I really like Tony Baco's answer on Quora where he makes you think in terms of schema (rather than columns and rows). He states that...
"Horizontal partitioning", or sharding, is replicating [copying] the schema, and then dividing the data based on a shard key.
"Vertical partitioning" involves dividing up the schema (and the data goes along for the ride).
https://www.quora.com/Whats-the-difference-between-sharding-DB-tables-and-partitioning-them
Oracle's Database Partitioning Guide has some nice figures. I have copied a few excerpts from the article.
https://docs.oracle.com/cd/B28359_01/server.111/b32024/partition.htm
When to Partition a Table
Here are some suggestions for when to partition a table:
- Tables greater than 2 GB should always be considered as candidates for partitioning.
- Tables containing historical data, in which new data is added into the newest partition. A typical example is a historical table where only the current month's data is updatable and the other 11 months are read only.
- When the contents of a table need to be distributed across different types of storage devices.
Partition Pruning
Partition pruning is the simplest and also the most substantial means to improve performance using partitioning. Partition pruning can often improve query performance by several orders of magnitude. For example, suppose an application contains an Orders table containing a historical record of orders, and that this table has been partitioned by week. A query requesting orders for a single week would only access a single partition of the Orders table. If the Orders table had 2 years of historical data, then this query would access one partition instead of 104 partitions. This query could potentially execute 100 times faster simply because of partition pruning.
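The Orders example above can be imitated in a few lines to show what pruning buys you. This is only a toy model in memory: the start date, the dictionary of partitions, and the function names are assumptions, and a real database does the pruning inside its query planner.

```python
from datetime import date, timedelta

START = date(2022, 1, 3)   # hypothetical first day of the first weekly partition
N_WEEKS = 104              # two years of weekly partitions

# One list of rows per weekly partition.
partitions = {week: [] for week in range(N_WEEKS)}


def week_index(day: date) -> int:
    """Return the number of the weekly partition that holds this day."""
    return (day - START).days // 7


def insert_order(order_id: int, order_date: date) -> None:
    partitions[week_index(order_date)].append((order_id, order_date))


def orders_between(first: date, last: date):
    """Scan only the partitions that overlap [first, last].

    Returns the matching rows and the number of partitions scanned.
    """
    touched = range(week_index(first), week_index(last) + 1)
    rows = [row for week in touched for row in partitions[week]
            if first <= row[1] <= last]
    return rows, len(touched)
```

A query for a single week scans one partition instead of all 104, which is where the "potentially 100 times faster" estimate comes from.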
Partitioning Strategies
- Range
- Hash
- List
You can read their text and look at their figures, which explain everything pretty well.
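For a quick feel of the three strategies (range, hash, list) without opening the Oracle guide, here is a minimal sketch. The boundaries, the state-to-partition map, and the function names are hypothetical and are not taken from Oracle's documentation:

```python
import zlib
from bisect import bisect_right

# Range: partition 0 holds values < 100, partition 1 holds 100-199,
# partition 2 holds 200-299, and partition 3 holds everything else.
RANGE_BOUNDS = [100, 200, 300]

# List: each partition is defined by an explicit list of values.
LIST_MAP = {"NY": "p_east", "MA": "p_east", "CA": "p_west", "WA": "p_west"}


def range_partition(value: int) -> int:
    """Pick a partition by comparing the value with sorted boundaries."""
    return bisect_right(RANGE_BOUNDS, value)


def hash_partition(key: str, n_partitions: int) -> int:
    """Pick a partition from a stable hash of the key."""
    return zlib.crc32(key.encode()) % n_partitions


def list_partition(state: str) -> str:
    """Pick a partition by looking the value up in an explicit list."""
    try:
        return LIST_MAP[state]
    except KeyError:
        raise ValueError(f"no partition lists the value {state!r}") from None
```

Range suits dates and other ordered values, hash spreads keys evenly when there is no natural grouping, and list suits a small set of known discrete values.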
And lastly, it is important to understand that databases are extremely resource intensive:
- CPU
- Disk
- I/O
- Memory
Many DBAs will partition on the same machine, where the partitions share all the resources but provide an improvement in disk and I/O by splitting up the data and/or index.
Other strategies employ a "shared nothing" architecture, where the shards reside on separate and distinct computing units (nodes), each having 100% of the CPU, disk, I/O and memory to itself. This provides its own set of advantages and complexities.
Answer by vikas bhandari
Consider a table in a database with 1 million rows and 100 columns. In partitioning, you can divide the table into 2 or more tables with properties like:
0.4 million rows (table1) and 0.6 million rows (table2)
1 million rows & 60 columns (table1) and 1 million rows & 40 columns (table2)
There could be multiple cases like that.
This is general partitioning.
But sharding refers to the 1st case only, where we divide the data on the basis of rows. If we divide the table into multiple tables, we need to maintain multiple similar copies of the schema, as we now have multiple tables.
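The two cases (splitting by rows versus splitting by columns) can be tried out on a tiny in-memory table. This is just a sketch with made-up column names and helper functions:

```python
# A tiny stand-in for the large table: a list of row dictionaries.
rows = [
    {"id": i, "name": f"user{i}", "email": f"user{i}@example.com", "bio": "..."}
    for i in range(10)
]


def split_rows(table, n_first):
    """Case 1 (horizontal): same columns in both tables, different rows."""
    return table[:n_first], table[n_first:]


def split_columns(table, key, first_columns):
    """Case 2 (vertical): same rows in both tables, different columns.

    The key column is kept in both tables so the rows can be joined again.
    """
    table1 = [{c: r[c] for c in (key, *first_columns)} for r in table]
    table2 = [{c: v for c, v in r.items() if c == key or c not in first_columns}
              for r in table]
    return table1, table2
```

In the first case both tables keep the full schema, which is why sharding means maintaining several copies of the same schema.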
Answer by Pavel
Sharding is a special case of horizontal partitioning, where the partitions span multiple database instances. If a database is sharded, it is partitioned by definition.
Answer by imran
When talking about partitioning, please do not use the terms replicate or replication. Replication is a different concept and out of scope for this page. When we talk about partitioning, a better word is divide; when we talk about sharding, a better word is distribute.
In partitioning (normally, and in the common understanding, though not always), the rows of a large table are divided into two or more disjoint groups (groups that share no rows). You can call each group a partition. All the partitions remain under the control of one RDBMS instance, and the division is purely logical. The basis for each group can be a hash, a range, and so on. If you have ten years of data in a table, you can store each year's data in a separate partition by setting partition boundaries on a non-null column such as CREATE_DATE. When you then query the database with a create date between 01-01-1999 and 31-12-2000, only two partitions will be hit, and the scan will be sequential. I did something similar on a database with over a billion records, and the SQL time dropped from 30 seconds to 50 milliseconds, with indexes and everything else in place.
Sharding means you host each partition on a different node/machine. Searches inside the partitions/shards can then happen in parallel.
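The yearly CREATE_DATE example and the parallel search across shards might look roughly like this. It is a toy model: the year range, the in-memory `shards` dictionary, and the function names are assumptions, and the threads merely stand in for requests sent to separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date

# One partition per year; under sharding each would live on its own node.
shards = {year: [] for year in range(1995, 2005)}


def insert(record_id: int, create_date: date) -> None:
    shards[create_date.year].append((record_id, create_date))


def scan(year: int, first: date, last: date):
    """Search a single partition/shard."""
    return [row for row in shards[year] if first <= row[1] <= last]


def search(first: date, last: date):
    """Query only the years that overlap the range, all at the same time."""
    years = [y for y in range(first.year, last.year + 1) if y in shards]
    with ThreadPoolExecutor(max_workers=len(years) or 1) as pool:
        parts = list(pool.map(lambda y: scan(y, first, last), years))
    return [row for part in parts for row in part], years
```

A query for 01-01-1999 through 31-12-2000 touches only the 1999 and 2000 partitions; the other eight are never read.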