MySQL 分片方法?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/5541421/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-31 19:24:13  来源:igfitidea点击:

MySQL sharding approaches?

mysqlsharding

提问by sheki

What is the best approach for Sharding MySQL tables. The approaches I can think of are :

分片 MySQL 表的最佳方法是什么。我能想到的方法是:

  1. Application Level sharding?
  2. Sharding at MySQL proxy layer?
  3. Central lookup server for sharding?
  1. 应用级分片?
  2. 在 MySQL 代理层分片?
  3. 用于分片的中央查找服务器?

Do you know of any interesting projects or tools in this area?

你知道这方面有什么有趣的项目或工具吗?

回答by Isotopp

The best approach for sharding MySQL tables to not do it unless it is totally unavoidable to do it.

除非完全不可避免,否则最好不要将 MySQL 表分片。

When you are writing an application, you usually want to do so in a way that maximizes velocity, developer speed. You optimize for latency (time until the answer is ready) or throughput (number of answers per time unit) only when necessary.

在编写应用程序时,您通常希望以最大限度提高速度和开发人员速度的方式进行编写。仅在必要时优化延迟(答案准备好之前的时间)或吞吐量(每个时间单位的答案数)。

You partition and then assign partitions to different hosts (= shard) only when the sum of all these partitions does no longer fit onto a single database server instance - the reason for that being either writes or reads.

只有当所有这些分区的总和不再适合单个数据库服务器实例时,您才进行分区,然后将分区分配给不同的主机(= 分片) - 原因是写入或读取。

The write case is either a) the frequency of writes is overloading this servers disks permanently or b) there are too many writes going on so that replication permanently lags in this replication hierarchy.

写入情况是 a) 写入频率使服务器磁盘永久过载或 b) 写入过多,因此复制在此复制层次结构中永久滞后。

The read case for sharding is when the size of the data is so large that the working set of it no longer fits into memory and data reads start hitting the disk instead of being served from memory most of the time.

分片的读取情况是当数据的大小如此之大以至于它的工作集不再适合内存并且数据读取开始命中磁盘而不是大部分时间从内存中提供服务时。

Only when you haveto shard you do it.

只有当你必须分片时,你才这样做。



The moment you shard, you are paying for that in multiple ways:

分片的那一刻,您将通过多种方式为此付出代价:

Much of your SQL is no longer declarative.

您的大部分 SQL 不再是声明性的。

Normally, in SQL you are telling the database what data you want and leave it to the optimizer to turn that specification into a data access program. That is a good thing, because it is flexible, and because writing these data access programs is boring work that harms velocity.

通常,在 SQL 中,您会告诉数据库您想要什么数据,然后让优化器将其转换为数据访问程序。这是一件好事,因为它很灵活,而且因为编写这些数据访问程序是一件会损害速度的无聊工作。

With a sharded environment you are probably joining a table on node A against data on node B, or you have a table larger than a node, on nodes A and B and are joining data from it against data that is on node B and C. You are starting to write application side hash-based join resolutions manually in order to resolve that (or you are reinventing MySQL cluster), meaning you end up with a lot of SQL that no longer declarative, but is expressing SQL functionality in a procedural way (e.g. you are using SELECT statements in loops).

在分片环境中,您可能会根据节点 B 上的数据加入节点 A 上的表,或者您在节点 A 和 B 上有一个比节点大的表,并且正在根据节点 B 和 C 上的数据连接来自该节点的数据。您开始手动编写应用程序端基于哈希的联接解析以解决该问题(或者您正在重新发明 MySQL 集群),这意味着您最终会得到许多不再声明的 SQL,而是以程序方式表达 SQL 功能(例如,您在循环中使用 SELECT 语句)。

You are incurring a lot of network latency.

您正在招致大量网络延迟。

Normally, an SQL query can be resolved locally and the optimizer knows about the costs associated with local disk accesses and resolves the query in a way that minimizes the costs for that.

通常,SQL 查询可以在本地解析,优化器知道与本地磁盘访问相关的成本,并以最小化成本的方式解析查询。

In a sharded environment, queries are resolved by either running key-value accesses across a network to multiple nodes (hopefully with batched key accesses and not individual key lookups per round trip) or by pushing parts of the WHEREclause onward to the nodes where they can be applied (that is called 'condition pushdown'), or both.

在分片环境中,通过在网络上对多个节点运行键值访问来解决查询(希望使用批量键访问而不是每次往返的单独键查找)或通过将部分WHERE子句向前推送到节点应用(称为“条件下推”),或两者都应用。

But even in the best of cases this involves many more network round trips that a local situation, and it is more complicated. Especially since the MySQL optimizer knows nothing about network latency at all (Ok, MySQL cluster is slowly getting better at that, but for vanilla MySQL outside of cluster that is still true).

但即使在最好的情况下,这也涉及比本地情况更多的网络往返行程,并且更复杂。特别是因为 MySQL 优化器对网络延迟一无所知(好吧,MySQL 集群在这方面正在慢慢变得更好,但对于集群外的 vanilla MySQL 仍然如此)。

You are losing a lot of expressive power of SQL.

您正在失去 SQL 的许多表达能力。

Ok, that is probably less important, but foreign key constraints and other SQL mechanisms for data integrity are incapable of spanning multiple shards.

好吧,这可能不太重要,但是外键约束和其他用于数据完整性的 SQL 机制无法跨越多个分片。

MySQL has no API which allows asynchronous queries that is in working order.

MySQL 没有允许异步查询正常工作的 API。

When data of the same type resides on multiple nodes (e.g. user data on nodes A, B and C), horizontal queries often need to be resolved against all of these nodes ("Find all user accounts that have not been logged in for 90 days or more"). Data access time grows linearly with the number of nodes, unless multiple nodes can be asked in parallel and the results aggregated as they come in ("Map-Reduce").

当同一类型的数据驻留在多个节点上时(例如节点 A、B 和 C 上的用户数据),通常需要针对所有这些节点解决横向查询(“查找所有 90 天未登录的用户帐户或者更多”)。数据访问时间随节点数量线性增长,除非可以并行询问多个节点并在它们进来时聚合结果(“Map-Reduce”)。

The precondition for that is an asynchronous communication API, which does not exist for MySQL in a good working shape. The alternative is a lot of forking and connections in the child processes, which is visiting the world of suck on a season pass.

这样做的前提是异步通信 API,它不存在于 MySQL 的良好工作状态。另一种选择是在子进程中进行大量分叉和连接,这是在季票上访问吮吸世界。



Once you start sharding, data structure and network topology become visible as performance points to your application. In order to perform reasonably well, your application needs to be aware of these things, and that means that really only application level sharding makes sense.

一旦您开始分片,数据结构和网络拓扑就会随着性能指向您的应用程序而变得可见。为了合理地执行,您的应用程序需要了解这些事情,这意味着实际上只有应用程序级分片才有意义。

The question is more if you want to auto-shard (determining which row goes into which node by hashing primary keys for example) or if you want to split functionally in a manual way ("The tables related to the xyz user story go to this master, while abc and def related tables go to that master").

如果您想自动分片(例如,通过散列主键来确定哪一行进入哪个节点),或者如果您想以手动方式进行功能拆分(“与 xyz 用户故事相关的表转到这个主,而 abc 和 def 相关的表去那个主")。

Functional sharding has the advantage that, if done right, it is invisible to most developers most of the time, because all tables related to their user story will be available locally. That allows them to still benefit from declarative SQL as long as possible, and will also incur less network latency because the number of cross-network transfers is kept minimal.

功能分片的优点是,如果做得好,大多数开发人员大部分时间都看不到它,因为与他们的用户故事相关的所有表都将在本地可用。这使他们仍然可以尽可能长时间地从声明式 SQL 中受益,并且由于跨网络传输的数量保持最少,因此网络延迟也会更少。

Functional sharding has the disadvantage that it does not allow for any single table to be larger than one instance, and it requires manual attention of a designer.

功能分片的缺点是不允许任何单个表大于一个实例,并且需要设计人员手动注意。

Functional sharding has the advantage that it is relatively easily done to an existing codebase with a number of changes that is not overly large. http://Booking.comhas done it multiple times in the past years and it worked well for them.

功能分片的优势在于它相对容易地对现有代码库进行,但更改数量不会太大。http://Booking.com在过去几年中多次这样做,并且对他们来说效果很好。



Having said all that, looking at your question, I do believe that you are asking the wrong questions, or I am completely misunderstanding your problem statement.

说了这么多,看着你的问题,我确实相信你问错了问题,或者我完全误解了你的问题陈述。

回答by chantheman

  1. Application Level sharding: dbShards is the only product that I know of that does "application aware sharding". There are a few good articles on the website. Just by definition, application aware sharding is going to be more efficient. If an application knows exactly where to go with a transaction without having to look it up or get redirected by a proxy, that in its self will be faster. And speed is often one of the primary concerns, if not the only concern, when someone is looking into sharding.

  2. Some people "shard" with a proxy, but in my eyes that defeats the purpose of sharding. You are just using another server to tell your transactions where to find the data or where to store it. With application aware sharding, your application knows where to go on its own. Much more efficient.

  3. This is the same as #2 really.

  1. 应用程序级分片:dbShards 是我所知道的唯一一种“应用程序感知分片”的产品。网站上有几篇不错的文章。顾名思义,应用感知分片将更加高效。如果应用程序确切地知道事务的去向而无需查找它或由代理重定向,那么它本身会更快。当有人研究分片时,速度通常是主要关注点之一,如果不是唯一关注点的话。

  2. 有些人用代理“分片”,但在我看来这违背了分片的目的。您只是在使用另一台服务器来告诉您的交易在哪里可以找到数据或在哪里存储数据。使用应用程序感知分片,您的应用程序知道自己去哪里。效率更高。

  3. 这实际上与#2 相同。

回答by btcbb

Do you know of any interesting projects or tools in this area?

你知道这方面有什么有趣的项目或工具吗?

Several new projects in this space:

这个领域的几个新项目:

  • citusdata.com
  • spockproxy.sourceforge.net
  • github.com/twitter/gizzard/
  • citusdata.com
  • spockproxy.sourceforge.net
  • github.com/twitter/gizzard/

回答by Justin Swanhart

Shard-Queryis an OLAP based sharding solution for MySQL. It allows you to define a combination of sharded tables and unsharded tables. The unsharded tables (like lookup tables) are freely joinable to sharded tables, and sharded tables may be joined to each other as long as the tables are joined by the shard key (no cross shard or self joins that cross shard boundaries). Being an OLAP solution, Shard-Query usually has minimum response times of 100ms or less, even for simple queries so it will not work for OLTP. Shard-Query is designed for analyzing big data sets in parallel.

Shard-Query是一个基于 OLAP 的 MySQL 分片解决方案。它允许您定义分片表和非分片表的组合。未分片表(如查找表)可以自由连接到分片表,只要表通过分片键连接(没有跨分片边界的交叉分片或自连接),分片表就可以相互连接。作为 OLAP 解决方案,Shard-Query 通常具有 100 毫秒或更短的最小响应时间,即使对于简单的查询也是如此,因此它不适用于 OLTP。Shard-Query 旨在并行分析大数据集。

OLTP sharding solutions exist for MySQL as well. Closed source solutions include ScaleDB, DBShards. Open source OLTP solution include JetPants, Cubridor Flock/Gizzard(Twitter infrastructure).

OLTP 分片解决方案也适用于 MySQL。闭源解决方案包括ScaleDBDBShards。开源 OLTP 解决方案包括JetPantsCubridFlock/ Gizzard(Twitter 基础架构)。

回答by Andrey Frolov

Application level of course.

当然是应用层面。

Best approach I've ever red I've found in this book

我在这本书中发现的最好的方法

High Performance MySQL http://www.amazon.com/High-Performance-MySQL-Jeremy-Zawodny/dp/0596003064

高性能 MySQL http://www.amazon.com/High-Performance-MySQL-Jeremy-Zawodny/dp/0596003064

Short description: you could split your data in many parts and store ~50 part on each server. It will help you to avoid the second biggest problem of sharding - rebalancing. Just move some of them to the new server and everything will be fine :)

简短描述:您可以将数据拆分为多个部分,并在每台服务器上存储约 50 个部分。它将帮助您避免分片的第二大问题——重新平衡。只需将其中一些移动到新服务器,一切都会好起来的:)

I strongly recommend you to buy it and read "mysql scaling" part.

我强烈建议您购买并阅读“mysql 扩展”部分。

回答by yuranos

As of 2018, there seems to be a MySql-native solution to that. There are actually at least 2 - InnoDB Clusterand NDB Cluster(there is a commercial and a community version of it).

截至 2018 年,似乎有一个 MySql 原生解决方案。实际上至少有两个 - InnoDB ClusterNDB Cluster(有商业版和社区版)。

Since most people who use MySql community edition are more familiar with InnoDB engine, this is what should be explored as a first priority. It supports replication and partitioning/sharding out of the box and is based on MySql Router for different routing/load-balancing options.

由于大多数使用 MySql 社区版的人更熟悉 InnoDB 引擎,因此这应该作为首要任务进行探索。它支持开箱即用的复制和分区/分片,并基于 MySql 路由器,用于不同的路由/负载平衡选项。

The syntax for your tables creation would need to change, for example:

创建表的语法需要更改,例如:

    CREATE TABLE t1 (col1 INT, col2 CHAR(5), col3 DATETIME) PARTITION BY HASH ( YEAR(col3) );

(this is only one of four partitioning types)

(这只是四种分区类型之一

One very important limitation:

一个非常重要的限制:

InnoDB foreign keys and MySQL partitioning are not compatible. Partitioned InnoDB tables cannot have foreign key references, nor can they have columns referenced by foreign keys. InnoDB tables which have or which are referenced by foreign keys cannot be partitioned.

InnoDB 外键和 MySQL 分区不兼容。分区的 InnoDB 表不能有外键引用,也不能有外键引用的列。具有外键或被外键引用的 InnoDB 表不能分区。