Handling very large data with MySQL
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute the content to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/39700330/
Asked by mOna
Sorry for the long post!
I have a database containing ~30 tables (InnoDB engine). Only two of these tables, "transaction" and "shift", are fairly large (the first has 1.5 million rows and shift has 23k rows). Everything works fine now, and I have no problem with the current database size.
However, we will have a similar database (same datatypes, design, ...) but much larger; e.g., the "transaction" table will have about 1 billion records (about 2.3 million transactions per day), and we are wondering how to deal with such a volume of data in MySQL (it is both read and write intensive). I have read a lot of related posts to see whether MySQL (and more specifically the InnoDB engine) can perform well with billions of records, but I still have some questions. Some of the related posts I have read are the following:
- Can MySQL reasonably perform queries on billions of rows?
- Is InnoDB (MySQL 5.5.8) the right choice for multi-billion rows?
- Best data store for billions of rows
- How big can a MySQL database get before performance starts to degrade
- Why MySQL could be slow with large tables?
- Can Mysql handle tables which will hold about 300 million records?
What I've understood so far to improve the performance for very large tables:
- (for InnoDB tables, which is my case) increasing innodb_buffer_pool_size (e.g., up to 80% of RAM). Also, I found some other MySQL performance tuning settings in the Percona blog
- having proper indexes on the table (using EXPLAIN on queries)
- partitioning the table
- MySQL sharding or clustering
Here are my questions/confusions:
About partitioning, I have some doubts whether we should use it or not. On one hand, many people suggest it to improve performance when a table is very large. On the other hand, I have read many posts saying that it does not improve query performance and does not make queries run faster (e.g., here and here). Also, I read in the MySQL Reference Manual that InnoDB foreign keys and MySQL partitioning are not compatible (we have foreign keys).
Regarding indexes, right now they perform well, but as far as I understand, indexing is more restrictive for very large tables (as Kevin Bedell mentioned in his answer here). Also, indexes speed up reads while slowing down writes (insert/update). So, for the new, similar project that will have this large DB, should we first insert/load all the data and then create the indexes (to speed up the inserts)?
If we cannot use partitioning for our big table (the "transaction" table), what is an alternative option to improve performance (other than MySQL variable settings such as innodb_buffer_pool_size)? Should we use MySQL clusters? (We also have lots of joins.)
EDIT
This is the SHOW CREATE TABLE statement for our largest table, named "transaction":
CREATE TABLE `transaction` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`terminal_transaction_id` int(11) NOT NULL,
`fuel_terminal_id` int(11) NOT NULL,
`fuel_terminal_serial` int(11) NOT NULL,
`xboard_id` int(11) NOT NULL,
`gas_station_id` int(11) NOT NULL,
`operator_id` text NOT NULL,
`shift_id` int(11) NOT NULL,
`xboard_total_counter` int(11) NOT NULL,
`fuel_type` int(11) NOT NULL,
`start_fuel_time` int(11) NOT NULL,
`end_fuel_time` int(11) DEFAULT NULL,
`preset_amount` int(11) NOT NULL,
`actual_amount` int(11) DEFAULT NULL,
`fuel_cost` int(11) DEFAULT NULL,
`payment_cost` int(11) DEFAULT NULL,
`purchase_type` int(11) NOT NULL,
`payment_ref_id` text,
`unit_fuel_price` int(11) NOT NULL,
`fuel_status_id` int(11) DEFAULT NULL,
`fuel_mode_id` int(11) NOT NULL,
`payment_result` int(11) NOT NULL,
`card_pan` text,
`state` int(11) DEFAULT NULL,
`totalizer` int(11) NOT NULL DEFAULT '0',
`shift_start_time` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `terminal_transaction_id` (`terminal_transaction_id`,`fuel_terminal_id`,`start_fuel_time`) USING BTREE,
KEY `start_fuel_time_idx` (`start_fuel_time`),
KEY `fuel_terminal_idx` (`fuel_terminal_id`),
KEY `xboard_idx` (`xboard_id`),
KEY `gas_station_id` (`gas_station_id`) USING BTREE,
KEY `purchase_type` (`purchase_type`) USING BTREE,
KEY `shift_start_time` (`shift_start_time`) USING BTREE,
KEY `fuel_type` (`fuel_type`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1665335 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT
Thanks for your time,
Answered by Rick James
Can MySQL reasonably perform queries on billions of rows? -- MySQL can 'handle' billions of rows. "Reasonably" depends on the queries; let's see them.
Is InnoDB (MySQL 5.5.8) the right choice for multi-billion rows? -- 5.7 has some improvements, but 5.5 is pretty good, in spite of being nearly 8 years old, and on the verge of no longer being supported.
Best data store for billions of rows -- If you mean 'Engine', then InnoDB.
How big can a MySQL database get before performance starts to degrade -- Again, that depends on the queries. I can show you a 1K row table that will melt down; I have worked with billion-row tables that hum along.
Why MySQL could be slow with large tables? -- range scans lead to I/O, which is the slow part.
Can Mysql handle tables which will hold about 300 million records? -- again, yes. The limit is somewhere around a trillion rows.
(for InnoDB tables, which is my case) increasing innodb_buffer_pool_size (e.g., up to 80% of RAM). Also, I found some other MySQL performance tuning settings in the Percona blog -- yes
having proper indexes on the table (using EXPLAIN on queries) -- well, let's see them. There are a lot of mistakes that can be made in this critical area.
partitioning the table -- "Partitioning is not a panacea!" I harp on that in my blog
MySQL Sharding -- Currently this is DIY
MySQL clustering -- Currently the best answer is some Galera-based option (PXC, MariaDB 10, DIY w/Oracle). Oracle's "Group Replication" is a viable contender.
Partitioning does not support FOREIGN KEY or "global" UNIQUE.
UUIDs, at the scale you are talking about, will not just slow down the system, but actually kill it. Type 1 UUIDs may be a workaround.
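The answer does not spell out the workaround. One reading, and it is only an assumption here, is that the time fields of a type 1 (time-based) UUID can be rearranged so that stored values sort roughly in creation order, which keeps new index entries close together instead of scattered. The function name below is hypothetical; as far as I know, MySQL 8.0's UUID_TO_BIN(uuid, 1) performs this kind of swap on the server side.

```python
import uuid

def to_time_ordered_bytes(u: uuid.UUID) -> bytes:
    """Rearrange a type 1 UUID so that byte order follows the timestamp.

    Standard layout: time_low(4) time_mid(2) time_hi_and_version(2) rest(8).
    Returned layout:  time_hi_and_version(2) time_mid(2) time_low(4) rest(8).
    """
    if u.version != 1:
        raise ValueError("only type 1 (time-based) UUIDs carry a timestamp")
    b = u.bytes
    return b[6:8] + b[4:6] + b[0:4] + b[8:]

# Two UUIDs generated on the same machine, one after the other.
first = uuid.uuid1()
second = uuid.uuid1()
print(to_time_ordered_bytes(first).hex())
print(to_time_ordered_bytes(second).hex())
```

The 16-byte result would be stored in a BINARY(16) column. Random (type 4) UUIDs have no timestamp to reorder, which is why the function rejects them.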
Insert and index-build speed -- There are too many variations to give a single answer. Let's see your tentative CREATE TABLE and how you intend to feed the data in.
Lots of joins -- "Normalize, but don't over-normalize." In particular, do not normalize datetimes or floats or other "continuous" values.
Do build summary tables
2.3 million transactions per day -- If that is 2.3M inserts (about 30/sec), then there is not much of a performance problem. If it is more complex, then RAID, SSD, batching, etc., may be necessary.
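The per-second figure is plain arithmetic on the numbers from the question. A quick check, assuming the inserts are spread evenly over 24 hours (real traffic will have peaks, so treat the result as an average):

```python
# Back-of-the-envelope insert rate, using the figures from the question.
# Assumption: the load is spread evenly over the day; peak hours will be higher.
SECONDS_PER_DAY = 24 * 60 * 60

def inserts_per_second(rows_per_day: float) -> float:
    """Average number of inserts per second for a given daily volume."""
    return rows_per_day / SECONDS_PER_DAY

def days_to_reach(total_rows: float, rows_per_day: float) -> float:
    """How many days of inserts it takes to accumulate total_rows."""
    return total_rows / rows_per_day

rate = inserts_per_second(2_300_000)
days = days_to_reach(1_000_000_000, 2_300_000)
print(f"{rate:.1f} inserts/sec on average")
print(f"{days:.0f} days to reach 1 billion rows")
```

That gives an average a little under 27 inserts per second, in line with the rough 30/sec above, and about 435 days to reach a billion rows.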
deal with such volume of data -- If most activity is with the "recent" rows, then the buffer_pool will nicely 'cache' the activity, thereby avoiding I/O. If the activity is "random", then MySQL (or anyone else) will have I/O issues.
Shrinking the datatypes helps in a table like yours. I doubt that you need 4 bytes to specify fuel_type. There are multiple 1-byte approaches.
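To get a feel for what shrinking buys at a billion rows, here is a rough sketch. The storage sizes are those of MySQL's integer types; the maximum values per column are made-up assumptions for illustration and would have to come from the real application.

```python
# Storage size in bytes and unsigned upper bound of each MySQL integer type.
INT_TYPES = [
    ("TINYINT UNSIGNED", 1, 2**8 - 1),
    ("SMALLINT UNSIGNED", 2, 2**16 - 1),
    ("MEDIUMINT UNSIGNED", 3, 2**24 - 1),
    ("INT UNSIGNED", 4, 2**32 - 1),
    ("BIGINT UNSIGNED", 8, 2**64 - 1),
]

def smallest_type(max_value: int):
    """Return (type name, bytes) of the smallest type holding 0..max_value."""
    for name, size, upper in INT_TYPES:
        if 0 <= max_value <= upper:
            return name, size
    raise ValueError("value does not fit in any unsigned MySQL integer type")

# Hypothetical maximum values; these are NOT taken from the question.
assumed_max = {"fuel_type": 20, "purchase_type": 10, "gas_station_id": 50_000}
ROWS = 1_000_000_000
CURRENT_BYTES = 4  # each of these columns is currently INT

total_saved = 0
for column, max_value in assumed_max.items():
    name, size = smallest_type(max_value)
    saved = (CURRENT_BYTES - size) * ROWS
    total_saved += saved
    print(f"{column}: {name}, saves {saved / 1e9:.0f} GB")
print(f"total: {total_saved / 1e9:.0f} GB, before counting index copies")
```

Each of these three columns also has a secondary index in the posted table, and a secondary index stores its own copy of the column, so the real saving would be larger than the data-only figure.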
Answered by Rick James
When collecting billions of rows, it is better (when possible) to consolidate, process, summarize, whatever, the data before storing. Keep the raw data in a file if you think you need to get back to it.
Doing that will eliminate most of your questions and concerns, plus speed up the processing.
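As a minimal sketch of summarizing before storing, the function below collapses raw transactions into one row per gas station per day. The column names come from the posted table, but the choice of grouping key, the summary fields, and the assumption that start_fuel_time is a Unix timestamp in seconds are mine, not the answer's.

```python
from collections import defaultdict
from datetime import datetime, timezone

def summarize(raw_rows):
    """Collapse raw transactions into one row per (gas_station_id, UTC day)."""
    totals = defaultdict(lambda: {"transactions": 0, "total_amount": 0})
    for row in raw_rows:
        # Assumption: start_fuel_time is a Unix timestamp in seconds.
        day = datetime.fromtimestamp(row["start_fuel_time"], tz=timezone.utc).date()
        bucket = totals[(row["gas_station_id"], day.isoformat())]
        bucket["transactions"] += 1
        # actual_amount is nullable in the table, so treat NULL as 0 here.
        bucket["total_amount"] += row["actual_amount"] or 0
    return dict(totals)

# Made-up sample rows for illustration.
raw = [
    {"gas_station_id": 7, "start_fuel_time": 1474848000, "actual_amount": 40},
    {"gas_station_id": 7, "start_fuel_time": 1474851600, "actual_amount": None},
    {"gas_station_id": 9, "start_fuel_time": 1474934400, "actual_amount": 25},
]
summary = summarize(raw)
for (station, day), values in sorted(summary.items()):
    print(station, day, values["transactions"], values["total_amount"])
```

Only the summary rows would then be inserted into MySQL, while the raw rows stay in a file in case they are needed again.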