php - MySQL and NoSQL: Help me to choose the right one
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/4419499/
MySQL and NoSQL: Help me to choose the right one
Asked by cedivad
There is a big table with 1,000,000,000 rows, called threads (these threads actually exist; I'm not making things harder just because I enjoy it). To keep things fast, threads contains only a few columns: (int id, string hash, int replycount, int dateline (timestamp), int forumid, string title)
Query:
select * from thread where forumid = 100 and replycount > 1 order by dateline desc limit 10000, 100
Since there are 1G of records, it is quite a slow query. So I thought: let's split this 1G of records into as many tables as I have forums (categories)! That is almost perfect. With many tables, there are fewer records to search through and it really is faster. The query now becomes:
select * from thread_{forum_id} where replycount > 1 order by dateline desc limit 10000, 100
This is really faster for 99% of the forums (categories), since most of them have only a few topics (100k-1M). However, because some of them have about 10M records, some queries are still too slow (0.1/0.2 seconds, too much for my app! I'm already using indexes!).
I don't know how to improve this using MySQL. Is there a way?
For this project I will use 10 servers (12GB RAM, 4x 7200rpm hard disks in software RAID 10, quad core).
The idea was to simply split the databases among the servers, but with the problem explained above that is still not enough.
If I install Cassandra on these 10 servers (supposing I find the time to make it work as it is supposed to), should I expect a performance boost?
What should I do? Keep working with MySQL, with the database distributed over multiple machines, or build a Cassandra cluster?
I was asked to post the indexes, so here they are:
mysql> show index in thread;
PRIMARY id
forumid
dateline
replycount
EXPLAIN of the select:
mysql> explain SELECT * FROM thread WHERE forumid = 655 AND visible = 1 AND open <> 10 ORDER BY dateline ASC LIMIT 268000, 250;
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
| id | select_type | table  | type | possible_keys | key     | key_len | ref         | rows   | Extra                       |
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
|  1 | SIMPLE      | thread | ref  | forumid       | forumid | 4       | const,const | 221575 | Using where; Using filesort |
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
Answered by Jon Black
You should read the following and learn a little about the advantages of a well-designed InnoDB table and how best to use clustered indexes - only available with InnoDB!
http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/
Then design your system along the lines of the following simplified example:
Example schema (simplified)
The important features are that the tables use the InnoDB engine and that the primary key for the threads table is no longer a single auto_incrementing key but a composite clustered key based on the combination of forum_id and thread_id, e.g.
threads - primary key (forum_id, thread_id)

forum_id   thread_id
========   =========
1          1
1          2
1          3
1          ...
1          2058300
2          1
2          2
2          3
2          ...
2          2352141
...
Each forum row includes a counter called next_thread_id (unsigned int) which is maintained by a trigger and increments every time a thread is added to a given forum. This also means we can store 4 billion threads per forum rather than 4 billion threads in total if using a single auto_increment primary key for thread_id.
forum_id   title       next_thread_id
========   =====       ==============
1          forum 1     2058300
2          forum 2     2352141
3          forum 3     2482805
4          forum 4     3740957
...
64         forum 64    3243097
65         forum 65    15000000 -- ooh a big one
66         forum 66    5038900
67         forum 67    4449764
...
247        forum 247   0 -- still loading data for half the forums !
248        forum 248   0
249        forum 249   0
250        forum 250   0
The disadvantage of using a composite key is that you can no longer just select a thread by a single key value as follows:
select * from threads where thread_id = y;
you have to do:
select * from threads where forum_id = x and thread_id = y;
However, your application code should be aware of which forum a user is browsing so it's not exactly difficult to implement - store the currently viewed forum_id in a session variable or hidden form field etc...
Here's the simplified schema:
drop table if exists forums;

create table forums
(
    forum_id smallint unsigned not null auto_increment primary key,
    title varchar(255) unique not null,
    next_thread_id int unsigned not null default 0 -- count of threads in each forum
)
engine=innodb;

drop table if exists threads;

create table threads
(
    forum_id smallint unsigned not null,
    thread_id int unsigned not null default 0,
    reply_count int unsigned not null default 0,
    hash char(32) not null,
    created_date datetime not null,
    primary key (forum_id, thread_id, reply_count) -- composite clustered index
)
engine=innodb;

delimiter #

create trigger threads_before_ins_trig before insert on threads
for each row
begin
    declare v_id int unsigned default 0;

    select next_thread_id + 1 into v_id from forums where forum_id = new.forum_id;
    set new.thread_id = v_id;
    update forums set next_thread_id = v_id where forum_id = new.forum_id;
end#

delimiter ;
You may have noticed I've included reply_count as part of the primary key which is a bit strange as (forum_id, thread_id) composite is unique in itself. This is just an index optimisation which saves some I/O when queries that use reply_count are executed. Please refer to the 2 links above for further info on this.
Example queries
I'm still loading data into my example tables and so far I have loaded approx. 500 million rows (half as many as your system). When the load process is complete I should expect to have approx:
250 forums * 5 million threads = 1,250,000,000 (1.2 billion rows)
I've deliberately made some of the forums contain more than 5 million threads; for example, forum 65 has 15 million threads:
forum_id   title      next_thread_id
========   =====      ==============
65         forum 65   15000000 -- ooh a big one
Query runtimes
select sum(next_thread_id) from forums;
sum(next_thread_id)
===================
539,155,433 (500 million threads so far and still growing...)
Under InnoDB, summing the next_thread_id values to give a total thread count is much faster than the usual:
select count(*) from threads;
How many threads does forum 65 have:
select next_thread_id from forums where forum_id = 65
next_thread_id
==============
15,000,000 (15 million)
again this is faster than the usual:
select count(*) from threads where forum_id = 65
Ok now we know we have about 500 million threads so far and forum 65 has 15 million threads - let's see how the schema performs :)
select forum_id, thread_id from threads where forum_id = 65 and reply_count > 64 order by thread_id desc limit 32;
runtime = 0.022 secs
select forum_id, thread_id from threads where forum_id = 65 and reply_count > 1 order by thread_id desc limit 10000, 100;
runtime = 0.027 secs
Looks pretty performant to me - so that's a single table with 500+ million rows (and growing) with a query that covers 15 million rows in 0.02 seconds (while under load !)
Further optimisations
These would include:
partitioning by range
sharding
throwing money and hardware at it
etc...
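As a rough sketch of the "partitioning by range" idea applied to the example threads table above (the partition boundaries are purely illustrative; the partitioning column must be part of the primary key, which forum_id already is):

alter table threads
partition by range (forum_id)
(
    partition p0 values less than (64),
    partition p1 values less than (128),
    partition p2 values less than (192),
    partition p3 values less than maxvalue
);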
hope you find this answer helpful :)
Answered by Victor Nicollet
EDIT: Your one-column indices are not enough. You would need to, at least, cover the three involved columns.
More advanced solution: replace replycount > 1 with hasreplies = 1 by creating a new hasreplies field that equals 1 when replycount > 1. Once this is done, create an index on the three columns, in that order: INDEX(forumid, hasreplies, dateline). Make sure it's a BTREE index to support ordering.
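A minimal sketch of how that could look in SQL, assuming the thread table from the question (the index name is only illustrative, and a trigger or application code would be needed to keep hasreplies in sync afterwards):

alter table thread add column hasreplies tinyint unsigned not null default 0;

-- backfill the flag once from the existing replycount values
update thread set hasreplies = 1 where replycount > 1;

-- composite BTREE index in the order the query filters and sorts
alter table thread add index idx_forum_hasreplies_dateline (forumid, hasreplies, dateline);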
You're selecting based on:
- a given forumid
- a given hasreplies
- ordered by dateline
Once you do this, your query execution will involve:
- moving down the BTREE to find the subtree that matches forumid = X. This is a logarithmic operation (duration: log(number of forums)).
- moving further down the BTREE to find the subtree that matches hasreplies = 1 (while still matching forumid = X). This is a constant-time operation, because hasreplies is only 0 or 1.
- moving through the dateline-sorted subtree in order to get the required results, without having to read and re-sort the entire list of items in the forum.
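For illustration only, the question's original query rewritten against that index would look something like this (hasreplies being the assumed flag column described above):

select * from thread
where forumid = 100
  and hasreplies = 1
order by dateline desc
limit 10000, 100;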
My earlier suggestion to index on replycount was incorrect, because it would have been a range query and thus prevented the use of dateline to sort the results (so you would have selected the threads with replies very fast, but the resulting million-line list would have had to be sorted completely before looking for the 100 elements you needed).
IMPORTANT: while this improves performance in all cases, your huge OFFSET value (10000!) is going to decrease performance, because MySQL does not seem to be able to skip ahead despite reading straight through a BTREE. So, the larger your OFFSET is, the slower the request will become.
I'm afraid the OFFSET problem is not automagically solved by spreading the computation over several computers (how do you skip an offset in parallel, anyway?) or by moving to NoSQL. All solutions (including NoSQL ones) will boil down to simulating OFFSET based on dateline (basically saying dateline > Y LIMIT 100 instead of LIMIT Z, 100, where Y is the date of the item at offset Z). This works, and eliminates any performance issues related to the offset, but prevents going directly to page 100 out of 200.
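A rough sketch of that simulated OFFSET (keyset pagination), using the question's schema; @last_dateline stands in for the dateline of the last row on the previous page, and because the question sorts by dateline desc the comparison flips to <:

set @last_dateline = 1291500000;  -- placeholder: dateline of the last row already shown

select * from thread
where forumid = 100
  and hasreplies = 1
  and dateline < @last_dateline   -- "dateline > Y" becomes "<" for DESC ordering
order by dateline desc
limit 100;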
Answered by Tõnu Samuel
There is a part of the question related to the NoSQL vs. MySQL choice. Actually there is one fundamental thing hidden here: the SQL language is easy for a human to write and a bit difficult for a computer to read. In high-volume databases I would recommend avoiding an SQL backend, as it requires an extra step - command parsing. I have done extensive benchmarking, and there are cases where the SQL parser is the slowest point. There is nothing you can do about it. OK, you can possibly use pre-parsed statements and access them.
BTW, it is not widely known, but MySQL grew out of a NoSQL database. The company where the authors of MySQL, David and Monty, worked was a data warehousing company, and they often had to write custom solutions for uncommon tasks. This led to a big stack of homebrew C libraries used to hand-write database functions when Oracle and others were performing poorly. SQL was added to this nearly 20-year-old zoo in 1996, for fun. What came after, you know.
Actually you can avoid the SQL overhead with MySQL. But usually SQL parsing is not the slowest part; it is just good to know about. To test parser overhead you can simply benchmark "SELECT 1", for example ;).
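As a rough illustration of "pre-parsed statements", MySQL's server-side prepared statements parse the SQL text once and can then be re-executed with different parameters (the statement name and values below are placeholders):

prepare get_threads from
    'select * from thread where forumid = ? and replycount > 1
     order by dateline desc limit 100';

set @forum = 100;
execute get_threads using @forum;   -- no re-parsing of the statement text here

deallocate prepare get_threads;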
Answered by Dan Grossman
You should not be trying to fit a database architecture to hardware you're planning to buy, but instead plan to buy hardware to fit your database architecture.
Once you have enough RAM to keep the working set of indexes in memory, all your queries that can make use of indexes will be fast. Make sure your key buffer is set large enough to hold the indexes.
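One hedged way to check whether the indexes (and hot data) can fit in RAM is to measure their size from information_schema, for example:

select table_schema,
       round(sum(data_length)  / 1024 / 1024 / 1024, 1) as data_gb,
       round(sum(index_length) / 1024 / 1024 / 1024, 1) as index_gb
from information_schema.tables
group by table_schema;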
So if 12GB is not enough, don't use 10 servers with 12GB of RAM, use fewer with 32GB or 64GB of RAM.
Answered by descent89
Indices are a must - but remember to choose the right type of index: BTREE is more suitable when using queries with "<" or ">" in your WHERE clauses, while HASH is more suitable when you have many distinct values in one column and you are using "=" or "<=>" in your WHERE clause.
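For reference, a small sketch of requesting an index type explicitly; note that InnoDB and MyISAM only implement BTREE, so HASH is really only available for MEMORY (and NDB) tables - the lookup table below is purely illustrative:

create index idx_dateline on thread (dateline) using btree;

create table thread_lookup
(
    hash char(32) not null,
    index idx_hash using hash (hash)
) engine=memory;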
Further reading: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html