Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use it, you must share it under the same license and attribute it to the original authors (not me): Stack Overflow.
Original question: http://stackoverflow.com/questions/3947889/
MongoDB: Terrible MapReduce Performance
Asked by mellowsoon
I have a long history with relational databases, but I'm new to MongoDB and MapReduce, so I'm almost positive I must be doing something wrong. I'll jump right into the question. Sorry if it's long.
I have a database table in MySQL that tracks the number of member profile views for each day. For testing it has 10,000,000 rows.
CREATE TABLE `profile_views` (
`id` int(10) unsigned NOT NULL auto_increment,
`username` varchar(20) NOT NULL,
`day` date NOT NULL,
`hits` int(10) unsigned default '0',
PRIMARY KEY (`id`),
UNIQUE KEY `username` (`username`,`day`),
KEY `day` (`day`)
) ENGINE=InnoDB;
Typical data might look like this.
+--------+----------+------------+------+
| id | username | day | hits |
+--------+----------+------------+------+
| 650001 | Joe | 2010-07-10 | 1 |
| 650002 | Jane | 2010-07-10 | 2 |
| 650003 | Hyman | 2010-07-10 | 3 |
| 650004 | Jerry | 2010-07-10 | 4 |
+--------+----------+------------+------+
I use this query to get the top 5 most viewed profiles since 2010-07-16.
SELECT username, SUM(hits)
FROM profile_views
WHERE day > '2010-07-16'
GROUP BY username
ORDER BY SUM(hits) DESC
LIMIT 5\G
This query completes in under a minute. Not bad!
Now moving onto the world of MongoDB. I set up a sharded environment using 3 servers: M, S1, and S2. I used the following commands to set the rig up (note: I've obscured the IP addresses).
S1 => 127.20.90.1
./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log
S2 => 127.20.90.7
./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log
M => 127.20.4.1
./mongod --fork --configsvr --dbpath=/data/db --logpath=/data/log
./mongos --fork --configdb 127.20.4.1 --chunkSize 1 --logpath=/data/slog
Once those were up and running, I hopped on server M, and launched mongo. I issued the following commands:
use admin
db.runCommand( { addshard : "127.20.90.1:10000", name: "M1" } );
db.runCommand( { addshard : "127.20.90.7:10000", name: "M2" } );
db.runCommand( { enablesharding : "profiles" } );
db.runCommand( { shardcollection : "profiles.views", key : {day : 1} } );
use profiles
db.views.ensureIndex({ hits: -1 });
I then imported the same 10,000,000 rows from MySQL, which gave me documents that look like this:
{
"_id" : ObjectId("4cb8fc285582125055295600"),
"username" : "Joe",
"day" : "Fri May 21 2010 00:00:00 GMT-0400 (EDT)",
"hits" : 16
}
Now comes the real meat and potatoes: my map and reduce functions. Back on server M, in the shell, I set up the query and execute it like this.
use profiles;
var start = new Date(2010, 7, 16);
var map = function() {
emit(this.username, this.hits);
}
var reduce = function(key, values) {
var sum = 0;
for(var i in values) sum += values[i];
return sum;
}
res = db.views.mapReduce(
map,
reduce,
{
query : { day: { $gt: start }}
}
);
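As a sanity check on what this map/reduce pair actually computes, here is a plain-JavaScript simulation of the emit/group/reduce cycle, runnable in Node without MongoDB (the sample documents below are made up for illustration):

```javascript
// Plain-JS simulation of MongoDB's map/reduce cycle for the functions
// above. Sample documents are invented; no MongoDB required.
const docs = [
  { username: "Joe",  day: new Date(2010, 7, 17), hits: 1 },
  { username: "Joe",  day: new Date(2010, 7, 18), hits: 4 },
  { username: "Jane", day: new Date(2010, 7, 17), hits: 2 },
];

// The shell's emit() collects (key, value) pairs produced by map.
const emitted = [];
function emit(key, value) { emitted.push([key, value]); }

// map: runs with each document as `this` and emits (username, hits).
const map = function () { emit(this.username, this.hits); };
docs.forEach((doc) => map.call(doc));

// The engine groups emitted values by key...
const groups = {};
for (const [key, value] of emitted) {
  (groups[key] = groups[key] || []).push(value);
}

// ...then calls reduce once per key with the grouped values.
const reduce = function (key, values) {
  let sum = 0;
  for (const i in values) sum += values[i];
  return sum;
};
const result = Object.entries(groups).map(
  ([k, v]) => ({ _id: k, value: reduce(k, v) })
);
console.log(result); // [ { _id: 'Joe', value: 5 }, { _id: 'Jane', value: 2 } ]
```

The output documents have the `{ _id, value }` shape that mapReduce writes to its result collection.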
And here's where I run into problems. This query took over 15 minutes to complete! The MySQL query took under a minute. Here's the output:
{
"result" : "tmp.mr.mapreduce_1287207199_6",
"shardCounts" : {
"127.20.90.7:10000" : {
"input" : 4917653,
"emit" : 4917653,
"output" : 1105648
},
"127.20.90.1:10000" : {
"input" : 5082347,
"emit" : 5082347,
"output" : 1150547
}
},
"counts" : {
"emit" : NumberLong(10000000),
"input" : NumberLong(10000000),
"output" : NumberLong(2256195)
},
"ok" : 1,
"timeMillis" : 811207,
"timing" : {
"shards" : 651467,
"final" : 159740
}
}
Not only did it take forever to run, but the results don't even seem to be correct.
db[res.result].find().sort({ hits: -1 }).limit(5);
{ "_id" : "Joe", "value" : 128 }
{ "_id" : "Jane", "value" : 2 }
{ "_id" : "Jerry", "value" : 2 }
{ "_id" : "Hyman", "value" : 2 }
{ "_id" : "Jessy", "value" : 3 }
I know those value numbers should be much higher.
My understanding of the whole MapReduce paradigm is the task of performing this query should be split between all shard members, which should increase performance. I waited till Mongo was done distributing the documents between the two shard servers after the import. Each had almost exactly 5,000,000 documents when I started this query.
So I must be doing something wrong. Can anyone give me any pointers?
Edit: Someone on IRC mentioned adding an index on the day field, but as far as I can tell that was done automatically by MongoDB.
Answer by nonopolarity
Excerpts from MongoDB: The Definitive Guide (O'Reilly):
The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in “real time.” You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real time.
options for map/reduce:
"keeptemp" : boolean
If the temporary result collection should be saved when the connection is closed.
"output" : string
Name for the output collection. Setting this option implies keeptemp : true.
Answer by FrameGrace
Maybe I'm too late, but...
First, you are querying the collection to fill the MapReduce without an index. You should create an index on "day".
MongoDB MapReduce is single-threaded on a single server, but it parallelizes across shards. The data in mongo shards is kept together in contiguous chunks sorted by sharding key.
As your sharding key is "day", and you are querying on it, you are probably only using one of your three servers. The sharding key is only used to spread the data. Map/reduce will query using the "day" index on each shard, and will be very fast.
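To make the routing concrete, here is a toy plain-JavaScript model of range-based chunk assignment (the chunk boundaries are invented for illustration; real chunk splits are chosen by MongoDB's balancer):

```javascript
// Toy model of range sharding: contiguous shard-key ranges ("chunks")
// are assigned to shards. Boundaries below are made up for illustration.
function shardFor(key, boundaries) {
  // Return the index of the shard whose chunk range contains `key`.
  let shard = 0;
  for (let i = 0; i < boundaries.length; i++) {
    if (key >= boundaries[i]) shard = i + 1;
  }
  return shard;
}

const docs = [
  { username: "Jane", day: "2010-07-17" },
  { username: "Joe",  day: "2010-07-17" },
  { username: "Zoe",  day: "2010-07-17" },
];

// Shard key {day: 1}: every document for the same day has the same key,
// so the whole day's workload lands on one shard.
const byDay = docs.map(d => shardFor(d.day, ["2010-07-16"]));

// Shard key {username: 1, day: 1}: the leading username field spreads
// the same day's documents across chunks.
const byUserDay = docs.map(d => shardFor(d.username, ["K"]));

console.log(byDay);      // [ 1, 1, 1 ] -- one shard does all the work
console.log(byUserDay);  // [ 0, 0, 1 ] -- work is split across shards
```

The same reasoning explains why the suggested compound key below lets all shards participate in the map/reduce.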
Add something in front of the day key to spread the data. The username can be a good choice.
That way the map/reduce will be launched on all servers, hopefully cutting the time by a factor of three.
Something like this:
use admin
db.runCommand( { addshard : "127.20.90.1:10000", name: "M1" } );
db.runCommand( { addshard : "127.20.90.7:10000", name: "M2" } );
db.runCommand( { enablesharding : "profiles" } );
db.runCommand( { shardcollection : "profiles.views", key : {username : 1,day: 1} } );
use profiles
db.views.ensureIndex({ hits: -1 });
db.views.ensureIndex({ day: -1 });
I think with those additions you can match MySQL's speed, or even beat it.
Also, it's better not to use it in real time. If your data doesn't need to be up-to-the-minute precise, schedule a map/reduce task every now and then and query the result collection instead.
Answer by Joris Bontje
You are not doing anything wrong. (Besides sorting on the wrong value as you already noticed in your comments.)
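The "wrong value" refers to the sort field: the M/R output documents are shaped { _id, value }, but the question's final query sorts on hits, which doesn't exist in that collection. A plain-JavaScript sketch of the corrected sort, using the values from the question's output:

```javascript
// M/R output docs have the shape { _id, value }; sorting on a missing
// "hits" field returns them in arbitrary order. Sort on "value" instead,
// i.e. db[res.result].find().sort({ value: -1 }).limit(5);
const out = [
  { _id: "Joe",   value: 128 },
  { _id: "Jane",  value: 2 },
  { _id: "Jerry", value: 2 },
  { _id: "Hyman", value: 2 },
  { _id: "Jessy", value: 3 },
];

const top5 = out.slice().sort((a, b) => b.value - a.value).slice(0, 5);
console.log(top5[0]); // { _id: 'Joe', value: 128 }
```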
MongoDB map/reduce performance just isn't that great. This is a known issue; see for example http://jira.mongodb.org/browse/SERVER-1197 where a naive approach is ~350x faster than M/R.
One advantage though is that you can specify a permanent output collection name with the out argument of the mapReduce call. Once the M/R is completed, the temporary collection is renamed to the permanent name atomically. That way you can schedule your statistics updates and query the M/R output collection in real time.
Answer by Rogerio Hilbert
Have you already tried using the Hadoop connector for MongoDB?
Look at this link here: http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
Since you are using only 3 shards, I don't know whether this approach would improve your case.