How to speed up MongoDB Inserts/sec?

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA terms and attribute it to the original authors (not me). Original source: StackOverflow, http://stackoverflow.com/questions/7265176/

Tags: perl, mongodb

Asked by EhevuTov

I'm trying to maximize inserts per second. I currently get around 20k inserts/sec. My performance is actually degrading the more threads and CPU I use (I have 16 cores available). 2 threads currently do more per sec than 16 threads on a 16 core dual processor machine. Any ideas on what the problem is? Is it because I'm using only one mongod? Is it indexing that could be slowing things down? Do I need to use sharding? I wonder if there's a way to shard, but also keep the database capped...

Constraints: must handle around 300k inserts/sec, must be self-limiting (capped), must be query-able relatively quickly

Problem Space: must handle call records for a major cellphone company (around 300k inserts/sec) and make those call records query-able for as long as possible (a week, for instance)

#!/usr/bin/perl

use strict;
use warnings;
use threads;
use threads::shared;

use MongoDB;
use Tie::IxHash;    # ordered hashes, for commands where key order matters
use boolean;        # real boolean values for driver options
use Time::HiRes;

# Connect to the local mongod and get the target database and collection.
my $conn = MongoDB::Connection->new;

my $db = $conn->tutorial;

my $users = $db->users;

# Create the capped collection. Capped collections also need a "size"
# in bytes ("max" alone is not accepted); the value here is illustrative.
my $cmd = Tie::IxHash->new(
    "create"    => "users",
    "capped"    => true,
    "size"      => 10 * 1024 * 1024 * 1024,    # 10 GB cap
    "max"       => 10000000,
    );

$db->run_command($cmd);

# Index the "name" field (an illustrative choice); the background option
# keeps the index build from blocking other operations.
my $idx = Tie::IxHash->new(
    "name" => 1,
);
$users->ensure_index( $idx, { "background" => true } );


# Sample document; note that every element of @array below references
# this same hash.
my $myhash =
    {
        "name"  => "James",
        "age"   => 31,
        #    "likes" => [qw/Danielle biking food games/]
    };

my $j : shared = 0;    # shared counter of documents queued per thread

my $numthread = 2;  # how many threads to run

# Build a batch of 100,000 documents.
my @array;
for (1..100000) {
    push (@array, $myhash);
    $j++;
}

# Each thread sends the whole 100k-document batch in one driver call.
sub thInsert {
    $users->batch_insert(\@array);
}

my @threads;

my $timestart = Time::HiRes::time();
push @threads, threads->new(\&thInsert) for 1..$numthread;
$_->join foreach @threads; # wait for all threads to finish
print (($j*$numthread) . "\n");    # total documents inserted
my $timeend = Time::HiRes::time();

print( (($j*$numthread)/($timeend - $timestart)) . "\n");    # inserts/sec

$users->drop();
$db->drop();

Accepted answer by Chris Fulstow

Writes to MongoDB currently acquire a global write lock, although collection-level locking is hopefully coming soon. By using more threads, you're likely introducing more concurrency problems, as the threads block each other while they wait for the lock to be released.

Indexes will also slow you down. To get the best insert performance, it's ideal to add them after you've loaded your data; however, this isn't always possible, for example if you're using a unique index.

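As a rough sketch with the same Perl driver the question uses (the "name" index field and the sample data are illustrative assumptions, not from the original post), deferring index creation until after the load looks like this:

use MongoDB;
use boolean;

my $conn  = MongoDB::Connection->new;
my $users = $conn->tutorial->users;

# Stand-in batch for the real call records.
my @records = map { { "name" => "user$_" } } 1 .. 100_000;

# Load everything first, with no secondary indexes in place...
$users->batch_insert(\@records);

# ...then build the index in a single pass; the background option keeps
# the build from blocking other operations on the database.
$users->ensure_index( { "name" => 1 }, { "background" => true } );
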
To really maximise write performance, your best bet is sharding. This'll give you much better concurrency and higher disk I/O capacity as you distribute writes across several machines.

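To make that concrete, here is a minimal sketch of the standard sharding commands issued through the Perl driver against a mongos router; the host name, namespace, and shard key are all assumptions for illustration:

use MongoDB;
use Tie::IxHash;

# Connect to a mongos router, not a single mongod.
my $conn  = MongoDB::Connection->new( host => 'mongos-host:27017' );
my $admin = $conn->get_database('admin');

# Enable sharding for the database, then shard the collection.
# Tie::IxHash keeps the command name as the first key, as required.
$admin->run_command( { enablesharding => 'tutorial' } );
$admin->run_command( Tie::IxHash->new(
    shardcollection => 'tutorial.users',
    key             => { name => 1 },
) );
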
Answer by Thilo

2 threads currently do more per sec than 16 threads on a 16 core dual processor machine.

MongoDB inserts cannot be done concurrently. Every insert needs to acquire a write lock. Not sure if that is a global or a per-collection lock, but in your case that would not make a difference.

So making this program multi-threaded does not make much sense as soon as Mongo becomes the bottleneck.

Do I need to use sharding?

You cannot shard a capped collection.

Answer by jeffsaracco

I've noticed that building the index after inserting helps.

Answer by Karoly Horvath

Uhmm... you won't get that much performance from one MongoDB server.

0.3M * 60 * 60 * 24 = 26G records/day, 180G records/week. I guess your record size is around 100 bytes, so that's 2.6TB of data/day. I don't know what field(s) you use for indexing, but I doubt they're below 10-20 bytes, so just the daily index is going to run to hundreds of gigabytes, not to mention the whole week... the index won't fit into memory; with lots of queries, that's a good recipe for disaster.

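Spelling that arithmetic out (taking the answer's guesses, ~100 bytes per record and ~10 bytes per index entry, as assumptions and ignoring B-tree overhead), the daily index alone comes out around 260 GB:

# Back-of-envelope capacity math for 300k inserts/sec.
my $per_day  = 300_000 * 60 * 60 * 24;    # 25_920_000_000 records/day (~26G)
my $per_week = $per_day * 7;              # ~181G records/week
my $data_day = $per_day * 100;            # bytes/day at ~100 B/record (~2.6 TB)
my $idx_day  = $per_day * 10;             # bytes/day at ~10 B/index entry (~260 GB)

printf "%.1fG records/day, %.0fG records/week\n", $per_day / 1e9, $per_week / 1e9;
printf "%.1f TB of data/day, %.0f GB of index/day\n", $data_day / 1e12, $idx_day / 1e9;
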
You should do manual sharding, partitioning the data based on the search field(s). It's a major telecom company, so you should do replication. Buy a lot of single/dual-core machines; you only need the cores for the main (Perl?) server.

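A hedged sketch of such manual partitioning in Perl - the host list, namespace, and the caller field used for routing are all invented for illustration:

use MongoDB;

# N independent mongod instances; each record is routed to exactly one
# of them by a cheap, stable hash of the search field.
my @colls = map {
    MongoDB::Connection->new( host => $_ )
        ->get_database('calls')
        ->get_collection('records');
} map { "mongo$_:27017" } 1 .. 4;

sub insert_record {
    my ($rec) = @_;
    my $h = 0;
    $h = ( $h * 31 + ord($_) ) % 2**31 for split //, $rec->{caller};
    # The same caller always lands on the same box, so lookups by caller
    # only have to query one instance.
    $colls[ $h % @colls ]->insert($rec);
}
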
BTW how do you query the data? Could you use a key-value store?

Answer by Ben Mackey

Why don't you manually cap the collection? You could shard across multiple machines and apply the indexes you need for the queries, and then every hour or so delete the unwanted documents.

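A minimal sketch of that periodic cleanup with the Perl driver, assuming each record carries a "ts" field holding epoch seconds (the field name and the one-week retention are illustrative):

use MongoDB;

my $users = MongoDB::Connection->new->tutorial->users;

my $retention = 7 * 24 * 60 * 60;    # keep one week of records

# Run this from cron (or a loop) every hour or so: drop everything
# older than the retention window.
$users->remove( { "ts" => { '$lt' => time() - $retention } } );
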
The bottleneck you have is most likely the global lock - I have seen this happen in my evaluation of MongoDB for an insert-heavy time-series data application. You need to make sure the shard key is not the timestamp; otherwise all the inserts will execute sequentially on the same machine instead of being distributed across multiple machines.

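In other words, shard on a high-cardinality field such as the caller number rather than the timestamp; a sketch, with all names assumed for illustration:

use MongoDB;
use Tie::IxHash;

my $admin = MongoDB::Connection->new( host => 'mongos-host:27017' )
                               ->get_database('admin');

# Good: concurrent inserts for different callers spread across shards.
$admin->run_command( Tie::IxHash->new(
    shardcollection => 'tutorial.calls',
    key             => { caller => 1 },
) );

# Sharding on a monotonically increasing timestamp would instead route
# every new insert to the same "hot" chunk on a single shard.
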
Answer by Eren Güven

The write lock on MongoDB is global, but quoting this: "collection-level locking coming soon".

Do I need to use sharding?

Not so easy to answer. If what you can get out of one mongod is not meeting your requirements, you kind of have to, since sharding is the only way to scale writes on MongoDB (writes on different instances will not block each other).
