Windows: how to obtain good concurrent read performance from disk
Disclaimer: the question and answers below are reproduced from a popular Stack Overflow thread under the CC BY-SA 4.0 license. You are free to use and share them, but you must do so under the same CC BY-SA terms and attribute them to the original authors (not me) on StackOverFlow.
Original question: http://stackoverflow.com/questions/9191/
How to obtain good concurrent read performance from disk
Asked by pauldoo
I'd like to ask a question then follow it up with my own answer, but also see what answers other people have.
We have two large files which we'd like to read from two separate threads concurrently. One thread will sequentially read fileA while the other thread will sequentially read fileB. There is no locking or communication between the threads, both are sequentially reading as fast as they can, and both are immediately discarding the data they read.
Our experience with this setup on Windows is very poor. The combined throughput of the two threads is in the order of 2-3 MiB/sec. The drive seems to be spending most of its time seeking backwards and forwards between the two files, presumably reading very little after each seek.
If we disable one of the threads and temporarily look at the performance of a single thread then we get much better bandwidth (~45 MiB/sec for this machine). So clearly the bad two-thread performance is an artefact of the OS disk scheduler.
Is there anything we can do to improve the concurrent thread read performance? Perhaps by using different APIs or by tweaking the OS disk scheduler parameters in some way.
Some details:
The files are in the order of 2 GiB each on a machine with 2 GiB of RAM. For the purpose of this question we consider them to be uncached and perfectly defragmented. We have used defrag tools and rebooted to ensure this is the case.
We are using no special APIs to read these files. The behaviour is repeatable across various bog-standard APIs such as Win32's CreateFile, C's fopen, C++'s std::ifstream, Java's FileInputStream, etc.
Each thread spins in a loop making calls to the read function. We have varied the number of bytes requested from the API on each iteration, from 1 KiB up to 128 MiB. Varying this has had no effect, so clearly the amount the OS is physically reading after each disk seek is not dictated by this number. This is exactly what should be expected.
The dramatic difference between one-thread and two-thread performance is repeatable across Windows 2000, Windows XP (32-bit and 64-bit), Windows Server 2003, and also with and without hardware RAID5.
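To make the setup concrete, a minimal sketch of this kind of test harness (with hypothetical file names and a 1 MiB chunk size as placeholders) might look like:

    #include <cstddef>
    #include <fstream>
    #include <thread>
    #include <vector>

    // Read the whole file sequentially, throwing the data away (hypothetical file names).
    static void readSequentially(const char* path, std::size_t chunkSize) {
        std::ifstream in(path, std::ios::binary);
        std::vector<char> buffer(chunkSize);
        while (in) {
            in.read(buffer.data(), buffer.size());   // data is discarded immediately
        }
    }

    int main() {
        const std::size_t chunk = 1 << 20;            // varied from 1 KiB to 128 MiB in the tests
        std::thread a(readSequentially, "fileA.bin", chunk);
        std::thread b(readSequentially, "fileB.bin", chunk);
        a.join();
        b.join();
    }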
Accepted answer by Andrea Bertani
The problem seems to be in the Windows I/O scheduling policy. According to what I found here there are many ways for an O.S. to schedule disk requests. While Linux and others can choose between different policies, before Vista Windows was locked into a single policy: a FIFO queue, where all requests were split into 64 KB blocks. I believe that this policy is the cause of the problem you are experiencing: the scheduler will mix requests from the two threads, causing continuous seeking between different areas of the disk.
Now, the good news is that according to here and here, Vista introduced a smarter disk scheduler, where you can set the priority of your requests and also allocate a minimum bandwidth for your process.
The bad news is that I found no way to change the disk policy or buffer sizes in previous versions of Windows. Also, even if raising the disk I/O priority of your process boosts its performance relative to other processes, you still have the problem of your threads competing against each other.
What I can suggest is to modify your software by introducing a self-made disk access policy.
For example, you could use a policy like this in your thread B (similar for Thread A):
- If THREAD A is reading from disk, then wait for THREAD A to stop reading, or wait for X ms.
- Read for X ms (or Y MB).
- Stop reading and check the status of THREAD A again.
You could use semaphores for status checking, or you could use perfmon counters to get the status of the actual disk queue. The values of X and/or Y could also be auto-tuned by checking the actual transfer rates and slowly modifying them, thus maximizing the throughput when the application runs on different machines and/or operating systems. You may find that cache, memory or RAID levels affect them in one way or another, but with auto-tuning you will always get the best performance in every scenario.
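As a rough illustration of this policy, a sketch of one reader thread might look like the following, assuming a fixed 200 ms time slice and a plain std::mutex standing in for the semaphore (the auto-tuning of X/Y and a fair hand-off between the threads are left out):

    #include <chrono>
    #include <fstream>
    #include <mutex>
    #include <vector>

    std::mutex diskTurn;   // shared by thread A and thread B; stands in for the semaphore

    // Reads one file sequentially, but only while holding the shared "disk turn" lock,
    // and for at most one time slice per acquisition.
    void readWithPolicy(const char* path) {
        std::ifstream in(path, std::ios::binary);
        std::vector<char> buffer(1 << 20);                      // 1 MiB per read call
        const auto slice = std::chrono::milliseconds(200);      // the "X ms" above

        while (in) {
            std::lock_guard<std::mutex> turn(diskTurn);         // wait for the other thread
            const auto start = std::chrono::steady_clock::now();
            while (in && std::chrono::steady_clock::now() - start < slice) {
                in.read(buffer.data(), buffer.size());          // data discarded, as in the test
            }
        }                                                       // lock released: other thread's turn
    }

Each of the two threads would run readWithPolicy on its own file; in practice a semaphore or condition variable would give a fairer hand-off than a bare mutex.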
Answered by pauldoo
I'd like to add some further notes in my response. All other non-Microsoft operating systems we have tested do not suffer from this problem. Linux, FreeBSD, and Mac OS X (this final one on different hardware) all degrade much more gracefully in terms of aggregate bandwidth when moving from one thread to two. Linux, for example, degraded from ~45 MiB/sec to ~42 MiB/sec. These other operating systems must be reading larger chunks of the file between each seek, and therefore not spending nearly all their time waiting on the disk to seek.
Our solution for Windows is to pass the FILE_FLAG_NO_BUFFERING flag to CreateFile and use large (~16MiB) reads in each call to ReadFile. This is suboptimal for several reasons:
- Files don't get cached when read like this, so there are none of the advantages that caching normally gives.
- The constraints when working with this flag are much more complicated than normal reading (alignment of read buffers to page boundaries, etc).
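For reference, a rough sketch of such an unbuffered read loop might look like this (a simplified illustration under the constraints listed above, with "fileA.bin" as a placeholder name):

    #include <windows.h>

    int main() {
        // FILE_FLAG_NO_BUFFERING bypasses the cache manager; reads go straight to the device.
        HANDLE file = CreateFileA("fileA.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 1;

        // The buffer address, read size and file offset must all be multiples of the
        // volume sector size; VirtualAlloc returns page-aligned memory and 16 MiB is a
        // multiple of any common sector size, so both requirements hold here.
        const DWORD chunk = 16 * 1024 * 1024;
        void* buffer = VirtualAlloc(nullptr, chunk, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        if (buffer == nullptr) { CloseHandle(file); return 1; }

        DWORD bytesRead = 0;
        while (ReadFile(file, buffer, chunk, &bytesRead, nullptr) && bytesRead > 0) {
            // The data would be consumed here; in the benchmark it is simply discarded.
        }

        VirtualFree(buffer, 0, MEM_RELEASE);
        CloseHandle(file);
        return 0;
    }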
(As a final remark: does this explain why swapping under Windows is so hellish? I.e., Windows is incapable of doing IO to multiple files concurrently with any efficiency, so while swapping, all other IO operations are forced to be disproportionately slow.)
Edit to add some further details for Will Dean:
Of course across these different hardware configurations the raw figures did change (sometimes substantially). The problem however is the consistent degradation in performance that only Windows suffers when moving from one thread to two. Here is a summary of the machines tested:
- Several Dell workstations (Intel Xeon) of various ages running Windows 2000, Windows XP (32-bit), and Windows XP (64-bit) with a single drive.
- A Dell 1U server (Intel Xeon) running Windows Server 2003 (64-bit) with RAID 1+0.
- An HP workstation (AMD Opteron) running Windows XP (64-bit) and Windows Server 2003, with hardware RAID 5.
- My home unbranded PC (AMD Athlon64) running Windows XP (32-bit), FreeBSD (64-bit), and Linux (64-bit) with a single drive.
- My home MacBook (Intel Core1) running Mac OS X, single SATA drive.
- My home KooluPC running Linux. Vastly underpowered compared to the other systems but I demonstrated that even this machine can outperform a Windows server with RAID5 when doing multi-threaded disk reads.
CPU usage on all of these systems was very low during the tests and anti-virus was disabled.
I forgot to mention before but we also tried the normal Win32 CreateFile API with the FILE_FLAG_SEQUENTIAL_SCAN flag set. This flag didn't fix the problem.
Answered by Will Dean
It does seem a little strange that you see no difference across quite a wide range of Windows versions, and nothing between a single drive and hardware RAID 5.
It's only 'gut feel', but that does make me doubtful that this is really a simple seeking problem. Other than the OS X and the RAID 5 cases, was all this tried on the same machine - have you tried another machine? Is your CPU usage basically zero during this test?
What's the shortest app you can write which demonstrates this problem? - I would be interested to try it here.
Answered by graham.reeds
Do you use IOCompletionPorts under Windows? Windows via C++ has an in-depth chapter on this subject and, as luck would have it, it is also available on MSDN.
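For anyone unfamiliar with them, a minimal hypothetical sketch of driving reads through an I/O completion port might look like this (one outstanding request at a time just to show the shape of the API; in practice you would keep several requests in flight, and 'fileA.bin' and the 1 MiB chunk size are placeholders):

    #include <windows.h>
    #include <vector>

    int main() {
        // Open the file for overlapped (asynchronous) I/O.
        HANDLE file = CreateFileA("fileA.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 1;

        // Create a completion port and associate the file handle with it.
        HANDLE port = CreateIoCompletionPort(file, nullptr, 0 /*completion key*/, 0);
        if (port == nullptr) { CloseHandle(file); return 1; }

        const DWORD chunk = 1 << 20;
        std::vector<char> buffer(chunk);
        ULONGLONG offset = 0;

        for (;;) {
            OVERLAPPED ov = {};
            ov.Offset     = static_cast<DWORD>(offset & 0xFFFFFFFFu);
            ov.OffsetHigh = static_cast<DWORD>(offset >> 32);

            // Issue the read; with FILE_FLAG_OVERLAPPED this normally returns FALSE with
            // ERROR_IO_PENDING and the completion arrives later on the port.
            if (!ReadFile(file, buffer.data(), chunk, nullptr, &ov) &&
                GetLastError() != ERROR_IO_PENDING) {
                break;                                   // end of file or a real error
            }

            // Wait for the completion packet for this request.
            DWORD bytes = 0;
            ULONG_PTR key = 0;
            LPOVERLAPPED completed = nullptr;
            if (!GetQueuedCompletionStatus(port, &bytes, &key, &completed, INFINITE) ||
                bytes == 0) {
                break;
            }
            offset += bytes;                             // the data would be processed here
        }

        CloseHandle(port);
        CloseHandle(file);
        return 0;
    }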
Answered by Will Dean
Paul - saw the update. Very interesting.
It would be interesting to try it on Vista or Win2008, as people seem to be reporting some considerable I/O improvements on these in some circumstances.
My only suggestion about a different API would be to try memory mapping the files - have you tried that? Unfortunately at 2GB per file, you're not going to be able to map multiple whole files on a 32-bit machine, which means this isn't quite as trivial as it might be.
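To make that concrete, a hypothetical sketch of reading a file through a sliding memory-mapped view (so that only a bounded window is mapped at any one time, which is what a 32-bit process would need for 2 GB files) might look like this; 'fileA.bin' and the 64 MiB window size are placeholders:

    #include <windows.h>

    int main() {
        HANDLE file = CreateFileA("fileA.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 1;

        LARGE_INTEGER size = {};
        if (!GetFileSizeEx(file, &size)) { CloseHandle(file); return 1; }

        HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
        if (mapping == nullptr) { CloseHandle(file); return 1; }

        // The view offset must be a multiple of the 64 KiB allocation granularity,
        // which a 64 MiB stride satisfies.
        const ULONGLONG window = 64ull * 1024 * 1024;
        const ULONGLONG total = static_cast<ULONGLONG>(size.QuadPart);
        volatile char sink = 0;   // keeps the page-touching loop from being optimised away

        for (ULONGLONG offset = 0; offset < total; offset += window) {
            const SIZE_T len = static_cast<SIZE_T>(
                total - offset < window ? total - offset : window);
            const char* view = static_cast<const char*>(
                MapViewOfFile(mapping, FILE_MAP_READ,
                              static_cast<DWORD>(offset >> 32),
                              static_cast<DWORD>(offset & 0xFFFFFFFFu), len));
            if (view == nullptr) break;
            for (SIZE_T i = 0; i < len; i += 4096) sink ^= view[i];   // touch each page
            UnmapViewOfFile(view);
        }

        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }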
Answered by Stacey Richards
I would create some kind of in-memory, thread-safe lock. Each thread could wait on the lock until it is free. When the lock becomes free, take the lock and read the file for a defined length of time or a defined amount of data, then release the lock for any other waiting threads.