C# Multicore Text File Parsing
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/7015/
Multicore Text File Parsing
Asked by lomaxx
I have a quad-core machine and would like to write some code that parses a text file while taking advantage of all four cores. The text file basically contains one record per line.
Multithreading isn't my forte, so I'm wondering if anyone could give me some patterns I might be able to use to parse the file in an optimal manner.
My first thought is to read all the lines into some sort of queue and then spin up threads to pull lines off the queue and process them, but that means the whole queue would have to sit in memory, and these are fairly large files, so I'm not so keen on that idea.
My next thought is to have some sort of controller that reads in a line and assigns it to a thread for parsing, but I'm not sure whether the controller will end up being a bottleneck if the threads can process lines faster than it can read and assign them.
I know there's probably a simpler solution than both of these, but at the moment I'm just not seeing it.
Accepted answer by Mike Minutillo
I'd go with your original idea. If you are concerned that the queue might get too large, implement a buffer zone for it (i.e. if it gets above 100 lines, stop reading the file, and if it drops back below 20, start reading again; you'd need to do some testing to find the optimal thresholds). Make it so that any of the threads can potentially be the "reader thread": since a thread has to lock the queue to pull an item off anyway, it can also check whether the "low buffer" mark has been hit and start reading again. While it's doing this, the other threads can work through the rest of the queue.
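A minimal C# sketch of this bounded producer/consumer queue, assuming a modern .NET runtime with System.Collections.Concurrent (the file path, capacity of 100, and ParseLine are placeholders for your own setup). A BlockingCollection with bounded capacity gives you the watermark behavior for free: Add blocks the reader whenever the queue is full and resumes as workers drain it, so the explicit 100/20 bookkeeping isn't needed.

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    class BoundedQueueParser
    {
        static void ParseLine(string line) { /* your per-record parsing goes here */ }

        static void Main()
        {
            // Bounded capacity acts as the high-water mark: Add() blocks
            // the reader whenever 100 lines are already queued.
            var queue = new BlockingCollection<string>(boundedCapacity: 100);

            // Reader: a single producer pushing lines into the queue.
            var reader = Task.Run(() =>
            {
                foreach (var line in File.ReadLines("records.txt")) // placeholder path
                    queue.Add(line);
                queue.CompleteAdding(); // tell consumers no more lines are coming
            });

            // Workers: one consumer per core pulling lines off the queue.
            var workers = new Task[Environment.ProcessorCount];
            for (int i = 0; i < workers.Length; i++)
                workers[i] = Task.Run(() =>
                {
                    foreach (var line in queue.GetConsumingEnumerable())
                        ParseLine(line);
                });

            Task.WaitAll(workers);
            reader.Wait();
        }
    }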
Or if you prefer, have one reader thread assign the lines to three other processor threads (via their own queues) and implement a work-stealing strategy. I've never done this so I don't know how hard it is.
Answered by Chris Jester-Young
My experience is with Java, not C#, so apologies if these solutions don't apply.
The immediate solution I can think up off the top of my head would be to have an executor that runs 3 threads (using Executors.newFixedThreadPool, say). For each line/record read from the input file, fire off a job at the executor (using ExecutorService.submit). The executor will queue requests for you, and allocate between the 3 threads.
Probably better solutions exist, but hopefully that will do the job. :-)
ETA: Sounds a lot like Wolfbyte's second solution. :-)
ETA2: System.Threading.ThreadPool sounds like a very similar idea in .NET. I've never used it, but it may be worth your while!
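For reference, a rough C# counterpart of that executor idea using System.Threading.ThreadPool; a sketch only, assuming .NET 4+ for CountdownEvent, with ParseLine and the path as placeholders:

    using System;
    using System.IO;
    using System.Threading;

    class ThreadPoolParser
    {
        static void ParseLine(string line) { /* your per-record parsing goes here */ }

        static void Main()
        {
            using var pending = new CountdownEvent(1); // tracks outstanding work items

            foreach (var line in File.ReadLines("records.txt")) // placeholder path
            {
                pending.AddCount();
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try { ParseLine((string)state); }
                    finally { pending.Signal(); }
                }, line);
            }

            pending.Signal(); // release the initial count
            pending.Wait();   // block until every queued line is parsed
        }
    }

Note that this queues one work item per line, so for a very large file the pool's internal queue ends up holding every line in memory, which is exactly the concern raised in the question; the bounded queue above avoids that.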
Answered by Mark Harrison
This will eliminate the bottleneck of having a single thread do the reading:
open file
for each thread n = 0,1,2,3:
    seek to file offset (n/4) * filesize
    scan forward to the start of the next complete line
    process all lines in your quarter of the file
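A C# sketch of that scheme, under the assumptions that records are '\n'-terminated single-byte text and that a line belongs to the thread whose range contains its first byte (FilePath and ParseLine are placeholders):

    using System;
    using System.IO;
    using System.Text;
    using System.Threading.Tasks;

    class ChunkedParser
    {
        const string FilePath = "records.txt"; // placeholder path

        static void ParseLine(string line) { /* your per-record parsing goes here */ }

        static void Main()
        {
            long fileSize = new FileInfo(FilePath).Length;
            const int threads = 4;

            Parallel.For(0, threads, n =>
            {
                long start = fileSize * n / threads;
                long end = fileSize * (n + 1) / threads;
                ProcessRange(start, end);
            });
        }

        static void ProcessRange(long start, long end)
        {
            using var stream = File.OpenRead(FilePath); // each thread opens its own handle

            if (start > 0)
            {
                // Peek at the byte before our range: if it isn't a newline we
                // landed mid-line, so skip the remainder of that line; the
                // previous thread processes the line straddling the boundary.
                stream.Seek(start - 1, SeekOrigin.Begin);
                if (stream.ReadByte() != '\n')
                    ReadRawLine(stream);
            }

            // Process every line whose first byte falls inside [start, end).
            while (stream.Position < end)
            {
                string line = ReadRawLine(stream);
                if (line == null) break; // end of file
                ParseLine(line);
            }
        }

        // Reads bytes up to the next '\n' (or EOF); assumes single-byte text.
        static string ReadRawLine(Stream s)
        {
            var sb = new StringBuilder();
            int b;
            while ((b = s.ReadByte()) != -1 && b != '\n')
                if (b != '\r') sb.Append((char)b);
            return (b == -1 && sb.Length == 0) ? null : sb.ToString();
        }
    }

The peek-before-the-range rule ensures each line is parsed exactly once, even when a chunk boundary lands exactly on the start of a line.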
Answered by Derek Park
Mark's answer is the simpler, more elegant solution. Why build a complex program with inter-thread communication if it's not necessary? Spawn 4 threads. Each thread calculates size-of-file/4 to determine its start point (and stop point). Each thread can then work entirely independently.
The only reason to add a special thread to handle the reading is if you expect some lines to take a very long time to process and you expect that these lines are clustered in a single part of the file. Adding inter-thread communication when you don't need it is a very bad idea. You greatly increase the chance of introducing an unexpected bottleneck and/or synchronization bugs.
Answered by Derek Park
@lomaxx
@Derek & Mark: I wish there was a way to accept 2 answers. I'm going to have to end up going with Wolfbyte's solution because if I split the file into n sections there is the potential for a thread to come across a batch of "slow" transactions, however if I was processing a file where each process was guaranteed to require an equal amount of processing then I really like your solution of just splitting the file into chunks and assigning each chunk to a thread and being done with it.
@Derek & Mark:我希望有一种方法可以接受 2 个答案。我将不得不最终使用 Wolfbyte 的解决方案,因为如果我将文件分成 n 个部分,则线程有可能遇到一批“慢”事务,但是如果我正在处理一个文件,其中每个进程保证需要等量的处理然后我真的很喜欢你的解决方案,即将文件分成块并将每个块分配给一个线程并完成它。
No worries. If clustered "slow" transactions are an issue, then the queuing solution is the way to go. Depending on how fast or slow the average transaction is, you might also want to look at assigning multiple lines at a time to each worker. This will cut down on synchronization overhead. Likewise, you might need to optimize your buffer size. Of course, both of these are optimizations that you should probably only do after profiling. (No point in worrying about synchronization if it's not a bottleneck.)
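A hedged sketch of that batching idea, reusing the bounded queue from the accepted answer; the batch size of 64 and capacity of 16 are arbitrary starting points you'd tune by profiling:

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading.Tasks;

    class BatchingParser
    {
        static void ParseLine(string line) { /* your per-record parsing goes here */ }

        static void Main()
        {
            // Hand workers whole batches: one queue operation per 64 lines
            // instead of one per line, cutting synchronization overhead.
            var queue = new BlockingCollection<List<string>>(boundedCapacity: 16);

            var reader = Task.Run(() =>
            {
                var batch = new List<string>(64);
                foreach (var line in File.ReadLines("records.txt")) // placeholder path
                {
                    batch.Add(line);
                    if (batch.Count == 64)
                    {
                        queue.Add(batch);
                        batch = new List<string>(64);
                    }
                }
                if (batch.Count > 0) queue.Add(batch); // flush the final partial batch
                queue.CompleteAdding();
            });

            var workers = new Task[Environment.ProcessorCount];
            for (int i = 0; i < workers.Length; i++)
                workers[i] = Task.Run(() =>
                {
                    foreach (var batch in queue.GetConsumingEnumerable())
                        foreach (var line in batch)
                            ParseLine(line);
                });

            Task.WaitAll(workers);
            reader.Wait();
        }
    }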
Answered by graham.reeds
Since the bottleneck will generally be in the processing rather than the reading when dealing with files, I'd go with the producer-consumer pattern. To avoid locking, I'd look at lock-free lists. Since you are using C#, you can take a look at Julian Bucknall's Lock-Free List code.
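Julian Bucknall's articles pre-date the framework support, but modern .NET ships a lock-free queue as System.Collections.Concurrent.ConcurrentQueue<T>. A sketch using it instead of hand-rolled lock-free code (ParseLine and the path are placeholders):

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading;
    using System.Threading.Tasks;

    class LockFreeParser
    {
        static bool _done; // set once by the reader when the file is exhausted

        static void ParseLine(string line) { /* your per-record parsing goes here */ }

        static void Main()
        {
            var queue = new ConcurrentQueue<string>();

            var reader = Task.Run(() =>
            {
                foreach (var line in File.ReadLines("records.txt")) // placeholder path
                    queue.Enqueue(line);                            // lock-free enqueue
                Volatile.Write(ref _done, true);
            });

            var workers = new Task[Environment.ProcessorCount];
            for (int i = 0; i < workers.Length; i++)
                workers[i] = Task.Run(() =>
                {
                    while (true)
                    {
                        if (queue.TryDequeue(out var line))   // lock-free dequeue
                            ParseLine(line);
                        else if (Volatile.Read(ref _done))
                        {
                            // The reader may have enqueued between our failed
                            // TryDequeue and the done check, so drain once more.
                            while (queue.TryDequeue(out var last))
                                ParseLine(last);
                            break;
                        }
                        else
                            Thread.Yield(); // brief pause while the queue is empty
                    }
                });

            Task.WaitAll(workers);
            reader.Wait();
        }
    }

Note that ConcurrentQueue is unbounded, so a fast reader can outrun the workers; when you need the bounded behavior, BlockingCollection (used earlier) wraps a ConcurrentQueue by default.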
Answered by Adisak
If the text that you are parsing is made up of repeated strings and tokens, break the file into chunks, and for each chunk have one thread pre-parse it into tokens consisting of keywords, "punctuation", ID strings, and values. String compares and lookups can be quite expensive, so passing this off to several worker threads can speed up the purely logical/semantic part of the code, since it no longer has to do the string lookups and comparisons.
The pre-parsed data chunks (where you have already done all the string comparisons and "tokenized" the data) can then be passed to the part of the code that actually looks at the semantics and ordering of the tokenized data.
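A loose C# illustration of that two-stage split, with a hypothetical token classification and a string-interning table standing in for whatever the real format needs; the downstream semantic pass then compares small integer ids instead of strings:

    using System;
    using System.Collections.Concurrent;
    using System.Linq;
    using System.Threading;

    enum TokenKind { Keyword, Punctuation, Id, Value }

    // A pre-parsed token: the expensive string work is already done.
    struct Token
    {
        public TokenKind Kind;
        public int SymbolId;
    }

    static class PreParser
    {
        // Interning table: each distinct string gets a small integer id.
        static readonly ConcurrentDictionary<string, int> Symbols =
            new ConcurrentDictionary<string, int>();
        static int _nextId;

        // Run by one worker thread per chunk.
        public static Token[] PreParseChunk(string[] lines) =>
            lines.SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                 .Select(word => new Token
                 {
                     Kind = Classify(word),
                     SymbolId = Symbols.GetOrAdd(word, _ => Interlocked.Increment(ref _nextId))
                 })
                 .ToArray();

        // Hypothetical classification rules; a real format would have its own.
        static TokenKind Classify(string word) =>
            char.IsDigit(word[0]) ? TokenKind.Value
            : word.Length == 1 && char.IsPunctuation(word[0]) ? TokenKind.Punctuation
            : char.IsUpper(word[0]) ? TokenKind.Keyword
            : TokenKind.Id;
    }

Worker threads would run PreParseChunk over chunks in parallel; a single downstream pass then consumes the compact Token arrays in file order.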
Also, you mention you are concerned with the size of your file occupying a large amount of memory. There are a couple things you could do to cut back on your memory budget.
Split the file into chunks and parse them. Read in only as many chunks as you are working on at a time, plus a few for "read ahead", so that you do not stall on disk when you finish processing one chunk before moving to the next.
Alternatively, large files can be memory-mapped and loaded on demand. If you have more threads working on processing the file than CPUs (usually threads = 1.5-2x the CPU count is a good number for demand-paging apps), the threads that stall on IO for the memory-mapped file will be halted automatically by the OS until their memory is ready, and the other threads will continue processing.
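A hedged sketch of the memory-mapped variant using System.IO.MemoryMappedFiles (available since .NET 4); line-boundary handling at slice edges is omitted here, but the peek-before-the-range logic from the earlier sketch applies the same way (the path and ParseLine are placeholders):

    using System;
    using System.IO;
    using System.IO.MemoryMappedFiles;
    using System.Threading.Tasks;

    class MappedParser
    {
        static void ParseLine(string line) { /* your per-record parsing goes here */ }

        static void Main()
        {
            const string path = "records.txt"; // placeholder path
            long fileSize = new FileInfo(path).Length;

            // Oversubscribe the cores so threads blocked on page faults
            // leave others runnable (the 1.5-2x rule of thumb above).
            int threads = Environment.ProcessorCount * 2;

            using var mmf = MemoryMappedFile.CreateFromFile(
                path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);

            Parallel.For(0, threads, n =>
            {
                long start = fileSize * n / threads;
                long length = fileSize * (n + 1) / threads - start;

                // Each thread views only its slice; the OS pages it in on demand.
                using var view = mmf.CreateViewStream(start, length, MemoryMappedFileAccess.Read);
                using var reader = new StreamReader(view);

                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Views are rounded up to page granularity, so the final
                    // slice may be padded with NUL bytes: trim and stop there.
                    int nul = line.IndexOf('\0');
                    if (nul >= 0)
                    {
                        if (nul > 0) ParseLine(line.Substring(0, nul));
                        break;
                    }
                    ParseLine(line);
                }
            });
        }
    }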