Python 并行读取大文件?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/18104481/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
Read large file in parallel?
提问by Anush
I have a large file which I need to read in and make a dictionary from. I would like this to be as fast as possible. However my code in python is too slow. Here is a minimal example that shows the problem.
我有一个大文件,需要把它读进来并据此构建一个字典,而且希望越快越好。但我的 Python 代码太慢了。下面是一个能重现该问题的最小示例。
First make some fake data
先制作一些假数据
paste <(seq 20000000) <(seq 2 20000001) > largefile.txt
Now here is a minimal piece of python code to read it in and make a dictionary.
下面是一段最小的 Python 代码,用来读取它并构建字典。
import sys
from collections import defaultdict
fin = open(sys.argv[1])
dict = defaultdict(list)
for line in fin:
    parts = line.split()
    dict[parts[0]].append(parts[1])
Timings:
计时结果:
time ./read.py largefile.txt
real 0m55.746s
However it is possible to read the whole file much faster as:
但是,可以更快地读取整个文件,如下所示:
time cut -f1 largefile.txt > /dev/null
real 0m1.702s
My CPU has 8 cores, is it possible to parallelize this program in python to speed it up?
我的 CPU 有 8 个内核,是否可以在 python 中并行化这个程序以加快速度?
One possibility might be to read in large chunks of the input and then run 8 processes in parallel on different non-overlapping subchunks making dictionaries in parallel from the data in memory then read in another large chunk. Is this possible in python using multiprocessing somehow?
一种可能的做法是一次读入一大块输入,然后在互不重叠的子块上并行运行 8 个进程,由内存中的数据并行构建字典,之后再读入下一大块。用 Python 的 multiprocessing 能以某种方式实现吗?
Update. The fake data was not very good as it had only one value per key. Better is
更新。假数据不是很好,因为每个键只有一个值。更好的是
perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt
(Related to Read in large file and make dictionary.)
采纳答案by Ecir Hana
There was a blog post series "Wide Finder Project" several years ago about this at Tim Bray's site [1]. You can find there a solution [2] by Fredrik Lundh of ElementTree [3] and PIL [4] fame. I know posting links is generally discouraged on this site, but I think these links give you a better answer than copy-pasting his code.
几年前,Tim Bray 的网站 [1] 上有一篇关于此的博客文章系列“Wide Finder Project”。您可以在那里找到 ElementTree [3] 和 PIL [4] 的 Fredrik Lundh 的解决方案 [2]。我知道在这个网站上通常不鼓励发布链接,但我认为这些链接比复制粘贴他的代码给你更好的答案。
[1] http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results
[2] http://effbot.org/zone/wide-finder.htm
[3] http://docs.python.org/3/library/xml.etree.elementtree.html
[4] http://www.pythonware.com/products/pil/
回答by oyhovd
One thing you could try is to get the line count from the file, then spawn 8 threads that each make a dictionary from 1/8th of the file, then join the dictionaries when all threads are finished. This will probably speed it up if it is the appending that takes time and not the reading of the lines.
可以尝试的一个办法是:先取得文件的行数,然后启动 8 个线程,每个线程用文件的 1/8 构建一个字典,等所有线程结束后再把这些字典合并。如果耗时的是追加操作而不是逐行读取,这样做多半能加快速度。
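A minimal sketch of that idea follows, with two liberties taken: worker processes are used instead of threads (CPython's GIL would serialize the CPU-bound splitting across threads), and the worker count and helper names are my own choices, not the answerer's.

# Sketch only: split the file into N line ranges, build one dictionary per
# worker process, and merge the partial dictionaries at the end.
import sys
from collections import defaultdict
from multiprocessing import Pool

NWORKERS = 8

def build_partial(args):
    path, start, stop = args
    d = defaultdict(list)
    with open(path) as f:
        for i, line in enumerate(f):
            if i < start:
                continue
            if i >= stop:
                break
            parts = line.split()
            d[parts[0]].append(parts[1])
    return d

def main(path):
    with open(path) as f:
        nlines = sum(1 for _ in f)                      # first pass: line count
    step = (nlines + NWORKERS - 1) // NWORKERS
    ranges = [(path, i, min(i + step, nlines)) for i in range(0, nlines, step)]
    merged = defaultdict(list)
    with Pool(NWORKERS) as pool:
        for partial in pool.map(build_partial, ranges):
            for key, values in partial.items():
                merged[key].extend(values)              # join the per-worker dicts
    return merged

if __name__ == "__main__":
    main(sys.argv[1])

Note that each worker still scans the file from the beginning up to its line range, so the reading work is duplicated; seeking to precomputed byte offsets and realigning to the next newline would avoid that, at the cost of extra bookkeeping.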
回答by Useless
It may be possible to parallelize this to speed it up, but doing multiple reads in parallel is unlikely to help.
可能可以并行化以加快速度,但并行执行多个读取不太可能有帮助。
Your OS is unlikely to usefully do multiple reads in parallel (the exception is with something like a striped raid array, in which case you still need to know the stride to make optimal use of it).
您的操作系统不太可能有效地并行执行多次读取(条带化 RAID 阵列之类的情况除外,即便那样,您仍然需要知道条带步长才能最大限度地利用它)。
What you can do, is run the relatively expensive string/dictionary/list operations in parallel to the read.
您可以做的是与读取并行运行相对昂贵的字符串/字典/列表操作。
So, one thread reads and pushes (large) chunks to a synchronized queue, and one or more consumer threads pull chunks from the queue, split them into lines, and populate the dictionary.
因此,由一个线程读取(大)数据块并推入同步队列,再由一个或多个消费者线程从队列中取出数据块,按行拆分并填充字典。
(If you go for multiple consumer threads, as Pappnese says, build one dictionary per thread and then join them).
(如果你使用多个消费者线程,正如 Pappnese 所说,每个线程构建一个字典,最后再把它们合并。)
Hints:
提示:
- ... push chunks to a synchronized queue...
- ... one or more consumer threads...
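A rough sketch of that producer/consumer layout, assuming one reader thread and a single consumer, and using whole batches of lines as queue items so no chunk ever ends mid-line; the batch size and function names are made up for illustration.

# Sketch: one reader thread pushes batches of whole lines onto a bounded
# queue; a consumer pulls batches, splits them, and fills the dictionary.
import sys
import threading
from queue import Queue           # Python 2 would use: from Queue import Queue
from collections import defaultdict

BATCH = 100000                    # lines per queue item (arbitrary)
SENTINEL = None                   # signals "no more data"

def reader(path, q):
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(line)
            if len(batch) >= BATCH:
                q.put(batch)
                batch = []
    if batch:
        q.put(batch)
    q.put(SENTINEL)

def consume(q, d):
    while True:
        batch = q.get()
        if batch is SENTINEL:
            break
        for line in batch:
            parts = line.split()
            d[parts[0]].append(parts[1])

def main(path):
    q = Queue(maxsize=8)          # bounded, so the reader cannot race too far ahead
    d = defaultdict(list)
    t = threading.Thread(target=reader, args=(path, q))
    t.start()
    consume(q, d)                 # consume in the main thread
    t.join()
    return d

if __name__ == "__main__":
    main(sys.argv[1])

Under CPython's GIL this mostly overlaps the file I/O with the splitting work rather than running several splitters truly in parallel; for per-core scaling you would swap the consumer threads for processes, each with its own dictionary, merged at the end as described above.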
Re. bounty:
关于赏金:
C obviously doesn't have the GIL to contend with, so multiple consumers are likely to scale better. The read behaviour doesn't change though. The down side is that C lacks built-in support for hash maps (assuming you still want a Python-style dictionary) and synchronized queues, so you have to either find suitable components or write your own. The basic strategy of multiple consumers each building their own dictionary and then merging them at the end is still likely the best.
C 显然没有 GIL 可抗衡,因此多个消费者可能会更好地扩展。读取行为并没有改变。不利的一面是 C 缺少对哈希映射(假设您仍然需要 Python 样式的字典)和同步队列的内置支持,因此您必须找到合适的组件或编写自己的组件。多个消费者各自构建自己的字典,然后在最后合并它们的基本策略仍然可能是最好的。
Using strtok_r instead of str.split may be faster, but remember you'll need to manage the memory for all your strings manually too. Oh, and you need logic to manage line fragments too. Honestly, C gives you so many options I think you'll just need to profile it and see.
使用 strtok_r 而不是 str.split 可能更快,但请记住,你还需要自己手动管理所有字符串的内存。哦,你还需要处理行片段的逻辑。老实说,C 给了你太多选择,我觉得你只能自己做性能分析再看结果。
回答by lmjohns3
It does seem tempting to think that using a processing pool will solve problems like this, but it's going to end up being a good bit more complicated than that, at least in pure Python.
使用处理池可以解决这样的问题似乎很诱人,但最终会比这复杂得多,至少在纯 Python 中是这样。
Because the OP mentioned that the lists on each input line would be longer in practice than two elements, I made a slightly-more-realistic input file using :
因为 OP 提到每个输入行上的列表实际上会比两个元素长,所以我使用以下方法制作了一个更真实的输入文件:
paste <(seq 20000000) <(seq 2 20000001) <(seq 3 20000002) |
head -1000000 > largefile.txt
After profiling the original code, I found the slowest portion of the process to be the line-splitting routine. (.split() took approximately 2x longer than .append() on my machine.)
在对原始代码做性能分析之后,我发现整个过程中最慢的部分是按行拆分的例程。(在我的机器上,.split() 花费的时间大约是 .append() 的 2 倍。)
1000000 0.333 0.000 0.333 0.000 {method 'split' of 'str' objects}
1000000 0.154 0.000 0.154 0.000 {method 'append' of 'list' objects}
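For reference, a profile of this shape can be produced with cProfile; build_dict below is a hypothetical wrapper around the dictionary-building loop from the question, not a function defined in this answer.

# Profile a hypothetical build_dict() call and print the ten hottest entries
# sorted by the total time spent inside each function.
import cProfile
import pstats

cProfile.run("build_dict('smallfile.txt')", "profile.out")
pstats.Stats("profile.out").sort_stats("tottime").print_stats(10)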
So I factored the split into another function and use a pool to distribute the work of splitting the fields :
所以我将拆分分解为另一个函数,并使用一个池来分配拆分字段的工作:
import sys
import collections
import multiprocessing as mp

d = collections.defaultdict(list)

def split(l):
    return l.split()

pool = mp.Pool(processes=4)
for keys in pool.map(split, open(sys.argv[1])):
    d[keys[0]].append(keys[1:])
Unfortunately, adding the pool slowed things down by more than 2x. The original version looked like this :
不幸的是,添加池使速度减慢了 2 倍以上。原始版本如下所示:
$ time python process.py smallfile.txt
real 0m7.170s
user 0m6.884s
sys 0m0.260s
versus the parallel version :
与并行版本:
$ time python process-mp.py smallfile.txt
real 0m16.655s
user 0m24.688s
sys 0m1.380s
Because the .map() call basically has to serialize (pickle) each input, send it to the remote process, and then deserialize (unpickle) the return value from the remote process, using a pool in this way is much slower. You do get some improvement by adding more cores to the pool, but I'd argue that this is fundamentally the wrong way to distribute this work.
因为 .map() 调用基本上要把每个输入序列化(pickle)、发送到远程进程,再把远程进程的返回值反序列化(unpickle),以这种方式使用进程池要慢得多。向池中添加更多核心确实能带来一些改善,但我认为这从根本上就是分配这项工作的错误方式。
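One small variation worth trying (not something benchmarked in this answer) is to stream the lines through imap_unordered with an explicit chunksize, so that many lines travel inside each pickled message; this only amortizes the per-message overhead, it does not remove the serialization cost described above. Using the same pool, split and d as in the snippet above:

# Batch 10000 lines into each task and accept results in completion order;
# the serialization still happens, just in far fewer messages.
with open(sys.argv[1]) as fin:
    for keys in pool.imap_unordered(split, fin, chunksize=10000):
        d[keys[0]].append(keys[1:])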
To really speed this up across cores, my guess is that you'd need to read in large chunks of the input using some sort of fixed block size. Then you could send the entire block to a worker process and get serialized lists back (though it's still unknown how much the deserialization here will cost you). Reading the input in fixed-size blocks sounds like it might be tricky with the anticipated input, however, since my guess is that each line isn't necessarily the same length.
为了真正利用多核加速,我猜你需要用某种固定的块大小一次读入一大块输入,然后把整块发给一个工作进程并取回序列化的列表(不过这里的反序列化开销有多大仍是未知数)。然而,以固定大小的块读取输入,对这里预期的输入来说可能比较棘手,因为我猜每一行的长度不一定相同。
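A sketch of that block-oriented strategy is below; it sidesteps the uneven-line-length problem by reading a fixed number of bytes and then trimming the block back to the last newline, carrying the partial line over into the next block. The block size, worker count, and helper names are arbitrary choices, not anything measured in the answer.

# Sketch: read the input in large, newline-aligned blocks, hand whole blocks
# to worker processes, and merge the per-block dictionaries in the parent.
import sys
from collections import defaultdict
from multiprocessing import Pool

BLOCK = 64 * 1024 * 1024     # 64 MiB per block (an arbitrary choice)

def read_blocks(path):
    with open(path) as f:
        leftover = ""
        while True:
            data = f.read(BLOCK)
            if not data:
                break
            data = leftover + data
            # keep any trailing partial line for the next block
            data, _, leftover = data.rpartition("\n")
            if data:
                yield data
        if leftover:
            yield leftover

def build_block_dict(block):
    d = defaultdict(list)
    for line in block.splitlines():
        parts = line.split()
        d[parts[0]].append(parts[1])
    return d

def main(path):
    merged = defaultdict(list)
    with Pool(processes=8) as pool:
        for partial in pool.imap_unordered(build_block_dict, read_blocks(path)):
            for key, values in partial.items():
                merged[key].extend(values)
    return merged

if __name__ == "__main__":
    main(sys.argv[1])

The blocks themselves still get pickled on their way to the workers, so whether this wins depends on how expensive the splitting and dictionary-building are relative to that copy.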
回答by Mikhail M
A more radical solution to slow dictionary appends: replace the dictionary with an array of string pairs. Fill it and then sort.
针对字典追加过慢的一个更彻底的解决方案:用字符串对组成的数组替换字典,先填充,再排序。
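A small sketch of what that can look like in Python; grouping the sorted pairs with itertools.groupby afterwards is my addition rather than something spelled out in the answer.

# Sketch: collect (key, value) pairs in a flat list, sort once, then walk
# the sorted pairs grouping consecutive entries that share a key.
import sys
from itertools import groupby
from operator import itemgetter

pairs = []
with open(sys.argv[1]) as fin:
    for line in fin:
        parts = line.split()
        pairs.append((parts[0], parts[1]))

pairs.sort(key=itemgetter(0))       # one big sort instead of millions of list appends

for key, group in groupby(pairs, key=itemgetter(0)):
    values = [v for _, v in group]  # all values for this key (the sort is stable)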
回答by Barış ÇUHADAR
If the data in the file does not change very often, you can choose to serialize it. The Python interpreter will deserialize it much more quickly. You can use the cPickle module.
如果文件中的数据不经常变化,可以选择把它序列化一次,Python 解释器反序列化会快得多。可以使用 cPickle 模块。
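A minimal sketch of that idea follows; cPickle is the Python 2 name, and the same interface lives in the pickle module on Python 3. The file name here is made up.

# Build the dictionary once, dump it to disk, and reload it on later runs.
import pickle                      # on Python 2: import cPickle as pickle
from collections import defaultdict

def save_dict(d, path="largefile.dict.pkl"):
    with open(path, "wb") as f:
        pickle.dump(dict(d), f, protocol=pickle.HIGHEST_PROTOCOL)

def load_dict(path="largefile.dict.pkl"):
    with open(path, "rb") as f:
        return defaultdict(list, pickle.load(f))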
Or creating 8 separate processes is another option, because having only one dict makes it much more possible. You can interact between those processes via a Pipe in the "multiprocessing" module, or via the "socket" module.
或者,创建 8 个单独的进程是另一种选择,因为只使用一个 dict 让这种做法更可行。您可以通过 "multiprocessing" 模块中的 Pipe,或者 "socket" 模块,在这些进程之间通信。
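For the second suggestion, here is a bare-bones illustration (with made-up sample lines) of a worker handing its partial dictionary back to the parent over a multiprocessing Pipe; splitting the real input into per-worker line ranges or blocks would work as in the sketches above.

# One worker builds a partial dict from the lines it was given and sends it
# back through its end of the Pipe; the parent receives it and could merge
# several such partial dicts.
from multiprocessing import Process, Pipe
from collections import defaultdict

def worker(conn, lines):
    d = defaultdict(list)
    for line in lines:
        parts = line.split()
        d[parts[0]].append(parts[1])
    conn.send(dict(d))             # the partial result is pickled over the pipe
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn, ["1 2\n", "1 3\n", "4 5\n"]))
    p.start()
    partial = parent_conn.recv()   # {'1': ['2', '3'], '4': ['5']}
    p.join()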
Best regards
此致
Barış ÇUHADAR.
Barış ÇUHADAR。