Process very large (>20GB) text file line by line

Note: This content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/16669428/

Tags: python, line

Asked by Tom_b

I have a number of very large text files which I need to process, the largest being about 60GB.

Each line has 54 characters in seven fields and I want to remove the last three characters from each of the first three fields - which should reduce the file size by about 20%.

I am brand new to Python and have code which will do what I want to do at about 3.4 GB per hour, but to be a worthwhile exercise I really need to be getting at least 10 GB/hr - is there any way to speed this up? This code doesn't come close to challenging my processor, so I am making an uneducated guess that it is limited by the read and write speed of the internal hard drive?

def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath", "w")
    l = r.readline()
    while l:
        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
        w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
        l = r.readline()
    r.close()
    w.close()

Any help would be really appreciated. I am using the IDLE Python GUI on Windows 7 and have 16GB of memory - perhaps a different OS would be more efficient?

Edit: Here is an extract of the file to be processed.

70700.642014 31207.277115 -0.054123 -1585 255 255 255
70512.301468 31227.990799 -0.255600 -1655 155 158 158
70515.727097 31223.828659 -0.066727 -1734 191 187 180
70566.756699 31217.065598 -0.205673 -1727 254 255 255
70566.695938 31218.030807 -0.047928 -1689 249 251 249
70536.117874 31227.837662 -0.033096 -1548 251 252 252
70536.773270 31212.970322 -0.115891 -1434 155 158 163
70533.530777 31215.270828 -0.154770 -1550 148 152 156
70533.555923 31215.341599 -0.138809 -1480 150 154 158

Answered by Muetze

You can try saving the result of the split the first time you do it, rather than doing it every time you need a field. Maybe this will speed it up.

You can also try not running it in the GUI. Run it in cmd instead.

Answered by Janne Karila

Read the file using for l in r: to benefit from buffering.
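
For example, a minimal sketch of that loop (the file names are placeholders):

def ProcessLargeTextFile():
    # Iterating over the file object yields one line at a time and uses
    # Python's internal read buffering, so no explicit readline() calls are needed.
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for l in r:
            w.write(l)  # replace this with the per-line processing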

Answered by John La Rooy

It's more idiomatic to write your code like this:

def ProcessLargeTextFile():
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            w.write(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))

The main saving here is to just do the split once, but if the CPU is not being taxed, this is likely to make very little difference.

It may help to save up a few thousand lines at a time and write them in one hit to reduce thrashing of your hard drive. A million lines is only 54 MB of RAM!

def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            bunch.append(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)

As suggested by @Janne, an alternative way to generate the lines:

def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)

Answered by craastad

Those seem like very large files... Why are they so large? What processing are you doing per line? Why not use a database with some map-reduce calls (if appropriate) or simple operations on the data? The point of a database is to abstract the handling and management of large amounts of data that can't all fit in memory.

You can start to play with the idea with sqlite3, which just uses flat files as databases. If you find the idea useful, then upgrade to something a little more robust and versatile like postgresql.

Create a database

 import sqlite3

 conn = sqlite3.connect('pts.db')
 c = conn.cursor()

Create a table

c.execute('''CREATE TABLE ptsdata (filename, line, x, y, z)''')

Then use one of the algorithms above to insert all the lines and points into the database by calling

c.execute("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)", (filename, lineNumber, x, y, z))

Now how you use it depends on what you want to do. For example, to work with all the points in a file, do a query:

c.execute("SELECT line, x, y, z FROM ptsdata WHERE filename='file.txt' ORDER BY line ASC")

And get n lines at a time from this query with

c.fetchmany(size=n)
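
Putting the pieces together, a rough end-to-end sketch might look like this (the table layout, batch size, and file name are only assumptions for illustration):

import sqlite3

conn = sqlite3.connect('pts.db')
c = conn.cursor()
c.execute("CREATE TABLE IF NOT EXISTS ptsdata (filename, line, x, y, z)")

# Load one big text file in batches using parameterized inserts
with open("file.txt") as f:
    batch = []
    for line_number, text in enumerate(f, 1):
        x, y, z = text.split(' ')[:3]
        batch.append(("file.txt", line_number, x, y, z))
        if len(batch) == 100000:
            c.executemany("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)", batch)
            batch = []
    c.executemany("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)", batch)
conn.commit()

# Read the points back n rows at a time
c.execute("SELECT line, x, y, z FROM ptsdata WHERE filename=? ORDER BY line ASC",
          ("file.txt",))
while True:
    rows = c.fetchmany(size=10000)
    if not rows:
        break
    # process rows here
conn.close()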

I'm sure there is a better wrapper for the SQL statements somewhere, but you get the idea.

Answered by msw

Your code is rather un-idiomatic and makes far more function calls than needed. A simpler version is:

def ProcessLargeTextFile():
    with open("filepath") as r, open("output", "w") as w:
        for line in r:
            fields = line.split(' ')
            # Trim the last three characters of the first three fields
            fields[0:3] = [fields[0][:-3],
                           fields[1][:-3],
                           fields[2][:-3]]
            w.write(' '.join(fields))

and I don't know of a modern filesystem that is slower than Windows. Since it appears you are using these huge data files as databases, have you considered using a real database?

Finally, if you are just interested in reducing file size, have you considered compressing / zipping the files?

Answered by seanhodges

def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath", "w")
    l = r.readline()
    while l:

As has been suggested already, you may want to use a for loop to make this more efficient.

    x = l.split(' ')[0]
    y = l.split(' ')[1]
    z = l.split(' ')[2]

You are performing the split operation three times here; depending on the size of each line, this will have a detrimental impact on performance. You should split once and assign x, y, z from the entries of the array that comes back.

    w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))

You are writing every line to the file immediately after reading it, which is very I/O intensive. You should consider buffering your output in memory and pushing it to disk periodically. Something like this:

BUFFER_SIZE_LINES = 1024 # Maximum number of lines to buffer in memory

def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("outfilepath", "w")  # write to a different file than the one being read
    buf = ""
    bufLines = 0
    for lineIn in r:
        x, y, z = lineIn.split(' ')[:3]
        lineOut = lineIn.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3])
        buf += lineOut  # lineIn already ends with a newline, so lineOut does too
        bufLines += 1

        if bufLines >= BUFFER_SIZE_LINES:
            # Flush buffer to disk
            w.write(buf)
            buf = ""
            bufLines = 0

    # Flush remaining buffer to disk
    w.write(buf)
    r.close()
    w.close()

You can tweak BUFFER_SIZE_LINES to find an optimal balance between memory usage and speed.

Answered by Achim

Measure! You have been given quite a few useful hints on how to improve your Python code, and I agree with them. But you should first figure out what your real problem is. My first steps to find your bottleneck would be:

  • Remove any processing from your code. Just read and write the data and measure the speed (see the sketch after this list). If just reading and writing the files is too slow, it's not a problem of your code.
  • If just reading and writing is already slow, try to use multiple disks. You are reading and writing at the same time - on the same disk? If yes, try different disks and try again.
  • Some async I/O library (Twisted?) might help too.
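
For the first step, a minimal timing sketch along these lines (paths are placeholders) shows the raw copy throughput without any processing:

import time

def measure_copy(inpath, outpath, chunk_size=16 * 1024 * 1024):
    # Copy the file in large chunks with no per-line processing and report MB/s.
    start = time.time()
    copied = 0
    with open(inpath, "rb") as r, open(outpath, "wb") as w:
        while True:
            chunk = r.read(chunk_size)
            if not chunk:
                break
            w.write(chunk)
            copied += len(chunk)
    elapsed = time.time() - start
    print("%.1f MB in %.1f s -> %.1f MB/s" % (copied / 1e6, elapsed, copied / 1e6 / elapsed))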

Once you have figured out the exact problem, ask again for optimizations of that problem.

Answered by Didier Trosset

As you don't seem to be limited by CPU, but rather by I/O, have you tried some variations on the third parameter of open?

Indeed, this third parameter can be used to set the buffer size used for file operations!

Simply writing open("filepath", "r", 16777216) will use 16 MB buffers when reading from the file. It should help.

Use the same for the output file, and measure/compare with an identical file for the rest.
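
A sketch of what that might look like with 16 MB buffers on both files (the output path and buffer size are just placeholders to experiment with):

def ProcessLargeTextFile():
    # The third argument to open() sets the buffer size: 16 MB here for both files.
    with open("filepath", "r", 16777216) as r, open("outfilepath", "w", 16777216) as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            w.write(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))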

Note: This is the same kind of optimization suggested by others, but you get it here for free, without changing your code and without having to buffer yourself.

Answered by jthill

Since you only mention saving space as a benefit, is there some reason you can't just store the files gzipped? That should save 70% and up on this data. Or consider getting NTFS to compress the files if random access is still important. You'll get much more dramatic savings on I/O time after either of those.
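
If gzip is acceptable, Python can read and write the compressed files directly; a minimal Python 3 sketch (file names are placeholders):

import gzip

def ProcessLargeTextFile():
    # Read and write gzip-compressed text directly, so the disk only ever
    # sees the compressed data.
    with gzip.open("input.txt.gz", "rt") as r, gzip.open("output.txt.gz", "wt") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            w.write(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))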

More importantly, where is your data stored such that you're getting only 3.4 GB/hr? That's down around USBv1 speeds.

Answered by Gene

I'll add this answer to explain why buffering makes sense, and also to offer one more solution.

You are getting breathtakingly bad performance. The article "Is it possible to speed-up python IO?" shows that a 10 GB read should take in the neighborhood of 3 minutes. Sequential write is the same speed. So you're missing a factor of 30, and your performance target is still 10 times slower than what ought to be possible.

Almost certainly this kind of disparity lies in the number of head seeks the disk is doing. A head seek takes milliseconds. A single seek corresponds to several megabytes of sequential read-write. Enormously expensive. Copy operations on the same disk require seeking between input and output. As has been stated, one way to reduce seeks is to buffer in such a way that many megabytes are read before writing to disk, and vice versa. If you can convince the Python I/O system to do this, great. Otherwise you can read and process lines into a string array and then write after perhaps 50 MB of output are ready. This size means a seek will induce a <10% performance hit with respect to the data transfer itself.
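
A sketch of that approach, flushing roughly every 50 MB of prepared output (the threshold and file names are placeholders):

def ProcessLargeTextFile():
    flush_bytes = 50 * 1024 * 1024   # write out roughly every 50 MB
    out, out_size = [], 0
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            processed = line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3])
            out.append(processed)
            out_size += len(processed)
            if out_size >= flush_bytes:
                # One large sequential write keeps head seeks to a small
                # fraction of the total transfer time.
                w.write(''.join(out))
                out, out_size = [], 0
        w.write(''.join(out))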

The other very simple way to eliminate seeks between input and output files altogether is to use a machine with two physical disks and fully separate I/O channels for each. Input from one, output to the other. If you're doing lots of big file transformations, it's good to have a machine with this feature.