Python: reading a file in chunks - RAM usage, reading strings from binary files
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/17056382/
Read file in chunks - RAM-usage, read Strings from binary files
Asked by xph
I'd like to understand the difference in RAM usage between these methods when reading a large file in Python.
Version 1, found here on stackoverflow:
def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open(file, 'rb')
for piece in read_in_chunks(f):
    process_data(piece)
f.close()
Version 2, which I used before I found the code above:
f = open(file, 'rb')
while True:
    piece = f.read(1024)
    process_data(piece)
f.close()
The file is read partially in both versions, and the current piece can be processed. In the second example, piece gets new content on every cycle, so I thought this would do the job of not loading the complete file into memory?
But I don't really understand what yield does, and I'm pretty sure I got something wrong here. Could anyone explain that to me?
There is something else that puzzles me, besides the method used:
The content of the piece I read is defined by the chunk size, 1 KB in the examples above. But... what if I need to look for strings in the file? Something like "ThisIsTheStringILikeToFind"?
Depending on where in the file the string occurs, one piece could contain the part "ThisIsTheStr" while the next piece contains "ingILikeToFind". With such a method it is not possible to detect the whole string in any single piece.
Is there a way to read a file in chunks but still account for such strings?
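For illustration only (this sketch is not part of the original question or answers): one common way to handle a match that straddles a chunk boundary is to carry the last len(pattern) - 1 bytes of each chunk over into the next read, so a split match still appears in full. The file name and search pattern below are hypothetical.

pattern = b'ThisIsTheStringILikeToFind'       # hypothetical search string
overlap = b''                                 # tail carried over from the previous chunk
found = False
with open('testfile.dat', 'rb') as fh:        # hypothetical file name
    while True:
        chunk = fh.read(1024)
        if not chunk:
            break
        window = overlap + chunk              # a boundary-spanning match shows up here
        if pattern in window:
            found = True
            break
        overlap = window[-(len(pattern) - 1):]   # keep just enough bytes to complete a split match
print(found)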
Any help or ideas are welcome, greets!
Accepted answer by AJMansfield
yield is the keyword in Python used to define generator functions. It means that the next time the function is called (or iterated on), execution resumes at the exact point where it left off the last time. The two versions behave identically; the only difference is that the first one uses a tiny bit more call-stack space than the second. However, the first one is far more reusable, so from a program-design standpoint it is actually better.
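As a minimal illustration (my own sketch, not part of the original answer), a generator function pauses at yield and picks up from the same spot on the next iteration:

def count_up_to(n):
    i = 0
    while i < n:
        yield i        # execution pauses here and resumes on the next iteration
        i += 1

for value in count_up_to(3):
    print(value)       # prints 0, then 1, then 2 -- one value per resumption

The read_in_chunks() generator above works the same way: each iteration of the for loop resumes it just after yield data, so only one chunk is held in memory at a time.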
EDIT: Also, one other difference is that the first one stops reading once all the data has been read, as it should, but the second one only stops once either f.read() or process_data() throws an exception. To make the second one work properly, you need to modify it like so:
f = open(file, 'rb')
while True:
    piece = f.read(1024)
    if not piece:
        break
    process_data(piece)
f.close()
Answered by martineau
I think probably the best and most idiomatic way to do this would be to use the built-in iter() function with a sentinel value to create and use an iterable, as shown below. (iter(callable, sentinel) calls the callable repeatedly until it returns the sentinel; file.read() returns b'' at end of file, so the loop ends there.) Note that the last chunk may be smaller than the requested chunk size if the file size is not an exact multiple of it.
from functools import partial

CHUNK_SIZE = 1024
filename = 'testfile.dat'

with open(filename, 'rb') as file:
    for chunk in iter(partial(file.read, CHUNK_SIZE), b''):
        process_data(chunk)

