Upper memory limit?

Warning: the content below is from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4285185/

Upper memory limit?

Tags: python, memory

Asked by Harpal

Is there a limit to memory for Python? I've been using a Python script to calculate the average values from a file which is a minimum of 150 MB big.

Depending on the size of the file I sometimes encounter a MemoryError.

Can more memory be assigned to Python so I don't encounter the error?


EDIT: Code now below

NOTE: The file sizes can vary greatly (up to 20 GB); the minimum size of a file is 150 MB.

file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")

files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]

for u in files:
    line = u.readlines()
    list_of_lines = []
    for i in line:
        values = i.split('\t')
        list_of_lines.append(values)

    count = 0
    for j in list_of_lines:
        count +=1

    for k in range(0,count):
        list_of_lines[k].remove('\n')

    length = len(list_of_lines[0])
    print_counter = 4

    for o in range(0,length):
        total = 0
        for p in range(0,count):
            number = float(list_of_lines[p][o])
            total = total + number
        average = total/count
        print average
        if print_counter == 4:
            file_write.write(str(average)+'\n')
            print_counter = 0
        print_counter +=1
file_write.write('\n')

Accepted answer by martineau

(This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second—hopefully three's a charm.)

Edits: Since this seems to be a popular answer, I've made a few modifications to improve its implementation over the years—most not too major. This is so that if folks use it as a template, it will provide an even better basis.

As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.

Python's memory limits are determined by how much physical RAM and virtual memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using it may be impractical because it takes too long.

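If you want to see what those operating-system limits actually are for your process, the standard resource module exposes them on Unix-like systems. The following is only a minimal, illustrative sketch (the values reported, and whether lowering the soft limit is worthwhile, depend entirely on your OS):

import resource

# Query the current soft/hard limits on this process's address space.
# resource.RLIM_INFINITY (-1) means "unlimited" as far as the OS is concerned.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print('soft limit: %s, hard limit: %s' % (soft, hard))

# A process may lower its own soft limit (e.g. to make a runaway script fail
# early instead of thrashing), but it cannot raise it above the hard limit:
# resource.setrlimit(resource.RLIMIT_AS, (2 * 1024**3, hard))  # cap at ~2 GiB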

Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.

To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total value by the count of total lines read. Once that is done, these averages can be printed out and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.

try:
    from itertools import izip_longest
except ImportError:    # Python 3
    from itertools import zip_longest as izip_longest

GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                    "A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w')  # left in, but nothing written

for file_name in input_file_names:
    with open(file_name, 'r') as input_file:
        print('processing file: {}'.format(file_name))

        totals = []
        for count, fields in enumerate((line.split('\t') for line in input_file), 1):
            totals = [sum(values) for values in
                        izip_longest(totals, map(float, fields), fillvalue=0)]
        averages = [total/count for total in totals]

        for print_counter, average in enumerate(averages):
            print('  {:9.4f}'.format(average))
            if print_counter % GROUP_SIZE == 0:
                file_write.write(str(average)+'\n')

file_write.write('\n')
file_write.close()
mutation_average.close()

Answered by Pär Wieslander

No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.

In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.

Edit:

Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.

A better approach would be to read the files one line at a time:

for u in files:
    for line in u:  # This will iterate over each line in the file
        # Read values from the line, do the necessary calculations

Answered by Tim Pietzcker

You're reading the entire file into memory (line = u.readlines()) which will fail of course if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.

Better iterate over each line:

for current_line in u:
    do_something_with(current_line)

is the recommended approach.

Later in your script, you're doing some very strange things like first counting all the items in a list, then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much more easily.

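For example, instead of building list_of_lines, counting its items by hand and then indexing into it with range(), the per-file pass can iterate directly over the file. A rough sketch (it sums only the first column, purely for illustration):

total = 0.0
row_count = 0
with open("A1_B1_100000.txt", "r") as input_file:
    for line in input_file:           # iterate directly: no readlines(), no manual counting
        fields = line.split('\t')
        if not fields[0].strip():     # skip blank lines
            continue
        total += float(fields[0])     # first column only, as an example
        row_count += 1
print(total / row_count)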

This is one of the advantages of high-level languages like Python (as opposed to C where you do have to do these housekeeping tasks yourself): Allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.

Also, as it seems that you're processing TSV files (tabulator-separated values), you should take a look at the csv module, which will handle all the splitting, removing of \n's, etc. for you.

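A minimal sketch of what that could look like (the filename and the tab delimiter are assumptions based on the question):

import csv

with open("A1_B1_100000.txt", "r") as input_file:
    reader = csv.reader(input_file, delimiter='\t')
    for fields in reader:             # fields is a list of strings with no trailing '\n' to strip
        numbers = [float(value) for value in fields if value.strip()]
        # ... accumulate running totals per column here ...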

Answered by Michał Niklas

Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about

1959167 [MiB]

On Jython 2.5 it crashes earlier:

 239000 [MiB]

Probably I can configure Jython to use more memory (it uses limits from the JVM).

Test app:

import sys

sl = []
i = 0
# some magic 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
    fill_size = 1003
if sys.version.startswith('3'):
    fill_size = 497
print(fill_size)
MiB = 0
while True:
    s = str(i).zfill(fill_size)
    sl.append(s)
    if i == 0:
        try:
            sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
        except AttributeError:
            pass
    i += 1
    if i % 1024 == 0:
        MiB += 1
        if MiB % 25 == 0:
            sys.stderr.write('%d [MiB]\n' % (MiB))


In your app you read the whole file at once. For such big files you should read it line by line.

Answered by John Machin

Not only are you reading the whole of each file into memory, but also you laboriously replicate the information in a table called list_of_lines.

You have a secondary problem: your choices of variable names severely obfuscate what you are doing.

Here is your script rewritten with the readlines() caper removed and with meaningful names:

file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w") # not used
files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]
for afile in files:
    table = []
    for aline in afile:
        values = aline.split('\t')
        values.remove('\n') # why?
        table.append(values)
    row_count = len(table)
    row0length = len(table[0])
    print_counter = 4
    for column_index in range(row0length):
        column_total = 0
        for row_index in range(row_count):
            number = float(table[row_index][column_index])
            column_total = column_total + number
        column_average = column_total/row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter +=1
file_write.write('\n')

It rapidly becomes apparent that (1) you are calculating column averages and (2) the obfuscation led some others to think you were calculating row averages.

As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.

Here is a revised version of the outer loop code:

for afile in files:
    for row_count, aline in enumerate(afile, start=1):
        values = aline.split('\t')
        values.remove('\n') # why?
        fvalues = map(float, values)
        if row_count == 1:
            row0length = len(fvalues)
            column_index_range = range(row0length)
            column_totals = fvalues
        else:
            assert len(fvalues) == row0length
            for column_index in column_index_range:
                column_totals[column_index] += fvalues[column_index]
    print_counter = 4
    for column_index in column_index_range:
        column_average = column_totals[column_index] / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter +=1