Why is iterating through a large Django QuerySet consuming massive amounts of memory?

Warning: the content below is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/4222176/



Tags: sql, django, postgresql, django-orm

Asked by davidchambers

The table in question contains roughly ten million rows.


for event in Event.objects.all():
    print event

This causes memory usage to increase steadily to 4 GB or so, at which point the rows print rapidly. The lengthy delay before the first row printed surprised me – I expected it to print almost instantly.


I also tried Event.objects.iterator(), which behaved the same way.


I don't understand what Django is loading into memory or why it is doing this. I expected Django to iterate through the results at the database level, which'd mean the results would be printed at roughly a constant rate (rather than all at once after a lengthy wait).


What have I misunderstood?


(I don't know whether it's relevant, but I'm using PostgreSQL.)


Accepted answer by eternicode

Nate C was close, but not quite.


From the docs:


You can evaluate a QuerySet in the following ways:

  • Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:

    for e in Entry.objects.all():
        print e.headline
    


So your ten million rows are retrieved, all at once, when you first enter that loop and get the iterating form of the queryset. The wait you experience is Django loading the database rows and creating objects for each one, before returning something you can actually iterate over. Then you have everything in memory, and the results come spilling out.


From my reading of the docs, iterator() does nothing more than bypass the QuerySet's internal caching mechanisms. I think it might make sense for it to do things one by one, but that would conversely require ten million individual hits on your database. Maybe not all that desirable.


Iterating over large datasets efficiently is something we still haven't gotten quite right, but there are some snippets out there that you might find useful for your purposes.

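For reference, newer Django releases address this more directly: iterator() streams results with a server-side cursor on PostgreSQL (since roughly Django 1.11), and accepts a chunk_size argument from Django 2.0 onward. A minimal sketch, assuming a recent Django and Python 3 and the Event model from the question:

# Minimal sketch, assuming Django >= 2.0 on PostgreSQL.
# iterator(chunk_size=...) streams rows in batches via a server-side cursor
# instead of materialising all ten million rows in memory at once.
for event in Event.objects.all().iterator(chunk_size=2000):
    print(event)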

Answered by mpaf

It might not be the fastest or most efficient, but as a ready-made solution why not use Django core's Paginator and Page objects, documented here:


https://docs.djangoproject.com/en/dev/topics/pagination/


Something like this:


from django.core.paginator import Paginator
from djangoapp.models import model

paginator = Paginator(model.objects.all(), 1000)  # chunks of 1000; change this to the desired chunk size

for page in range(1, paginator.num_pages + 1):
    for row in paginator.page(page).object_list:
        pass  # here you can do whatever you want with the row
    print "done processing page %s" % page

Answered by Luke Moore

Django's default behavior is to cache the whole result of the QuerySet when it evaluates the query. You can use the QuerySet's iterator method to avoid this caching:


for event in Event.objects.all().iterator():
    print event

https://docs.djangoproject.com/en/dev/ref/models/querysets/#iterator


The iterator() method evaluates the queryset and then reads the results directly without doing caching at the QuerySet level. This method results in better performance and a significant reduction in memory when iterating over a large number of objects that you only need to access once. Note that caching is still done at the database level.


Using iterator() reduces memory usage for me, but it is still higher than I expected. Using the paginator approach suggested by mpaf uses much less memory, but is 2-3x slower for my test case.


from django.core.paginator import Paginator

def chunked_iterator(queryset, chunk_size=10000):
    paginator = Paginator(queryset, chunk_size)
    for page in range(1, paginator.num_pages + 1):
        for obj in paginator.page(page).object_list:
            yield obj

for event in chunked_iterator(Event.objects.all()):
    print event

Answered by nate c

This is from the docs: http://docs.djangoproject.com/en/dev/ref/models/querysets/


No database activity actually occurs until you do something to evaluate the queryset.


So when print event is run, the query fires (a full table scan, given your command) and loads the results. You're asking for all the objects, and there is no way to get the first object without getting all of them.


But if you do something like:


Event.objects.all()[300:900]

http://docs.djangoproject.com/en/dev/topics/db/queries/#limiting-querysets


Then it will add OFFSET and LIMIT to the SQL internally.

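As a rough sketch (not from the original answer) of what that slicing looks like when used in a loop, assuming the Event model from the question and an arbitrary chunk size:

# Hedged sketch: manual chunking via queryset slicing, which Django translates
# into LIMIT/OFFSET clauses. CHUNK is an arbitrary batch size for illustration.
CHUNK = 1000
qs = Event.objects.order_by('pk')  # a stable ordering keeps slices consistent
offset = 0
while True:
    batch = list(qs[offset:offset + CHUNK])  # one LIMIT/OFFSET query per batch
    if not batch:
        break
    for event in batch:
        print(event)
    offset += CHUNK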

Answered by Frank Heikens

For large numbers of records, a database cursor performs even better. You do need raw SQL in Django; the Django cursor is something different from a SQL cursor.


The LIMIT - OFFSET method suggested by Nate C might be good enough for your situation. For large amounts of data it is slower than a cursor because it has to run the same query over and over again and has to jump over more and more results.

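A hedged sketch of the cursor idea on PostgreSQL, using a psycopg2 named (server-side) cursor directly and bypassing the ORM; the connection settings and the myapp_event table name are assumptions you would adapt to your own schema:

# Sketch only: a named cursor in psycopg2 is a server-side cursor, so rows are
# fetched from PostgreSQL in batches of itersize instead of all at once.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # hypothetical settings
cur = conn.cursor(name="event_stream")  # naming the cursor makes it server-side
cur.itersize = 2000                     # rows pulled per round trip
cur.execute("SELECT * FROM myapp_event")  # hypothetical table name
for row in cur:
    print(row)
cur.close()
conn.close()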

Answered by Kracekumar

Django doesn't have a good solution for fetching large numbers of items from the database.


import gc
# Get the events in reverse order
eids = Event.objects.order_by("-id").values_list("id", flat=True)

for index, eid in enumerate(eids):
    event = Event.objects.get(id=eid)
    # do necessary work with event
    if index % 100 == 0:
       gc.collect()
       print("completed 100 items")

values_list can be used to fetch all the ids in the database and then fetch each object separately. Over time, large objects will be created in memory and won't be garbage collected until the for loop exits. The above code does manual garbage collection after every 100th item is consumed.

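If one query per id turns out to be too slow, a variant (not from the original answer) is to fetch the ids in slices and load each slice with in_bulk(); a rough sketch, with an arbitrary slice size of 500:

# Hedged sketch: trade one-query-per-object for one query per 500-id slice.
ids = list(Event.objects.order_by("-id").values_list("id", flat=True))
for i in range(0, len(ids), 500):
    # in_bulk() returns an {id: object} mapping for the given ids
    for event in Event.objects.in_bulk(ids[i:i + 500]).values():
        pass  # do necessary work with event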

Answered by line break

Because that way the objects for the whole queryset get loaded into memory all at once. You need to chunk your queryset up into smaller, digestible bits. The pattern for doing this is called spoonfeeding. Here's a brief implementation.


def spoonfeed(qs, func, chunk=1000, start=0):
    ''' Chunk up a large queryset and run func on each item.

    Works with automatic primary key fields.

    chunk -- how many objects to take on at once
    start -- PK to start from

    >>> spoonfeed(Spam.objects.all(), nom_nom)
    '''
    while start < qs.order_by('pk').last().pk:
        for o in qs.filter(pk__gt=start, pk__lte=start + chunk):
            yield func(o)
        start += chunk

To use this you write a function that does operations on your object:


def set_population_density(town):
    town.population_density = calculate_population_density(...)
    town.save()

and then run that function on your queryset:


spoonfeed(Town.objects.all(), set_population_density)

This can be further improved with multiprocessing to execute func on multiple objects in parallel.

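A hedged sketch of that multiprocessing idea, assuming a fork-based start method (Linux) and an already-configured Django settings module (for example inside a management command); note that the ORM's database connections must not be shared across forked workers, so they are closed before the pool starts:

# Sketch only: parallelise func over primary keys with a process pool.
from multiprocessing import Pool

from django.db import connections

def process_town(pk):
    # each worker lazily opens its own database connection on first query
    set_population_density(Town.objects.get(pk=pk))

pks = list(Town.objects.values_list("pk", flat=True))
connections.close_all()  # do not share open connections with forked workers
with Pool(processes=4) as pool:
    pool.map(process_town, pks, chunksize=1000)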

Answered by danius

Here is a solution that includes len and count:


class GeneratorWithLen(object):
    """
    Generator that includes len and count for given queryset
    """
    def __init__(self, generator, length):
        self.generator = generator
        self.length = length

    def __len__(self):
        return self.length

    def __iter__(self):
        return self.generator

    def __getitem__(self, item):
        return self.generator.__getitem__(item)

    def next(self):
        return next(self.generator)

    def count(self):
        return self.__len__()

def batch(queryset, batch_size=1024):
    """
    returns a generator that does not cache results on the QuerySet
    Aimed to use with expected HUGE/ENORMOUS data sets, no caching, no memory used more than batch_size

    :param batch_size: Size for the maximum chunk of data in memory
    :return: generator
    """
    total = queryset.count()

    def batch_qs(_qs, _batch_size=batch_size):
        """
        Returns a (start, end, total, queryset) tuple for each batch in the given
        queryset.
        """
        for start in range(0, total, _batch_size):
            end = min(start + _batch_size, total)
            yield (start, end, total, _qs[start:end])

    def generate_items():
        # order_by() returns a new queryset rather than modifying in place, so
        # assign the result; this clears the ordering, as the original comment
        # intends for an autoincremental PK
        unordered = queryset.order_by()
        for start, end, total, qs in batch_qs(unordered):
            for item in qs:
                yield item

    return GeneratorWithLen(generate_items(), total)

Usage:


events = batch(Event.objects.all())
len(events) == events.count()
for event in events:
    pass  # Do something with the Event

Answered by Tho

I usually use a raw MySQL query instead of the Django ORM for this kind of task.


MySQL supports a streaming mode, so we can loop through all records safely and quickly without running out of memory.


import MySQLdb
db_config = {}  # config your db here
connection = MySQLdb.connect(
        host=db_config['HOST'], user=db_config['USER'],
        port=int(db_config['PORT']), passwd=db_config['PASSWORD'], db=db_config['NAME'])
cursor = MySQLdb.cursors.SSCursor(connection)  # SSCursor for streaming mode
cursor.execute("SELECT * FROM event")
while True:
    record = cursor.fetchone()
    if record is None:
        break
    # Do something with record here

cursor.close()
connection.close()

Ref:


  1. Retrieving million of rows from MySQL
  2. How does MySQL result set streaming perform vs fetching the whole JDBC ResultSet at once