SQL: ActiveRecord find_each combined with limit and order
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA terms, link to the original, and attribute it to the original authors (not me): StackOverflow
Original link: http://stackoverflow.com/questions/15189937/
ActiveRecord find_each combined with limit and order
Asked by Avishai
I'm trying to run a query of about 50,000 records using ActiveRecord's find_each
method, but it seems to be ignoring my other parameters like so:
Thing.active.order("created_at DESC").limit(50000).find_each {|t| puts t.id }
Instead of stopping at the 50,000 limit I'd like, and sorting by created_at, here's the resulting query that gets executed over the entire dataset:
Thing Load (198.8ms) SELECT "things".* FROM "things" WHERE "things"."active" = 't' AND ("things"."id" > 373343) ORDER BY "things"."id" ASC LIMIT 1000
Is there a way to get similar behavior to find_each but with a total max limit and respecting my sort criteria?
Accepted answer by Dirk Geurs
The documentation says that find_each and find_in_batches don't retain sort order and limit because:
- Sorting ASC on the PK is used to make the batch ordering work.
- Limit is used to control the batch sizes.
You could write your own version of this function like @rorra did, but you can get into trouble when mutating the objects. If, for example, you sort by created_at and save the object, it might come up again in one of the next batches. Similarly, you might skip objects because the order of results has changed while executing the query for the next batch. Only use that solution with read-only objects.
Now my primary concern was that I didn't want to load 30,000+ objects into memory at once. My concern was not the execution time of the query itself. Therefore I used a solution that executes the original query but only caches the IDs. It then divides the array of IDs into chunks and queries/creates the objects per chunk. This way you can safely mutate the objects, because the sort order is kept in memory.
Here is a minimal example similar to what I did:
batch_size = 512
ids = Thing.order('created_at DESC').pluck(:id) # Replace the order with your own scope
ids.each_slice(batch_size) do |chunk|
  # Note: find with an :order option is Rails 3 syntax; on Rails 4+ use
  # Thing.where(id: chunk).order("field(id, #{chunk.join(',')})") instead.
  Thing.find(chunk, :order => "field(id, #{chunk.join(',')})").each do |thing|
    # Do things with thing
  end
end
The trade-offs to this solution are:
- The complete query is executed to get the IDs
- An array of all the IDs is kept in memory
- Uses the MySQL-specific FIELD() function (see the sketch after this list for a database-agnostic variant)
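If you are not on MySQL, a minimal sketch of a database-agnostic variant (an assumption on my part, not part of the original answer) is to drop FIELD() and restore each chunk's order in Ruby instead:

batch_size = 512
ids = Thing.order('created_at DESC').pluck(:id)
ids.each_slice(batch_size) do |chunk|
  # Fetch the chunk without a database-side ORDER BY...
  things_by_id = Thing.where(id: chunk).index_by(&:id)
  # ...then restore the original order from the in-memory id list.
  chunk.map { |id| things_by_id[id] }.compact.each do |thing|
    # Do things with thing
  end
end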
Hope this helps!
Answer by rorra
find_each uses find_in_batches under the hood.
It's not possible to select the order of the records; as described in find_in_batches, it is automatically set to ascending on the primary key ("id ASC") to make the batch ordering work.
However, the criteria are applied, so what you can do is:
Thing.active.find_each(batch_size: 50000) { |t| puts t.id }
Regarding the limit, it wasn't implemented yet: https://github.com/rails/rails/pull/5696
Answering your second question, you can create the logic yourself:
total_records = 50000
batch = 1000
(0..(total_records - batch)).step(batch) do |i|
  puts Thing.active.order("created_at DESC").offset(i).limit(batch).to_sql
end
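The snippet above only prints the generated SQL; a small sketch that actually processes the records under the same assumptions (not part of the original answer) would be:

total_records = 50_000
batch = 1_000
(0..(total_records - batch)).step(batch) do |i|
  Thing.active.order("created_at DESC").offset(i).limit(batch).each do |thing|
    # process thing
  end
end

Keep in mind that large OFFSET values get progressively slower, and rows inserted or deleted between iterations can shift the pages.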
Answer by Thomas Klemm
Retrieving the ids first and processing them in_groups_of:
ordered_photo_ids = Photo.order(likes_count: :desc).pluck(:id)
ordered_photo_ids.in_groups_of(1000, false).each do |photo_ids|
  photos = Photo.order(likes_count: :desc).where(id: photo_ids)
  # ...
end
It's important to also add the ORDER BY clause to the inner call, because where(id: photo_ids) alone does not guarantee that rows come back in the order of the id array.
Answer by x-yuri
One option is to put an implementation tailored to your particular model into the model itself (speaking of which, id is usually a better choice for ordering records; created_at may have duplicates):
class Thing < ActiveRecord::Base
  # Paginates on id while ordering by created_at, so it assumes ids
  # increase monotonically with created_at.
  def self.find_each_desc limit
    batch_size = 1000
    i = 1
    records = self.order(created_at: :desc).limit(batch_size)
    while records.any?
      records.each do |task|
        yield task, i
        i += 1
        return if i > limit
      end
      records = self.order(created_at: :desc).where('id < ?', records.last.id).limit(batch_size)
    end
  end
end
Or else you can generalize things a bit, and make it work for all the models:
lib/active_record_extensions.rb:
ActiveRecord::Batches.module_eval do
  def find_each_desc limit
    batch_size = 1000
    i = 1
    records = self.order(id: :desc).limit(batch_size)
    while records.any?
      records.each do |task|
        yield task, i
        i += 1
        return if i > limit
      end
      records = self.order(id: :desc).where('id < ?', records.last.id).limit(batch_size)
    end
  end
end

ActiveRecord::Querying.module_eval do
  delegate :find_each_desc, :to => :all
end
config/initializers/extensions.rb:
require "active_record_extensions"
P.S. I'm putting the code in files according to this answer.
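A quick usage sketch (not part of the original answer, assuming the extension above is loaded and Thing has the active scope from the question): the method yields each record together with a 1-based counter.

Thing.active.find_each_desc(50_000) do |thing, i|
  puts "#{i}: #{thing.id}"
end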
Answer by Lev Lukomsky
You can iterate backwards with standard Ruby iterators:
Thing.last.id.step(0,-1000) do |i|
  Thing.where(id: (i-1000+1)..i).order('id DESC').each do |thing|
    #...
  end
end
Note: the +1 is there because the BETWEEN that ends up in the query includes both bounds, but we need to include only one of them.
Sure, with this approach a batch could fetch fewer than 1000 records because some of them have already been deleted, but that is OK in my case.
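A quick illustration of the ranges this produces (an example of mine, assuming the last id is 5000 and a step of 1000):

5000.step(0, -1000).map { |i| (i - 1000 + 1)..i }
# => [4001..5000, 3001..4000, 2001..3000, 1001..2000, 1..1000, -999..0]
# Each BETWEEN covers exactly 1000 ids without overlapping the previous chunk;
# the final range matches no rows, so it is harmless.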
Answer by Martin
You can try the ar-as-batches gem.
From their documentation you can do something like this:
Users.where(country_id: 44).order(:joined_at).offset(200).as_batches do |user|
  user.party_all_night!
end
Answer by Moemars
I was looking for the same behaviour and came up with this solution. This DOES NOT order by created_at, but I thought I would post it anyway.
max_records_to_retrieve = 50000
last_index = Thing.count
start_index = [(last_index - max_records_to_retrieve), 0].max
Thing.active.find_each(:start => start_index) do |u|
  # do stuff
end
Drawbacks of this approach:
- You need 2 queries (the first one should be fast)
- It guarantees a maximum of 50K records, but if ids are skipped you will get fewer
Answer by tsauerwein
As remarked by @Kirk in one of the comments, find_each supports limit as of version 5.1.0.
Example from the changelog:
Post.limit(10_000).find_each do |post|
  # ...
end
The documentation says:
Limits are honored, and if present there is no requirement for the batch size: it can be less than, equal to, or greater than the limit.
(setting a custom order is still not supported though)
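So on Rails 5.1+ the limit from the original question can be expressed directly; a minimal sketch (not part of the original answer, using the question's Thing model and an arbitrary batch size):

Thing.active.limit(50_000).find_each(batch_size: 1_000) do |thing|
  puts thing.id
end

The order is still forced to "id ASC", so the created_at DESC requirement from the question still needs one of the other workarounds.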
Answer by the_spectator
Adding find_in_batches_with_order did solve my use case, where I already had the ids but needed batching and ordering. It was inspired by @dirk-geurs' solution:
# Create file config/initializers/find_in_batches_with_order.rb with the following code.
ActiveRecord::Batches.class_eval do
  ## Only flat order structure is supported now
  ## example: [:forename, :surname] is supported but [:forename, {surname: :asc}] is not supported
  ## Assumes at least one order column besides :id (pluck with a single column returns scalars, not arrays)
  def find_in_batches_with_order(ids: nil, order: [], batch_size: 1000)
    relation = self
    arrangement = order.dup
    index = order.find_index(:id)
    unless index
      arrangement.push(:id)
      index = arrangement.length - 1
    end
    ids ||= relation.order(*arrangement).pluck(*arrangement).map{ |tupple| tupple[index] }
    ids.each_slice(batch_size) do |chunk_ids|
      chunk_relation = relation.where(id: chunk_ids).order(*order)
      yield(chunk_relation)
    end
  end
end
Leaving Gist here https://gist.github.com/the-spectator/28b1176f98cc2f66e870755bb2334545
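A usage sketch (not part of the original answer, reusing the question's Thing model and scope): the method yields an ordered relation per chunk rather than individual records, so you iterate it yourself. Only flat, ascending orders are supported by this helper.

Thing.active.find_in_batches_with_order(order: [:created_at], batch_size: 500) do |chunk|
  chunk.each { |thing| puts thing.id }
end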
Answer by Hernan Damico
I had the same problem with a query using DISTINCT ON, where you need an ORDER BY on that field, so this is my approach with Postgres:
def filtered_model_ids
  Model.joins(:father_model)
       .select('DISTINCT ON (model.field) model.id')
       .order(:field)
       .map(&:id)
end

def processor
  filtered_model_ids.each_slice(BATCH_SIZE).lazy.each do |batch|
    Model.find(batch).each do |record|
      # Code
    end
  end
end