Improving OFFSET performance in PostgreSQL

Disclaimer: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me) and cite the original: http://stackoverflow.com/questions/6618366/

Tags: database, postgresql, query-optimization

Asked by James Tauber

I have a table I'm doing an ORDER BY on before a LIMIT and OFFSET in order to paginate.

Adding an index on the ORDER BY column makes a massive difference to performance (when used in combination with a small LIMIT). On a 500,000 row table, I saw a 10,000x improvement adding the index, as long as there was a small LIMIT.

However, the index has no impact for high OFFSETs (i.e. later pages in my pagination). This is understandable: a b-tree index makes it easy to iterate in order from the beginning but not to find the nth item.

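To make the symptom concrete, here is a minimal sketch of the pattern being described; the items table and created_at column are illustrative stand-ins, not names from the question:

create index items_by_created_at on items (created_at);

-- fast: the planner walks the first 60 index entries and stops
select * from items order by created_at limit 20 offset 40;

-- slow: the 400000 skipped entries still have to be visited in index order
select * from items order by created_at limit 20 offset 400000;
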
It seems that what would help is a counted b-tree index, but I'm not aware of support for these in PostgreSQL. Is there another solution? It seems that optimizing for large OFFSETs (especially in pagination use-cases) isn't that unusual.

Unfortunately, the PostgreSQL manual simply says "The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large OFFSET might be inefficient."

Answered by Mike Ivanov

You might want a computed index.

Let's create a table:

create table sales(day date, amount real);

And fill it with some random stuff:

insert into sales 
    select current_date + s.a as day, random()*100 as amount
    from generate_series(1,20) as s(a);

Index it by day, nothing special here:

create index sales_by_day on sales(day);

Create a row position function. There are other approaches, this one is the simplest:

create or replace function sales_pos (date) returns bigint 
   as 'select count(day) from sales where day <= $1;' 
   language sql immutable;

Check if it works (don't call it like this on large datasets though):

select sales_pos(day), day, amount from sales;

     sales_pos |    day     |  amount  
    -----------+------------+----------
             1 | 2011-07-08 |  41.6135
             2 | 2011-07-09 |  19.0663
             3 | 2011-07-10 |  12.3715
    ..................

Now the tricky part: add another index computed on the sales_pos function values:

create index sales_by_pos on sales using btree(sales_pos(day));
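
One caveat worth spelling out: sales_pos is declared IMMUTABLE even though its result really depends on the table's contents. That small lie is what makes the expression indexable, but it also means the precalculated positions silently go stale once rows are inserted or deleted. A sketch of refreshing them after such changes, using the objects above:

-- rebuilding the index re-evaluates sales_pos(day) for every row
reindex index sales_by_pos;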

Here is how you use it. 5 is your "offset", 10 is the "limit":

select * from sales where sales_pos(day) >= 5 and sales_pos(day) < 5+10;

        day     | amount  
    ------------+---------
     2011-07-12 | 94.3042
     2011-07-13 | 12.9532
     2011-07-14 | 74.7261
    ...............

It is fast, because when you call it like this, Postgres uses precalculated values from the index:

explain select * from sales 
  where sales_pos(day) >= 5 and sales_pos(day) < 5+10;

                                    QUERY PLAN                                
    --------------------------------------------------------------------------
     Index Scan using sales_by_pos on sales  (cost=0.50..8.77 rows=1 width=8)
       Index Cond: ((sales_pos(day) >= 5) AND (sales_pos(day) < 15))

Hope it helps.

Answered by Flimzy

I don't know anything about "counted b-tree indexes", but one thing we've done in our application to help with this is to break our queries into two, possibly using a sub-query. My apologies for wasting your time if you're already doing this.

SELECT *
FROM massive_table
WHERE id IN (
    SELECT id
    FROM massive_table
    WHERE ...
    ORDER BY id  -- a deterministic ORDER BY is what makes LIMIT/OFFSET stable
    LIMIT 50
    OFFSET 500000
)
ORDER BY id;

The advantage here is that, while it still has to calculate the proper ordering of everything, it doesn't have to sort entire rows, only the id column.

Answered by Mike Sherrill 'Cat Recall'

It seems that optimizing for large OFFSETs (especially in pagination use-cases) isn't that unusual.

It seems a little unusual to me. Most people, most of the time, don't seem to skim through very many pages. It's something I'd support, but wouldn't work hard to optimize.

But anyway . . .

Since your application code knows which ordered values it's already seen, it should be able to reduce the result set and reduce the offset by excluding those values in the WHERE clause. Assuming you order by a single column, and it's sorted ascending, your app code can store the last value on the page, then add AND your-ordered-column-name > last-value-seen to the WHERE clause in some appropriate way.

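A minimal sketch of that idea, reusing the hypothetical sales table from the first answer; the literal date stands for the last value the previous page displayed:

-- page 1
select * from sales order by day limit 10;

-- next page: filter past the last value seen instead of using OFFSET
select * from sales where day > '2011-07-18' order by day limit 10;

If the ordering column is not unique, a tiebreaker (the primary key, say) has to be appended to both the ORDER BY and the comparison, otherwise rows sharing the boundary value can be skipped or repeated.
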
Answered by Rolintocour

Instead of using an OFFSET, a very efficient trick is to use a temporary table:

CREATE TEMPORARY TABLE just_index AS
SELECT ROW_NUMBER() OVER (ORDER BY myID) AS row_number, myID
FROM mytable;

For 10,000,000 rows it takes about 10 s to create. Then, when you want to SELECT or UPDATE your table, you simply use:

SELECT *
FROM mytable
INNER JOIN (
    SELECT just_index.myID
    FROM just_index
    WHERE row_number >= *your offset*
    LIMIT 1000000
) indexes ON mytable.myID = indexes.myID;

Filtering mytable with just_index is more efficient (in my case) with an INNER JOIN than with a WHERE myID IN (SELECT ...).

This way you don't have to store the last myID value; you simply replace the offset with a WHERE clause that uses indexes.

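For that WHERE clause to actually use an index, the temporary table needs one; a small addition, assuming the just_index table created above:

create index just_index_by_row on just_index (row_number);
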
Answered by user2928872

Recently I worked on a problem like this and wrote a blog post about how I faced it; I hope it is helpful to someone. I used a lazy-list approach with partial acquisition, replacing the query's LIMIT and OFFSET pagination with manual pagination. In my example, the select returns 10 million records; I fetch them and insert them into a "temporary table":

create or replace function load_records ()
returns void as $$
BEGIN
    -- number the rows with a fresh temp sequence while copying them
    drop sequence if exists temp_seq;
    create temp sequence temp_seq;
    insert into tmp_table
    SELECT linea.*
    FROM (
        select nextval('temp_seq') as ROWNUM, *
        from table1 t1
        join table2 t2 on (t2.fieldpk = t1.fieldpk)
        join table3 t3 on (t3.fieldpk = t2.fieldpk)
    ) linea;
END;
$$ language plpgsql;

After that, I can paginate without counting each row, using the assigned sequence instead:

-- counterrow is the tmp_table column that received the ROWNUM values
select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000;

From the Java perspective, I implemented this pagination through partial acquisition with a lazy list, that is, a list that extends AbstractList and implements the get() method. The get method can use a data access interface to fetch the next set of data and release the memory heap:

@Override
public E get(int index) {
    // when the requested index runs past the buffered window,
    // drop the current buffer and fetch the next partial batch
    if (bufferParcial.size() <= (index - lastIndexRoulette)) {
        lastIndexRoulette = index;
        bufferParcial = new ArrayList<E>();
        bufferParcial.addAll(daoInterface.getBufferParcial());
        if (bufferParcial.isEmpty()) {
            return null;
        }
    }
    return bufferParcial.get(index - lastIndexRoulette);
}

On the other hand, the data access interface uses a query to paginate and implements a method to iterate progressively, 25,000 records at a time, until all of them have been consumed.

Results for this approach can be seen here: http://www.arquitecturaysoftware.co/2013/10/laboratorio-1-iterar-millones-de.html
