使用 PostgreSQL 全文搜索排名的最佳方式

Question

提问by Timmmm

Following on from this answerI want to know what the best way to use PostgreSQL's built-in full text search is if I want to sort by rank, andlimit to only matching queries.

继此答案之后，我想知道使用 PostgreSQL 的内置全文搜索的最佳方法是，如果我想按排名排序，并限制为仅匹配查询。

Let's assume a very simple table.

让我们假设一个非常简单的表。

CREATE TABLE pictures (
  id SERIAL PRIMARY KEY,
  title varchar(300),
  ...
)

or whatever. Now I want to search the titlefield. First I create an index:

管他呢。现在我想搜索该title字段。首先我创建一个索引：

CREATE INDEX pictures_title ON pictures 
  USING gin(to_tsvector('english', title));

Now I want to search for 'small dog'. This works:

现在我想搜索'small dog'. 这有效：

SELECT pictures.id, 
       ts_rank_cd(
         to_tsvector('english', pictures.title), 'small dog'
       ) AS score
FROM pictures
ORDER BY score DESC

But what I really want is this:

但我真正想要的是：

SELECT pictures.id, 
       ts_rank_cd(
         to_tsvector('english', pictures.title), to_tsquery('small dog')
       ) AS score
FROM pictures 
WHERE to_tsvector('english', pictures.title) @@ to_tsquery('small dog')
ORDER BY score DESC

Or alternatively this (which doesn't work - can't use scorein the WHEREclause):

或者这个（这不起作用 - 不能score在WHERE子句中使用）：

SELECT pictures.id, 
       ts_rank_cd(
         to_tsvector('english', pictures.title), to_tsquery('small dog')
       ) AS score
FROM pictures WHERE score > 0
ORDER BY score DESC

What's the best way to do this? My questions are many-fold:

做到这一点的最佳方法是什么？我的问题是多方面的：

If I use the version with repeated to_tsvector(...)will it call that twice, or is it smart enough to cache the results somehow?
Is there a way to do it without repeating the to_ts...function calls?
Is there a way to use scorein the WHEREclause at all?
If there is, would it be better to filter by score > 0or use the @@thing?

如果我重复使用该版本，to_tsvector(...)它会调用两次，还是它足够聪明以某种方式缓存结果？
有没有办法在不重复to_ts...函数调用的情况下做到这一点？
有没有办法score在WHERE子句中使用？
如果有，过滤score > 0或使用这个@@东西会更好吗？

Answer 1

回答by isapir

The use of the @@operator will utilize the full text GIN index, while the test for score > 0would not.

@@运算符的使用将使用全文 GIN 索引，而测试则score > 0不会。

I created a table as in the Question, but added a column named title_tsv:

我在问题中创建了一个表，但添加了一个名为的列title_tsv：

CREATE TABLE test_pictures (
  id        BIGSERIAL,
  title     text,
  title_tsv tsvector
);

CREATE INDEX ix_pictures_title_tsv ON test_pictures 
    USING gin(title_tsv);

I populated the table with some test data:

我用一些测试数据填充了表格：

INSERT INTO test_pictures(title, title_tsv)
SELECT T.data, to_tsvector(T.data) 
FROM   some_table T;

Then I ran the previously accepted answer with explain analyze:

然后我运行了以前接受的答案explain analyze：

EXPLAIN ANALYZE 
SELECT  score, id, title
FROM (
    SELECT ts_rank_cd(P.title_tsv, to_tsquery('address & shipping')) AS score
        ,P.id        
        ,P.title
    FROM test_pictures as P
) S
WHERE score > 0
ORDER BY score DESC;

And got the following. Note the execution time of 5,015 ms

并得到以下内容。注意5,015 ms的执行时间

QUERY PLAN                                                                                                                                    |
----------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge  (cost=274895.48..323298.03 rows=414850 width=60) (actual time=5010.844..5011.330 rows=1477 loops=1)                             |
  Workers Planned: 2                                                                                                                          |
  Workers Launched: 2                                                                                                                         |
  ->  Sort  (cost=273895.46..274414.02 rows=207425 width=60) (actual time=4994.539..4994.555 rows=492 loops=3)                                |
        Sort Key: (ts_rank_cd(p.title_tsv, to_tsquery('address & shipping'::text))) DESC                                                      |
        Sort Method: quicksort  Memory: 131kB                                                                                                 |
        ->  Parallel Seq Scan on test_pictures p  (cost=0.00..247776.02 rows=207425 width=60) (actual time=17.672..4993.997 rows=492 loops=3) |
              Filter: (ts_rank_cd(title_tsv, to_tsquery('address & shipping'::text)) > '0'::double precision)                                 |
              Rows Removed by Filter: 497296                                                                                                  |
Planning time: 0.159 ms                                                                                                                       |
Execution time: 5015.664 ms                                                                                                                   |

Now compare that with the @@operator:

现在将其与@@运算符进行比较：

EXPLAIN ANALYZE
SELECT ts_rank_cd(to_tsvector(P.title), to_tsquery('address & shipping')) AS score
    ,P.id
    ,P.title
FROM    test_pictures as P
WHERE P.title_tsv @@ to_tsquery('address & shipping')
ORDER BY score DESC;

And the results coming in with an execution time of about 29 ms:

结果以大约 29 毫秒的执行时间出现：

QUERY PLAN                                                                                                                                       |
-------------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge  (cost=13884.42..14288.35 rows=3462 width=60) (actual time=26.472..26.942 rows=1477 loops=1)                                        |
  Workers Planned: 2                                                                                                                             |
  Workers Launched: 2                                                                                                                            |
  ->  Sort  (cost=12884.40..12888.73 rows=1731 width=60) (actual time=17.507..17.524 rows=492 loops=3)                                           |
        Sort Key: (ts_rank_cd(to_tsvector(title), to_tsquery('address & shipping'::text))) DESC                                                  |
        Sort Method: quicksort  Memory: 171kB                                                                                                    |
        ->  Parallel Bitmap Heap Scan on test_pictures p  (cost=72.45..12791.29 rows=1731 width=60) (actual time=1.781..17.268 rows=492 loops=3) |
              Recheck Cond: (title_tsv @@ to_tsquery('address & shipping'::text))                                                                |
              Heap Blocks: exact=625                                                                                                             |
              ->  Bitmap Index Scan on ix_pictures_title_tsv  (cost=0.00..71.41 rows=4155 width=0) (actual time=3.765..3.765 rows=1477 loops=1)  |
                    Index Cond: (title_tsv @@ to_tsquery('address & shipping'::text))                                                            |
Planning time: 0.214 ms                                                                                                                          |
Execution time: 28.995 ms                                                                                                                        |

As you can see in the execution plan, the index ix_pictures_title_tsvwas used in the second query, but not in the first one, making the query with the @@operator a whopping 172 times faster!

正如您在执行计划中所看到的，索引ix_pictures_title_tsv在第二个查询中使用，但在第一个查询中没有使用，这使得使用@@运算符的查询快了 172 倍！

Answer 2

回答by Clodoaldo Neto

select *
from (
    SELECT
        pictures.id,
        ts_rank_cd(to_tsvector('english', pictures.title), 
        to_tsquery('small dog')) AS score
    FROM pictures
) s
WHERE score > 0
ORDER BY score DESC

Answer 3

回答by Wolph

If I use the version with repeated to_tsvector(...) will it call that twice, or is it smart enough to cache the results somehow?

如果我使用重复 to_tsvector(...) 的版本，它会调用两次，还是它足够聪明以某种方式缓存结果？

The best way to notice these things is to do a simple explain, although those can be hard to read.

注意这些事情的最好方法是做一个简单的解释，尽管这些解释可能很难阅读。

Long story short, yes, PostgreSQL is smart enough to reuse computed results.

长话短说，是的，PostgreSQL 足够聪明，可以重用计算结果。

Is there a way to do it without repeating the to_ts... function calls?

有没有办法在不重复 to_ts... 函数调用的情况下做到这一点？

What I usually do is add a tsvcolumn which is the text search vector. If you make this auto update using triggers it immediately gives you the vector easily accessible but it also allows you to selectively update the search index by making the trigger selective.

我通常做的是添加一tsv列作为文本搜索向量。如果您使用触发器进行此自动更新，它会立即为您提供易于访问的向量，但它还允许您通过使触发器具有选择性来选择性地更新搜索索引。

Is there a way to use score in the WHERE clause at all?

有没有办法在 WHERE 子句中使用 score ？

Yes, but not with that name. Alternatively you could create a sub-query, but I would personally just repeat it.

是的，但不是用那个名字。或者，您可以创建一个子查询，但我个人会重复它。

If there is, would it be better to filter by score > 0 or use the @@ thing?

如果有，按分数> 0 过滤或使用@@ 会更好吗？

The simplest version I can think of is this:

我能想到的最简单的版本是这样的：

SELECT *
FROM pictures
WHERE 'small dog' @@ text_search_vector

The text_search_vectorcould obviously be replaced with something like to_tsvector('english', pictures.title)

该text_search_vector可以明显地喜欢的东西所取代to_tsvector('english', pictures.title)

使用 PostgreSQL 全文搜索排名的最佳方式

提问by Timmmm

回答by isapir

回答by Clodoaldo Neto

回答by Wolph

相关推荐

最近更新

标签

使用 PostgreSQL 全文搜索排名的最佳方式

提问by Timmmm

回答by isapir

回答by Clodoaldo Neto

回答by Wolph

相关推荐

postgresql 将 try/except 与 psycopg2 或“关闭”一起使用？

postgresql 共享内存设置

重命名 PostgreSQL 中的枚举项

postgresql jdbc 批处理性能

相关推荐

最近更新

标签