
Disclaimer: this page mirrors a popular StackOverflow question under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow CC BY-SA and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/5285787/

Date: 2020-10-20 22:53:54 | Source: igfitidea

Full-text search in CouchDB

Tags: performance, postgresql, indexing, full-text-search, couchdb

Asked by Jan L.

I have a problem and hope to get an answer from you :-)

So, I took geonames.org and imported all their data of German cities with all districts.

If I enter "Hamburg", it lists "Hamburg Center, Hamburg Airport" and so on. The application is in a closed network with no access to the internet, so I can't access the geonames.org web services and have to import the data. :( The city with all of its districts works as an auto complete. So each key hit results in an XHR request and so on.

Now my customer asked whether it is possible to have all data of the world in it. Finally, about 5.000.000 rows with 45.000.000 alternative names etc.

Postgres needs about 3 seconds per query which makes the auto complete unusable.

Now I thought of CouchDb, have already worked with it. My question:

I would like to post "Ham" and I want CouchDB to get all documents starting with "Ham". If I enter "Hamburg" I want it to return Hamburg and so forth.

Is CouchDB the right database for it? Which other DBs can you recommend that respond with low latency (maybe in-memory) and millions of datasets? The dataset doesn't change regularly; it's rather static!

Answered by ssmir

If I understand your problem right, probably all you need is already built in the CouchDB.

  1. To get a range of documents with names beginning with e.g. "Ham". You may use a request with a string range: startkey="Ham"&endkey="Ham\ufff0"
  2. If you need a more comprehensive search, you may create a view containing names of other places as keys. So you again can query ranges using the technique above.

Here is a view function to make this:

function(doc) {
    // doc.places is assumed to be an object keyed by place name;
    // if it is an array of names, iterate the values instead of the keys.
    for (var name in doc.places) {
        emit(name, doc._id);
    }
}
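The startkey/endkey trick from step 1 relies on CouchDB returning view rows in sorted key order. Its selection logic can be sketched in plain JavaScript (the sample names are made up, and CouchDB's actual ICU collation is only approximated here by plain string comparison, which coincides for ASCII prefixes):

```javascript
// Simulate a CouchDB view range query: all keys between
// startkey and endkey in a sorted key space.
var keys = ['Berlin', 'Hamburg', 'Hamburg Airport', 'Hamburg Center', 'Hannover'].sort();

function rangeQuery(keys, startkey, endkey) {
    return keys.filter(function (k) {
        return k >= startkey && k <= endkey;
    });
}

// Equivalent of ?startkey="Ham"&endkey="Ham\ufff0":
// "\ufff0" sorts after any character that can follow the prefix,
// so the range covers exactly the keys starting with "Ham".
var matches = rangeQuery(keys, 'Ham', 'Ham\ufff0');
```

This is why no extra index or search engine is needed for plain prefix matching: the B-tree behind the view already stores keys in this order.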

Also see the CouchOne blog post about CouchDB typeahead and autocomplete search and this discussion on the mailing list about CouchDB autocomplete.

Answered by Erwin Brandstetter

Optimized search with PostgreSQL

Your search is anchored at the start and no fuzzy search logic is required. This is not the typical use case for full text search.

If it gets more fuzzy or your search is not anchored at the start, look here for more:
Similar UTF-8 strings for autocomplete field
More on pattern matching in Postgres.

In PostgreSQL you can make use of advanced index features that should make the query very fast. In particular, look at operator classes and indexes on expressions.

1) text_pattern_ops

Assuming your column is of type text, you would use a special index for text pattern operators like this:

CREATE INDEX name_text_pattern_ops_idx
ON tbl (name text_pattern_ops);

SELECT name
FROM   tbl
WHERE  name ~~ ('Hambu' || '%');

This is assuming that you operate with a database locale other than C, most likely de_DE.UTF-8 in your case. You could also set up a database with locale 'C'. I quote the manual here:

If you do use the C locale, you do not need the xxx_pattern_ops operator classes, because an index with the default operator class is usable for pattern-matching queries in the C locale.
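For completeness, creating such a C-locale database can be sketched like this (the database name geonames is a placeholder, not from the answer; template0 is required when the locale differs from the template's):

```sql
-- Hypothetical sketch: a database created with the C locale,
-- where an index with the default operator class already supports
-- LIKE/~~ pattern matching and no text_pattern_ops index is needed.
CREATE DATABASE geonames
    TEMPLATE   template0   -- required when overriding the template locale
    LC_COLLATE 'C'
    LC_CTYPE   'C';
```

Note that a C locale changes sort order for non-ASCII names, which may matter for German place names with umlauts.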



2) Index on expression

I'd imagine you would also want to make that search case-insensitive. So let's take another step and make that an index on an expression:

CREATE INDEX lower_name_text_pattern_ops_idx
ON tbl (lower(name) text_pattern_ops);

SELECT name
FROM   tbl
WHERE  lower(name) ~~ (lower('Hambu') || '%');

To make use of the index, the WHERE clause has to match the index expression.



3) Optimize index size and speed

Finally, you might also want to impose a limit on the number of leading characters to minimize the size of your index and speed things up even further:

CREATE INDEX lower_left_name_text_pattern_ops_idx
ON tbl (lower(left(name,10)) text_pattern_ops);

SELECT name
FROM   tbl
WHERE  lower(left(name,10)) ~~ (lower('Hambu') || '%');

left() was introduced with Postgres 9.1. Use substring(name, 1, 10) in older versions.



4) Cover all possible requests

What about strings with more than 10 characters?

SELECT name
FROM   tbl
WHERE  lower(left(name,10)) ~~ (lower(left('Hambu678910',10)) || '%')
AND    lower(name) ~~ (lower('Hambu678910') || '%');

This looks redundant, but you need to spell it out this way to actually use the index. Index search will narrow it down to a few entries, and the additional clause filters the rest. Experiment to find the sweet spot; it depends on data distribution and typical use cases. 10 characters seem like a good starting point. For more than 10 characters, left() effectively turns into a very fast and simple hashing algorithm that's good enough for many (but not all) use cases.



5) Optimize disc representation with CLUSTER

So, the predominant access pattern will be to retrieve a bunch of adjacent rows according to our index lower_left_name_text_pattern_ops_idx. And you mostly read and hardly ever write. This is a textbook case for CLUSTER. I quote the manual:

When a table is clustered, it is physically reordered based on the index information.

With a huge table like yours, this can dramatically improve response time because all rows to be fetched are in the same or adjacent blocks on disk.

First call:

CLUSTER tbl USING lower_left_name_text_pattern_ops_idx;

The information about which index to use is saved, so successive calls will re-cluster the table:

CLUSTER tbl;
CLUSTER;    -- cluster all tables in the db that have previously been clustered.

If you don't want to repeat it:

ALTER TABLE tbl SET WITHOUT CLUSTER;

For tables with more write load, look into pg_repack, which can do the same without an exclusive lock on the table.



6) Prevent too many rows in the result

Demand a minimum of, say, 3 or 4 characters for the search string. I add this for completeness; you probably do it anyway.
And LIMIT the number of rows returned:


SELECT name
FROM   tbl
WHERE  lower(left(name,10)) ~~ (lower('Hambu') || '%')
LIMIT  501;

If your query returns more than 500 rows, tell the user to narrow down his search.
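On the application side, the LIMIT 501 trick works by fetching one row more than you display. A sketch (the function and variable names are mine, not from the answer):

```javascript
// Fetch up to MAX_RESULTS + 1 rows; if all 501 arrive, the real
// result set exceeds 500 and the user should narrow the search.
var MAX_RESULTS = 500;

function handleRows(rows) {
    if (rows.length > MAX_RESULTS) {
        return { rows: rows.slice(0, MAX_RESULTS), truncated: true };
    }
    return { rows: rows, truncated: false };
}

var result = handleRows(new Array(501).fill('someRow'));
// result.truncated is true here, so the UI should ask the user
// to type a few more characters instead of rendering 500 rows.
```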



7) Optimize filter method (operators)

If you absolutely must squeeze out every last microsecond, you can utilize operators of the text_pattern_ops family. Like this:

SELECT name
FROM   tbl
WHERE  lower(left(name, 10)) ~>=~ lower('Hambu')
AND    lower(left(name, 10)) ~<=~ (lower('Hambu') || chr(2097151));

You gain very little with this last stunt. Normally, standard operators are a better choice.



If you do all that, search time will be reduced to a matter of milliseconds.

Answered by deluan

I think a better approach is to keep your data in your database (Postgres or CouchDB) and index it with a full-text search engine, like Lucene, Solr, or ElasticSearch.
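For the ElasticSearch route, a prefix-style autocomplete request can be sketched with the stock match_phrase_prefix query (the index name places and the field name are assumptions; adjust them to your mapping):

```javascript
// Build the JSON body for an ElasticSearch autocomplete request.
// match_phrase_prefix matches documents whose field starts with
// the typed phrase; "size" caps the number of suggestions returned.
function prefixQuery(term) {
    return {
        query: {
            match_phrase_prefix: {
                name: { query: term }
            }
        },
        size: 20
    };
}

// POST this body as JSON to e.g. http://localhost:9200/places/_search
var body = prefixQuery('Ham');
```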

Having said that, there's a project integrating CouchDB with Lucene.
