Note: this page is an English rendering of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license, link the original, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/10053050/
Why is Solr so much faster than Postgres?
Asked by cberner
I recently switched from Postgres to Solr and saw a ~50x speed up in our queries. The queries we run involve multiple ranges, and our data is vehicle listings. For example: "Find all vehicles with mileage < 50,000, $5,000 < price < $10,000, make=Mazda..."
I created indices on all the relevant columns in Postgres, so it should be a pretty fair comparison. Looking at the query plan in Postgres, though, it was still just using a single index and then scanning (I assume because it couldn't make use of all the different indices).
As I understand it, Postgres and Solr use vaguely similar data structures (B-trees), and they both cache data in-memory. So I'm wondering where such a large performance difference comes from.
What differences in architecture would explain this?
Answered by jpountz
First, Solr doesn't use B-trees. A Lucene (the underlying library used by Solr) index is made of read-only segments. For each segment, Lucene maintains a term dictionary, which consists of the list of terms that appear in the segment, lexicographically sorted. Looking up a term in this dictionary is done with a binary search, so the cost of a single-term lookup is O(log(t)), where t is the number of terms. By contrast, a lookup in a standard RDBMS index costs O(log(d)), where d is the number of documents. When many documents share the same value for some field, this can be a big win.
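A toy sketch of that idea (plain Python, not Lucene's actual on-disk format; the sample documents are made up) showing a sorted term dictionary with binary-search lookup, where lookup cost depends on the number of terms, not documents:

```python
import bisect
from collections import defaultdict

# Toy "segment": postings maps each term to the ids of documents
# containing it; many documents can share one term entry.
postings = defaultdict(list)
docs = {0: "red mazda sedan", 1: "blue mazda hatchback", 2: "red truck"}
for doc_id, text in docs.items():
    for term in text.split():
        postings[term].append(doc_id)

# The lexicographically sorted term dictionary.
terms = sorted(postings)

def lookup(term):
    """Binary-search the term dictionary, then return its postings list."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[terms[i]]
    return []

print(lookup("mazda"))  # -> [0, 1]: two docs found via one term entry
```

However many documents share the term "mazda", the dictionary still holds a single entry for it, which is the O(log(t)) vs O(log(d)) point above.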
Moreover, Lucene committer Uwe Schindler added support for very performant numeric range queries a few years ago. For every value of a numeric field, Lucene stores several values with different precisions. This allows Lucene to run range queries very efficiently. Since your use-case seems to leverage numeric range queries a lot, this may explain why Solr is so much faster. (For more information, read the javadocs, which are very interesting and give links to relevant research papers.)
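A much-simplified sketch of the multiple-precision idea (two hand-rolled precision levels and made-up doc ids; real Lucene uses a configurable precisionStep and a trie encoding): wide ranges are answered mostly from coarse buckets, with exact-value terms only at the ragged edges.

```python
from collections import defaultdict

index = defaultdict(set)  # term -> set of doc ids

def add_doc(doc_id, value):
    index[("prec0", value)].add(doc_id)       # exact value
    index[("prec4", value >> 4)].add(doc_id)  # coarse bucket of 16 values

docs = {1: 5, 2: 17, 3: 23, 4: 40, 5: 100}
for d, v in docs.items():
    add_doc(d, v)

def range_query(lo, hi):
    """Match [lo, hi] using one coarse term per fully covered bucket,
    falling back to exact-value terms at the range boundaries."""
    hits = set()
    v = lo
    while v <= hi:
        if v % 16 == 0 and v + 15 <= hi:
            hits |= index.get(("prec4", v >> 4), set())  # 16 values, 1 term
            v += 16
        else:
            hits |= index.get(("prec0", v), set())
            v += 1
    return hits

print(sorted(range_query(10, 45)))  # -> [2, 3, 4]
```

With only two levels the saving is modest; with several levels, as in Lucene, a huge numeric range decomposes into a small number of term lookups.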
But Solr can only do this because it doesn't have all the constraints that a RDBMS has. For example, Solr is very bad at updating a single document at a time (it prefers batch updates).
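The batch-update preference can be pictured with a toy write buffer (hypothetical class, not Solr's API): collect documents in memory and push them in one bulk request rather than one request per document.

```python
class BatchIndexer:
    """Toy illustration of the batch-update pattern Solr-style
    indexers favor; flushed_batches stands in for bulk requests."""

    def __init__(self, flush_size=100):
        self.flush_size = flush_size
        self.buffer = []
        self.flushed_batches = []

    def add(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushed_batches.append(list(self.buffer))
            self.buffer.clear()

indexer = BatchIndexer(flush_size=2)
for i in range(5):
    indexer.add({"id": i})
indexer.flush()  # push the ragged tail
print(len(indexer.flushed_batches))  # -> 3 bulk requests for 5 docs
```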
Answered by kgrittn
You didn't really say much about what you did to tune your PostgreSQL instance or your queries. It's not unusual to see a 50x speed up on a PostgreSQL query through tuning and/or restating your query in a format which optimizes better.
Just this week there was a report at work which someone had written using Java and multiple queries in a way which, based on how far it had gotten in four hours, was going to take roughly a month to complete. (It needed to hit five different tables, each with hundreds of millions of rows.) I rewrote it using several CTEs and a window function so that it ran in less than ten minutes and generated the desired results straight out of the query. That's a 4400x speed up.
Perhaps the best answer to your question has nothing to do with the technical details of how searches can be performed in each product, but more to do with ease of use for your particular use case. Clearly you were able to find the fast way to search with Solr with less trouble than PostgreSQL, and it may not come down to anything more than that.
I am including a short example of how text searches for multiple criteria might be done in PostgreSQL, and how a few little tweaks can make a large performance difference. To keep it quick and simple I'm just running War and Peace in text form into a test database, with each "document" being a single text line. Similar techniques can be used for arbitrary fields using the hstore type or JSON columns, if the data must be loosely defined. Where there are separate columns with their own indexes, the benefits to using indexes tend to be much bigger.
-- Create the table.
-- In reality, I would probably make tsv NOT NULL,
-- but I'm keeping the example simple...
CREATE TABLE war_and_peace
(
lineno serial PRIMARY KEY,
linetext text NOT NULL,
tsv tsvector
);
-- Load from downloaded data into database.
COPY war_and_peace (linetext)
FROM '/home/kgrittn/Downloads/war-and-peace.txt';
-- "Digest" data to lexemes.
UPDATE war_and_peace
SET tsv = to_tsvector('english', linetext);
-- Index the lexemes using GiST.
-- To use GIN just replace "gist" below with "gin".
CREATE INDEX war_and_peace_tsv
ON war_and_peace
USING gist (tsv);
-- Make sure the database has statistics.
VACUUM ANALYZE war_and_peace;
Once set up for indexing, I show a few searches with row counts and timings with both types of indexes:
-- Find lines with "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv @@ to_tsquery('english', 'gentlemen');
84 rows, gist: 2.006 ms, gin: 0.194 ms
-- Find lines with "ladies".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv @@ to_tsquery('english', 'ladies');
184 rows, gist: 3.549 ms, gin: 0.328 ms
-- Find lines with "ladies" and "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv @@ to_tsquery('english', 'ladies & gentlemen');
1 row, gist: 0.971 ms, gin: 0.104 ms
Now, since the GIN index was about 10 times faster than the GiST index you might wonder why anyone would use GiST for indexing text data. The answer is that GiST is generally faster to maintain. So if your text data is highly volatile the GiST index might win on overall load, while the GIN index would win if you are only interested in search time or for a read-mostly workload.
Without the index the above queries take anywhere from 17.943 ms to 23.397 ms since they must scan the entire table and check for a match on each row.
The GIN indexed search for rows with both "ladies" and "gentlemen" is over 172 times faster than a table scan in exactly the same database. Obviously the benefits of indexing would be more dramatic with bigger documents than were used for this test.
The setup is, of course, a one-time thing. With a trigger to maintain the tsv column, any changes made would instantly be searchable without redoing any of the setup.
With a slow PostgreSQL query, if you show the table structure (including indexes), the problem query, and the output from running EXPLAIN ANALYZE on your query, someone can almost always spot the problem and suggest how to get it to run faster.
UPDATE (Dec 9 '16)
I didn't mention what I used to get the prior timings, but based on the date it probably would have been the 9.2 major release. I just happened across this old thread and tried it again on the same hardware using version 9.6.1, to see whether any of the intervening performance tuning helps this example. The queries for only one argument only increased in performance by about 2%, but searching for lines with both "ladies" and "gentlemen" about doubled in speed to 0.053 ms (i.e., 53 microseconds) when using the GIN (inverted) index.
Answered by Mark O'Connor
Solr is designed primarily for searching data, not for storage. This enables it to discard much of the functionality required from an RDBMS. So it (or rather Lucene) concentrates on purely indexing data.
As you've no doubt discovered, Solr enables the ability to both search and retrieve data from its index. It's the latter (optional) capability that leads to the natural question... "Can I use Solr as a database?"
The answer is a qualified yes, and I refer you to the following:
- https://stackoverflow.com/questions/5814050/solr-or-database
- Using Solr search index as a database - is this "wrong"?
- For the Guardian, Solr is the new database
My personal opinion is that Solr is best thought of as a searchable cache between my application and the data mastered in my database. That way I get the best of both worlds.
Answered by Yavar
The biggest difference is that a Lucene/Solr index is like a single-table database without any support for relational queries (JOINs). Remember that an index is usually only there to support search, not to be the primary source of the data. So your database may be in "third normal form", but the index will be completely de-normalized and contain mostly just the data needed to be searched.
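A toy sketch of that de-normalization (hypothetical vehicle data echoing the question's example, not anyone's real schema): the "join" happens once at index time, so every search document is self-contained and queries never touch a second table.

```python
# Hypothetical normalized tables, as they might live in an RDBMS.
makes = {10: "Mazda", 11: "Toyota"}
vehicles = [
    {"id": 1, "make_id": 10, "price": 7500, "mileage": 42000},
    {"id": 2, "make_id": 11, "price": 9200, "mileage": 61000},
]

# Index-time join: each search document carries everything needed
# to answer a query on its own.
search_docs = [{**v, "make": makes[v["make_id"]]} for v in vehicles]

def search(docs, make, max_mileage, min_price, max_price):
    """Multi-criteria filter over flat, self-contained documents."""
    return [d["id"] for d in docs
            if d["make"] == make
            and d["mileage"] < max_mileage
            and min_price < d["price"] < max_price]

print(search(search_docs, "Mazda", 50000, 5000, 10000))  # -> [1]
```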
Another possible reason is that databases generally suffer from internal fragmentation; they need to perform too many semi-random I/O tasks to serve huge requests.
What that means is, for example, considering the index architecture of a database: the query leads to the indexes, which in turn lead to the data. If the data to recover is widely scattered, retrieving the result will take long, and that seems to be what happens in databases.
Answered by Tejas Patil
Solr (Lucene) creates an inverted index, which is where retrieving data gets quite a bit faster. I read that PostgreSQL also has a similar facility, but I'm not sure whether you had used it.
The performance differences that you observed can also be attributed to "what is being searched for?" and "what are the user queries?"