elasticsearch vs. MongoDB for a filtering application

Disclaimer: the content on this page comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/12723239/

Date: 2020-09-09 12:52:25 | Source: igfitidea

elasticsearch vs. MongoDB for filtering application

Tags: mongodb, elasticsearch

Asked by matanster

This question is about making an architectural choice prior to delving into the details of experimentation and implementation. It concerns the suitability, in scalability and performance terms, of elasticsearch vs. MongoDB for a somewhat specific purpose.


Hypothetically, both store data objects that have fields and values, and both allow querying that body of objects. So presumably filtering out subsets of the objects by fields selected ad hoc is something both are fit for.


My application will revolve around selecting objects according to criteria. It would select objects by filtering on more than one field simultaneously; put differently, its query filtering criteria would typically comprise anywhere between 1 and 5 fields, maybe more in some cases. The fields chosen as filters would be a subset of a much larger set of fields. Picture some 20 field names existing, with each query attempting to filter the objects by a few of those 20 fields (there may be fewer or more than 20 field names overall; I just use this number to illustrate the ratio of available fields to fields actually used as filters in any single query). The filtering can be by the existence of the chosen fields as well as by their values, e.g. filtering out objects that have field A, whose field B is between x and y, and whose field C is equal to w.
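
For illustration only, such an ad-hoc compound filter might look roughly like this in MongoDB (a minimal sketch assuming Python with the pymongo driver, and made-up database, collection, and field names and placeholder values):

```python
from pymongo import MongoClient

# Hypothetical database and collection names
coll = MongoClient()["appdb"]["objects"]

# Placeholder bounds/values for the example
x, y, w = 10, 20, "some-value"

# Field A exists, field B is between x and y, field C equals w
matching = coll.find({
    "A": {"$exists": True},
    "B": {"$gte": x, "$lte": y},
    "C": w,
})
```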


My application will be doing this sort of filtering continuously, and there would be little or nothing constant about which fields are used for filtering at any given moment. Perhaps in elasticsearch indexes need to be defined, but maybe even without indexes its speed is on par with that of MongoDB.
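
Again purely as a sketch (assuming the elasticsearch-py 8.x client, a local node, and the same made-up index and field names as above), the same ad-hoc criteria could be expressed as a bool filter, which can be assembled per query without deciding in advance which fields will be filtered on:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node

# Same criteria as the MongoDB sketch: A exists, B in [10, 20], C == "some-value"
query = {
    "bool": {
        "filter": [
            {"exists": {"field": "A"}},
            {"range": {"B": {"gte": 10, "lte": 20}}},
            {"term": {"C": "some-value"}},
        ]
    }
}
hits = es.search(index="objects", query=query)
```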


As for the data getting into the store, there is nothing special about it: the objects would almost never be changed after having been inserted. Perhaps old objects would need to be dropped; I'd like to assume both data stores support expiring data, either internally or through a query made by the application. (Less frequently, objects matching a certain query would need to be dropped as well.)


What do you think? And have you experimented with this aspect?


I am interested in the performance and scalability of each of the two data stores for this kind of task. This is an architectural design question, and details of store-specific options or query cornerstones that would make it well architected are welcome as part of a fully thought-out suggestion.


Thanks!


Answered by gstathis

First off, there is an important distinction to make here: MongoDB is a general-purpose database, while Elasticsearch is a distributed text search engine backed by Lucene. People have been talking about using Elasticsearch as a general-purpose database, but be aware that this was not its original design. I think general-purpose NoSQL databases and search engines are headed for consolidation, but as it stands, the two come from two very different camps.


We are using both MongoDB and Elasticsearch at my company. We store our data in MongoDB and use Elasticsearch exclusively for its full-text search capabilities. We only send the subset of the mongo data fields that we need to query to elastic. Our use case differs from yours in that our Mongo data changes all the time: a record, or a subset of the fields of a record, can be updated several times a day, and this can call for re-indexing that record into elastic. For that reason alone, using elastic as the sole data store is not a good option for us, as we can't update individual fields; we would need to re-index a document in its entirety. This is not an elastic limitation; it is how Lucene, the underlying search engine behind elastic, works. In your case, the fact that records won't be changed once stored saves you from having to make that choice. Having said that, if data safety is a concern, I would think twice about using Elasticsearch as the only storage mechanism for your data. It may get there at some point, but I'm not sure it's there yet.
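
A minimal sketch of that pattern (assuming Python, the pymongo and elasticsearch-py 8.x clients, and made-up collection, index, and field names): the full document lives in MongoDB as the system of record, and only the queryable subset of fields gets indexed into Elasticsearch.

```python
from pymongo import MongoClient
from elasticsearch import Elasticsearch

mongo_coll = MongoClient()["appdb"]["records"]         # hypothetical collection
es = Elasticsearch("http://localhost:9200")            # assumed local node

SEARCHABLE_FIELDS = ["title", "description", "tags"]   # assumed subset of fields

def save_record(doc):
    # Store the full document in MongoDB (the primary store)
    result = mongo_coll.insert_one(doc)
    # Index only the searchable subset into Elasticsearch, keyed by the Mongo _id
    search_doc = {k: doc[k] for k in SEARCHABLE_FIELDS if k in doc}
    es.index(index="records", id=str(result.inserted_id), document=search_doc)
```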


In terms of speed, not only is Elastic/Lucene on par with Mongo's querying speed; in your case, where there is "very little constant in terms of which fields are used for the filtering at any moment", it could be orders of magnitude faster, especially as the datasets become larger. The difference lies in the underlying query implementations:


  • Elastic/Lucene uses the Vector Space Model and inverted indexes for information retrieval, which are highly efficient ways of comparing a record's similarity to a query. When you query Elastic/Lucene, it already knows the answer; most of its work lies in ranking the results by how likely they are to match your query terms. This is an important point: search engines, as opposed to databases, can't guarantee you exact results; they rank results by how close they get to your query. It just so happens that most of the time the results are close to exact.
  • Mongo's approach is that of a more general-purpose data store; it compares JSON documents against one another. You can get great performance out of it by all means, but you need to carefully craft your indexes to match the queries you will be running. Specifically, if you have multiple fields by which you will query, you need to carefully craft your compound keys so that they reduce the dataset to be queried as quickly as possible. E.g. your first key should filter down the majority of your dataset, your second should further filter down what is left, and so on and so forth (see the sketch after this list). If your queries don't match the keys and the order of those keys in the defined indexes, your performance will drop quite a bit. On the other hand, Mongo is a true database, so if accuracy is what you need, the answers it gives will be spot on.
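
To illustrate that compound-key advice (a sketch only, assuming Python with pymongo and made-up field names), the index keys are ordered so the first one cuts the dataset down the most, and queries that filter on the keys in that order can use the index efficiently:

```python
from pymongo import ASCENDING, DESCENDING, MongoClient

coll = MongoClient()["appdb"]["objects"]  # hypothetical collection

# Compound index: the first key should filter down the majority of the data,
# the second narrows what is left, and so on (field names are made up).
coll.create_index([
    ("status", ASCENDING),
    ("category", ASCENDING),
    ("created_at", DESCENDING),
])

# A query whose fields match the index keys, in order, can use this index.
cursor = coll.find({
    "status": "active",
    "category": "sensor",
    "created_at": {"$gte": 1577836800},  # placeholder timestamp
})
```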

For expiring old records, Elastic has a built-in TTL feature. Mongo just introduced it as of version 2.2, I think.
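
On the MongoDB side, that 2.2 feature takes the form of a TTL index; a minimal sketch (assuming pymongo and a made-up timestamp field holding a BSON date) could look like this:

```python
from pymongo import MongoClient

coll = MongoClient()["appdb"]["objects"]  # hypothetical collection

# TTL index: documents are removed roughly one hour after the time
# stored in their "createdAt" field (which must be a date value).
coll.create_index("createdAt", expireAfterSeconds=3600)
```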


Since I don't know your other requirements such as expected data size, transactions, accuracy or what your filters will look like, it's hard to make any specific recommendations. Hopefully, there is enough here to get you started.
