Original question: http://stackoverflow.com/questions/14465668/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Elastic search, multiple indexes vs one index and types for different data sets?
Asked by burzum
I have an application developed using the MVC pattern, and I would now like to index several of its models. Each model has a different data structure.
- Is it better to use multiple indexes, one for each model, or to have a type within the same index for each model? I think each approach would also require a different search query. I have only just started on this.
- Are there performance differences between the two approaches if the data set is small or huge?
I would test the second question myself if somebody could recommend some good sample data for that purpose.
Answered by Jonathan Moo
The two approaches have different implications.
Assuming you are using Elasticsearch's default settings (5 primary shards per index at the time of writing), having one index per model significantly increases your shard count: 1 index uses 5 shards, so 5 data models use 25 shards, whereas 5 object types in 1 index still use only 5 shards.
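The shard arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is made up, the default of 5 primary shards per index is the assumption stated in this answer, and replicas are ignored.

```python
def total_primary_shards(num_indices: int, shards_per_index: int = 5) -> int:
    """Primary shards used when every index has the same shard count."""
    return num_indices * shards_per_index


# One index per model: 5 models means 5 indices.
per_model = total_primary_shards(num_indices=5)
# One shared index that holds all 5 models as types.
shared = total_primary_shards(num_indices=1)

print(per_model, shared)  # 25 5
```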
Implications of having each data model as its own index:
- Searching within an index is efficient and fast, because each shard holds less data when the data is spread across separate indices.
- Searching a combination of data models across 2 or more indices generates overhead, because the query has to be sent to more shards across indices, and the results have to be merged and sent back to the user.
- Not recommended if your data set is small, since each additional shard costs extra storage and the performance gain is marginal.
- Recommended if your data set is big and your queries take a long time to process, since dedicated shards store each model's data, which makes it easier for Elasticsearch to process.
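As a rough illustration of the cross-index case, Elasticsearch accepts a comma-separated list of index names in the search path. The sketch below only builds that path; the helper name and the index names are hypothetical, and no request is sent.

```python
from typing import List


def search_path(indices: List[str]) -> str:
    """Build the REST path for a search spanning the given indices."""
    if not indices:
        raise ValueError("at least one index name is required")
    return "/" + ",".join(indices) + "/_search"


print(search_path(["users"]))              # /users/_search
print(search_path(["users", "articles"]))  # /users,articles/_search
```

The second path is the one that fans out to the shards of both indices, which is where the extra overhead described above comes from.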
Implications of having each data model as an object type within one index:
- More data is stored within the 5 shards of the index, which means less overhead when you query across different data models, but each shard will be significantly bigger.
- More data within each shard takes longer for Elasticsearch to search through, since there are more documents to filter.
- Not recommended if you know you will be handling 1 terabyte of data and you are not distributing it across different indices or multiple shards in your Elasticsearch index settings.
- Recommended for small data sets, because you will not waste storage space for a marginal performance gain; each shard takes up space on your hardware.
If you are asking how much data counts as too much versus small: it typically depends on the processor speed and RAM of your hardware, the amount of data you store in each field of your Elasticsearch mapping, and your query requirements; using many facets in your queries will slow down your response time significantly. There is no straightforward answer to this, and you will have to benchmark according to your needs.
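A minimal benchmarking harness might look like the sketch below. It is not tied to any Elasticsearch client: `run_query` is a hypothetical callable that you would replace with a real search against your own cluster, and `fake_query` is only a stand-in workload.

```python
import statistics
import time
from typing import Callable, List


def benchmark(run_query: Callable[[], object], repeats: int = 20) -> dict:
    """Time `run_query` several times and summarise latency in milliseconds."""
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
        "runs": len(timings),
    }


# Placeholder workload; swap in a real search call against your cluster.
def fake_query() -> int:
    return sum(range(10_000))


result = benchmark(fake_query, repeats=5)
print(result["runs"])  # 5
```

Running the same harness against both layouts (one index per model versus one shared index) with your own data and queries gives a like-for-like comparison.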
Answered by Danack
Although Jonathan's answer was correct at the time, the world has moved on, and it now seems that the people behind Elasticsearch have a long-term plan to drop support for multiple types:
"Where we want to get to: we want to remove the concept of types from Elasticsearch, while still supporting parent/child."
So for new projects, using only a single type per index will make the eventual upgrade to Elasticsearch 6.x easier.
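If several models still need to share one index, one possible way to stay within a single type is to record the model name in an ordinary field and filter on it. The sketch below assumes a field called `doc_type` (a made-up name) that is mapped as an exact-value field; the `bool`, `must`, `filter`, and `term` keys are standard query DSL, and the sample documents are invented.

```python
import json


def tag_document(model: str, source: dict) -> dict:
    """Return a copy of `source` with the model name stored in a regular field."""
    tagged = dict(source)
    tagged["doc_type"] = model  # hypothetical field name, not an Elasticsearch built-in
    return tagged


def query_for_model(model: str, text_query: dict) -> dict:
    """Wrap a query so that it only matches documents of one model."""
    return {
        "query": {
            "bool": {
                "must": [text_query],
                "filter": [{"term": {"doc_type": model}}],
            }
        }
    }


doc = tag_document("article", {"title": "Sharding basics"})
body = query_for_model("article", {"match": {"title": "sharding"}})
print(json.dumps(body, indent=2))
```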
Answered by Marcel Matus
Jonathan's answer is great. I would just add a few other points to consider:
- The number of shards can be customized for whichever solution you select. You may have one index with 15 primary shards, or split it into 3 indexes with 5 shards each; performance will not change (assuming the data is distributed equally).
- Think about data usage. E.g., if you use Kibana to visualize, it is easier to include or exclude particular indexes, whereas types have to be filtered in the dashboard.
- Data retention: for application log/metric data, use different indexes if you require different retention periods.
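The first and last points can be sketched together. `number_of_shards` and `number_of_replicas` are real index settings; everything else here is an assumption for illustration, including the helper names, the daily `prefix-YYYY.MM.DD` naming scheme, and the sample index names.

```python
from datetime import date, timedelta
from typing import List


def index_settings(primary_shards: int, replicas: int = 1) -> dict:
    """Request body for creating an index with an explicit shard count."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }


def expired_daily_indices(prefix: str, names: List[str], today: date,
                          keep_days: int) -> List[str]:
    """Return daily indices (named prefix-YYYY.MM.DD) older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if not name.startswith(prefix + "-"):
            continue  # belongs to another data set with its own retention
        year, month, day = name[len(prefix) + 1:].split(".")
        if date(int(year), int(month), int(day)) < cutoff:
            expired.append(name)
    return expired


names = ["applog-2024.01.01", "applog-2024.01.20", "metrics-2024.01.01"]
print(index_settings(15))
print(expired_daily_indices("applog", names, date(2024, 1, 31), keep_days=14))
# ['applog-2024.01.01']
```

Because each data set lives in its own indexes, dropping old data is a matter of deleting whole indexes, and each prefix can use its own `keep_days`.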
Answered by Sourav
Both the above answers are great!
I am adding an example of several types in one index. Suppose you are developing an app to search for books in a library. There are a few questions to ask the library owner:
Questions:
- How many books are you planning to store? 
- What kind of books are you going to store in the library? 
- How are you going to search for books? 
Answers:
- I am planning to store approximately 50k to 70k books.
- I will have 15k to 20k technology-related books (computer science, mechanical engineering, chemical engineering and so on), 15k history books, 10k medical science books, and 10k language-related books (English, Spanish and so on).
- Search by author's first name, author's last name, year of publication, and name of the publisher. (This gives you an idea of what information you should store in the index.)
From the answers above, the schema in our index should look something like this.
// This is not the exact mapping, just an example
        "properties": {
            "yearOfPublish": {
                "type": "integer"
            },
            "author": {
                "type": "object",
                "properties": {
                    "firstName": {
                        "type": "string"
                    },
                    "lastName": {
                        "type": "string"
                    }
                }
            },
            "publisherName": {
                "type": "string"
            }
        }
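A document that fits this schema, and a query covering two of the owner's search criteria, could look like the following sketch. The book data and the helper name are invented for illustration; the field names come from the schema above, and the query uses the standard `bool`, `match`, and `term` clauses.

```python
import json

# Hypothetical sample document that follows the schema above.
book = {
    "yearOfPublish": 2008,
    "author": {"firstName": "Jane", "lastName": "Doe"},
    "publisherName": "Example Press",
}


def author_year_query(last_name: str, year: int) -> dict:
    """Search body: match the author's last name, restricted to one year."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"author.lastName": last_name}}],
                "filter": [{"term": {"yearOfPublish": year}}],
            }
        }
    }


body = author_year_query("Doe", 2008)
print(json.dumps(body, indent=2))
```

Fields nested inside the `author` object are addressed with dot notation, which is why the query uses `author.lastName`.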
To achieve the above, we can create one index called Books that holds various types.
Index: Books
Types: Science, Arts
(Or you can create many types, such as Technology, Medical Science, History, and Language, if you have a lot more books.)
The important thing to note here is that the schema is similar but the data is not identical. The other important thing is the total amount of data you are storing.
I hope the above helps you decide when to use different types within an index. If you have different schemas, you should consider different indexes. A small index for less data, a big index for big data :-)

