Ruby on Rails: Is there a smarter way to reindex Elasticsearch?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/13851044/
Is there a smarter way to reindex elasticsearch?
Asked by concept47
I ask because our search is in a state of flux as we work things out, but each time we make a change to the index (change tokenizer or filter, or number of shards/replicas), we have to blow away the entire index and re-index all our Rails models back into Elasticsearch ... this means we have to factor in downtime to re-index all our records.
Is there a smarter way to do this that I'm not aware of?
Answered by gertas
I think @karmi got it right. However, let me explain it a bit more simply. I needed to occasionally upgrade the production schema with some new properties or analysis settings. I recently started using the scenario described below to do live, constant-load, zero-downtime index migrations. You can do it remotely.
Here are the steps:
Assumptions:
- You have the index real1 and the aliases real_write, real_read pointing to it (a minimal setup sketch follows this list),
- the client writes only to real_write and reads only from real_read,
- the _source property of the document is available.
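For reference, a minimal sketch of that starting state, using only Ruby's standard Net::HTTP (the host, index and alias names are the ones assumed above; the settings body is a placeholder):

require 'net/http'
require 'json'

Net::HTTP.start('esserver', 9200) do |http|
  # Create the physical index (settings and mappings here are placeholders).
  http.send_request('PUT', '/real1',
    { settings: { number_of_shards: 5 } }.to_json,
    'Content-Type' => 'application/json')

  # Point both aliases at it; the application only ever talks to the aliases.
  http.send_request('POST', '/_aliases', {
    actions: [
      { add: { index: 'real1', alias: 'real_write' } },
      { add: { index: 'real1', alias: 'real_read' } }
    ]
  }.to_json, 'Content-Type' => 'application/json')
end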
1. New index
Create the real2 index with the new mapping and settings of your choice.
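As a rough sketch (the settings and mappings below are placeholders for your new analyzer, tokenizer or shard count; the host is the same illustrative esserver:9200):

require 'net/http'
require 'json'

new_index = {
  settings: { number_of_shards: 5 },  # plus your new analysis settings
  mappings: {}                        # your new field mappings
}

Net::HTTP.start('esserver', 9200) do |http|
  http.send_request('PUT', '/real2', new_index.to_json,
                    'Content-Type' => 'application/json')
end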
2. Writer alias switch
Switch the write alias using the following bulk aliases request:
curl -XPOST 'http://esserver:9200/_aliases' -d '
{
"actions" : [
{ "remove" : { "index" : "real1", "alias" : "real_write" } },
{ "add" : { "index" : "real2", "alias" : "real_write" } }
]
}'
This is an atomic operation. From this time on, real2 is populated with the new client data on all nodes. Readers still use the old real1 via real_read. This is eventual consistency.
3. Old data migration
Data must be migrated from real1 to real2; however, new documents in real2 can't be overwritten with old entries. The migration script should use the bulk API with the create operation (not index or update). I use the simple Ruby script es-reindex, which has a nice E.T.A. status:
$ ruby es-reindex.rb http://esserver:9200/real1 http://esserver:9200/real2
UPDATE 2017: You may consider the new Reindex API instead of using the script. It has a lot of interesting features, like conflict reporting, etc.
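For example, a minimal sketch of the equivalent _reindex call (assuming Elasticsearch 2.3 or later): op_type create keeps the "don't overwrite newer documents" rule from this step, and conflicts: proceed makes version conflicts get counted instead of aborting the run.

require 'net/http'
require 'json'

reindex_body = {
  conflicts: 'proceed',                          # count conflicts, don't abort
  source: { index: 'real1' },
  dest:   { index: 'real2', op_type: 'create' }  # never overwrite newer docs
}

Net::HTTP.start('esserver', 9200) do |http|
  res = http.send_request('POST', '/_reindex', reindex_body.to_json,
                          'Content-Type' => 'application/json')
  puts res.body   # reports how many docs were created and how many conflicts occurred
end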
4. Reader alias switch
Now real2 is up to date and clients are writing to it; however, they are still reading from real1. Let's update the reader alias:
curl -XPOST 'http://esserver:9200/_aliases' -d '
{
"actions" : [
{ "remove" : { "index" : "real1", "alias" : "real_read" } },
{ "add" : { "index" : "real2", "alias" : "real_read" } }
]
}'
5. Backup and delete old index
Writes and reads now go to real2. You can back up and then delete the real1 index from the ES cluster.
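A minimal sketch of that last step, assuming a snapshot repository named my_backup has already been registered on the cluster:

require 'net/http'
require 'json'

Net::HTTP.start('esserver', 9200) do |http|
  # Snapshot only the old index into the (pre-registered) my_backup repository.
  http.send_request('PUT',
    '/_snapshot/my_backup/real1-final?wait_for_completion=true',
    { indices: 'real1' }.to_json,
    'Content-Type' => 'application/json')

  # Once the snapshot has completed successfully, drop the old index.
  http.send_request('DELETE', '/real1')
end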
Done!
Answered by karmi
Yes, there are smarter ways to re-index your data without downtime.
First, never, ever use the "final" index name as your real index name. So, if you'd like to name your index "articles", don't use that name as a physical index, but create an index such as "articles-2012-12-12" or "articles-A", "articles-1", etc.
Second, create an alias "alias" pointing to that index. Your application will then use this alias, so you'll never need to manually change the index name, restart the application, etc.
Third, when you want or need to re-index the data, re-index them into a different index, let's say "articles-B" -- all the tools in Tire's indexing toolchain support you here.
When you're done, point the alias to the new index. In this way, you not only minimize downtime (there isn't any), you also have a safe snapshot: if you somehow mess up the indexing into the new index, you can just switch back to the old one, until you resolve the issue.
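A small sketch of that pattern (index name, alias and host are illustrative; the mapping body is a placeholder):

require 'net/http'
require 'json'

physical = "articles-#{Time.now.strftime('%Y-%m-%d')}"   # e.g. articles-2012-12-12

Net::HTTP.start('esserver', 9200) do |http|
  # Create the physical index with your current mapping (placeholder body).
  http.send_request('PUT', "/#{physical}", { mappings: {} }.to_json,
                    'Content-Type' => 'application/json')

  # The application only ever knows the "articles" alias.
  http.send_request('POST', '/_aliases', {
    actions: [{ add: { index: physical, alias: 'articles' } }]
  }.to_json, 'Content-Type' => 'application/json')
end

Re-pointing the alias later (after re-indexing into, say, "articles-B") is the same atomic remove-and-add _aliases call shown in the answer above.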
Answered by Ari
Wrote up a blog post about how I handled reindexing with no downtime recently. Takes some time to figure out all the little things that need to be in place to do so. Hope this helps!
https://summera.github.io/infrastructure/2016/07/04/reindexing-elasticsearch.html
To summarize:
Step 1: Prepare New Index
Create your new index with your new mapping. This can be on the same instance of Elasticsearch or on a brand new instance.
Step 2: Keep Indexes Up To Date
While you're reindexing you want to keep both your new and old indexes up to date. For a write operation, this can be done by sending the write operation to a background worker on both the new and old index.
Deletes are a bit trickier because there is a race condition between deleting and reindexing the record into the new index. So, you'll want to keep track of the records that need to be deleted during your reindex and process these when you are finished. If you aren't performing many deletes, another way would be to eliminate the possibility of a delete during your reindex.
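As a rough illustration of the dual-write idea (the model, job and index names here are hypothetical, not taken from the blog post), a Rails model could enqueue every write against both indexes and record deletes for later replay:

class Article < ApplicationRecord
  # After every create/update, index the record into BOTH the old and new index.
  after_commit on: [:create, :update] do
    IndexArticleJob.perform_later(id, 'articles_old')
    IndexArticleJob.perform_later(id, 'articles_new')
  end

  # Deletes are applied to the old index immediately, but recorded so they can
  # be replayed against the new index once the bulk reindex has finished
  # (this avoids the delete-vs-reindex race described above).
  after_commit on: :destroy do
    DeleteArticleJob.perform_later(id, 'articles_old')
    PendingDeletion.create!(record_id: id)
  end
end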
Step 3: Perform Reindexing
You'll want to use a scrolled search for reading the data and the bulk API for inserting. Since after Step 2 you'll be writing new and updated documents to the new index in the background, you want to make sure you do NOT update existing documents in the new index with your bulk API requests.
This means that the operation you want for your bulk API requests is create, not index. From the documentation: “create will fail if a document with the same index and type exists already, whereas index will add or replace a document as necessary”. The main point here is you do not want old data from the scrolled search snapshot to overwrite new data in the new index.
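A condensed sketch of that scroll-and-bulk-create loop (host and index names are hypothetical; error handling and the delete replay are omitted; assumes a reasonably recent Elasticsearch):

require 'net/http'
require 'json'

old_index = 'articles_old'
new_index = 'articles_new'

Net::HTTP.start('esserver', 9200) do |http|
  # Open a scrolled search over the old index.
  res = JSON.parse(http.send_request('POST', "/#{old_index}/_search?scroll=5m",
          { size: 500, query: { match_all: {} } }.to_json,
          'Content-Type' => 'application/json').body)

  loop do
    hits = res.dig('hits', 'hits') || []
    break if hits.empty?

    # One "create" action per document: documents already written to the new
    # index by the live writers are NOT overwritten by this old snapshot.
    bulk = hits.flat_map do |h|
      [{ create: { _index: new_index, _id: h['_id'] } }.to_json, h['_source'].to_json]
    end.join("\n") + "\n"

    http.send_request('POST', '/_bulk', bulk, 'Content-Type' => 'application/x-ndjson')

    # Fetch the next page of the scroll.
    res = JSON.parse(http.send_request('POST', '/_search/scroll',
            { scroll: '5m', scroll_id: res['_scroll_id'] }.to_json,
            'Content-Type' => 'application/json').body)
  end
end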
There's a great script on github to help you with this process: es-reindex.
Step 4: Switch Over
Once you're finished reindexing, it's time to switch your search over to the new index. You'll want to turn deletes back on or process the enqueued delete jobs for the new index. You may notice that searching the new index is a bit slow at first. This is because Elasticsearch and the JVM need time to warm up.
Perform any code changes you need so your application starts searching the new index. You can continue writing to the old index in case you run into problems and need to roll back. If you feel this is unnecessary, you can stop writing to it.
Step 5: Clean Up
At this point you should be completely transitioned to the new index. If everything is going well, perform any necessary cleanup such as:
- Delete the old index host if it's different from the new one
- Remove serialization code related to your old index
Answered by Emil Hajric
Maybe create another index, reindex all the data onto that one, and then make the switch when it's done re-indexing?

