Original question: http://stackoverflow.com/questions/13464821/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
How do I reduce Elasticsearch scroll response time?
Asked by dranxo
I have a query returning ~200K hits from 7 different indices distributed across our cluster. I process my results as:
while (true) {
    // Fetch the next batch and renew the scroll keep-alive (600000 ms).
    scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
            .setScroll(new TimeValue(600000))
            .execute().actionGet();
    for (SearchHit hit : scrollResp.getHits()) {
        // process hit
    }
    // Break condition: no hits are returned
    if (scrollResp.getHits().getHits().length == 0) {
        break;
    }
}
I'm noticing that the client.prepareSearchScroll line can hang for quite some time before returning the next set of search hits. This seems to get worse the longer I run the code for.
My setup for the search is:
SearchRequestBuilder searchBuilder = client.prepareSearch(index_names)
        .setSearchType(SearchType.SCAN)
        .setScroll(new TimeValue(60000)) // TimeValue?
        .setQuery(qb)
        .setFrom(0) // ?
        .setSize(5000); // number of jsons to get in each search, what should it be? I have no idea.
SearchResponse scrollResp = searchBuilder.execute().actionGet();
Is it expected that scanning and scrolling just takes a long time when examining many results? I'm very new to Elastic Search so keep in mind that I may be missing something very obvious.
My query:
QueryBuilder qb = QueryBuilders.boolQuery().must(QueryBuilders.termsQuery("tweet", interesting_words));
Answered by imotov
.setSize(5000) means that each client.prepareSearchScroll call is going to retrieve 5000 records per shard. You are requesting the source back, and if your records are big, assembling 5000 records in memory might take a while. I would suggest trying a smaller number. Try 100 and 10 to see whether you get better performance.
.setFrom(0) is not necessary.
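To make the per-shard behaviour described above concrete, here is a back-of-the-envelope sketch in Java. The shard count is an assumption: the question only mentions 7 indices, and 5 primary shards per index is used here purely for illustration, so the real cluster may differ. `ScrollBatchEstimate` and `hitsPerRoundTrip` are illustrative names, not part of the Elasticsearch API.

```java
// Rough estimate of how many hits one scroll round trip can return when
// the size parameter is applied per shard (as with SearchType.SCAN).
final class ScrollBatchEstimate {

    // Upper bound on hits per round trip: every shard may contribute sizePerShard hits.
    static long hitsPerRoundTrip(int sizePerShard, int shardCount) {
        return (long) sizePerShard * shardCount;
    }

    public static void main(String[] args) {
        int indices = 7;         // from the question
        int shardsPerIndex = 5;  // ASSUMPTION: not stated in the question
        int shards = indices * shardsPerIndex;

        for (int size : new int[] {5000, 100, 10}) {
            System.out.println("size=" + size + " -> up to "
                    + hitsPerRoundTrip(size, shards) + " hits per round trip");
        }
    }
}
```

Under that assumption, a size of 5000 allows up to 175,000 hits in a single response, which is most of the ~200K result set, while sizes of 100 and 10 cap a response at 3,500 and 350 hits respectively.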
Answered by user1471465
I'm going to add another answer here, because I was very puzzled by this behaviour and it took me a long time to find the answer in the comments by @AaronM
This applies to ES 1.7.2, using the Java API.
I was scrolling/scanning an index of 500 million records, but with a query that returns about 400k rows.
I started off with a scroll size of 1,000 which seemed to me a reasonable trade-off in terms of network versus CPU.
This query ran terribly slowly, taking about 30 minutes to complete, with very long pauses between fetches from the cursor.
I worried that maybe it was just the query I was running and did not believe that decreasing the scroll size could help, as 1000 seemed tiny.
However, seeing AaronM's comment above, I tried a scroll size of 10.
The whole job completed in 30 seconds (and this was whether I had restarted ES or not, so presumably nothing to do with caching) - a speed-up of about 60x!!!
So if you're having performance problems with scroll/scan, I highly recommend trying a smaller scroll size. I couldn't find much about this on the internet, so I posted it here.
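One way to act on this advice is to time a full scroll at several sizes and compare. The sketch below is a minimal harness for that, under stated assumptions: `ScrollSizeTrial`, `drain`, and `fakeScroll` are hypothetical names, and `fakeScroll` is an in-memory stand-in so the example runs without a cluster. Timings against the stand-in say nothing about Elasticsearch; to measure a real cluster you would swap in a supplier that issues the scroll request and returns that batch's hits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class ScrollSizeTrial {

    // Outcome of draining one scroll: hits seen and elapsed wall-clock time.
    static final class Result {
        final int batchSize;
        final long hits;
        final long millis;

        Result(int batchSize, long hits, long millis) {
            this.batchSize = batchSize;
            this.hits = hits;
            this.millis = millis;
        }
    }

    // Pulls batches until an empty one arrives, counting hits and timing the run.
    static Result drain(int batchSize, Supplier<List<String>> nextBatch) {
        long start = System.nanoTime();
        long hits = 0;
        List<String> batch;
        while (!(batch = nextBatch.get()).isEmpty()) {
            hits += batch.size();
        }
        long millis = (System.nanoTime() - start) / 1_000_000;
        return new Result(batchSize, hits, millis);
    }

    // HYPOTHETICAL stand-in for a real scroll: serves totalHits fake documents
    // in batches of batchSize, then empty batches.
    static Supplier<List<String>> fakeScroll(int totalHits, int batchSize) {
        int[] served = {0};
        return () -> {
            int n = Math.min(batchSize, totalHits - served[0]);
            List<String> batch = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                batch.add("doc-" + (served[0] + i));
            }
            served[0] += n;
            return batch;
        };
    }

    public static void main(String[] args) {
        for (int size : new int[] {1000, 100, 10}) {
            Result r = drain(size, fakeScroll(400_000, size));
            System.out.println("size=" + r.batchSize + " hits=" + r.hits
                    + " elapsedMs=" + r.millis);
        }
    }
}
```

Running each size more than once, as the answer did across restarts, helps separate the effect of the scroll size from caching.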
Answered by Thomas Decaux
- Query a data node, not a client node or master node
- Select the fields you need with the filter_path property
- Set the scroll size according to your document size; there is no magic rule, so you must pick a value, try it, and adjust
- Monitor your network bandwidth
- If that's not enough, go for some multi-threading:
Remember that an Elasticsearch index is composed of multiple shards. This design means you can parallelize operations.
Let's say your index has 3 shards and your cluster has 3 nodes (it is good practice to have more nodes than shards per index).
You could run 3 Java "workers", each in a separate thread, with each one scrolling a different shard and node, and use a queue to "centralize" the results.
This way, you will get good performance!
This is what the elasticsearch-hadoop library does.
To retrieve shard/node details for an index, use the search shards API: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-shards.html
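The worker-and-queue layout described above could be sketched roughly as follows. This is only an illustration of the threading pattern, not the elasticsearch-hadoop implementation: `ShardScrollWorkers` and `collect` are hypothetical names, and each in-memory list stands in for one shard, with every `subList` slice playing the role of one scroll round trip to that shard.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

final class ShardScrollWorkers {

    // Unique marker object a worker enqueues once its shard is exhausted.
    private static final String DONE = new String("DONE");

    // Scrolls every shard in its own thread and gathers all hits through one queue.
    // Expects at least one shard, since the pool size equals the shard count.
    static List<String> collect(List<List<String>> shards, int batchSize)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());

        for (List<String> shard : shards) {
            pool.submit(() -> {
                try {
                    // HYPOTHETICAL: a real worker would issue scroll requests
                    // restricted to its own shard instead of slicing a list.
                    for (int from = 0; from < shard.size(); from += batchSize) {
                        int to = Math.min(from + batchSize, shard.size());
                        queue.addAll(shard.subList(from, to));
                    }
                } finally {
                    queue.add(DONE); // always signal completion, even on failure
                }
            });
        }
        pool.shutdown(); // no new tasks; running workers finish normally

        // Consumer side: drain the queue until every worker has reported DONE.
        List<String> results = new ArrayList<>();
        int finished = 0;
        while (finished < shards.size()) {
            String item = queue.take();
            if (item == DONE) { // identity check against the marker object
                finished++;
            } else {
                results.add(item);
            }
        }
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        List<List<String>> shards = Arrays.asList(
                Arrays.asList("a1", "a2", "a3"),
                Arrays.asList("b1", "b2"),
                Arrays.asList("c1"));
        List<String> all = collect(shards, 2);
        System.out.println(all.size() + " hits collected");
    }
}
```

The queue here is unbounded for simplicity; a bounded queue would make fast workers wait for a slow consumer instead of buffering every hit in memory. Hits from different shards arrive interleaved, so nothing should depend on their order.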
Answered by huuthang
You can read the documentation here
I think the TimeValue is how long to keep the scroll alive
setScroll(TimeValue keepAlive)
If set, will enable scrolling of the search request for the specified timeout.
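The question passes the keep-alive as raw millisecond counts (60000 and 600000), which is easy to misread. As a small sketch, the standard library's `TimeUnit` can make the intent explicit; `KeepAliveMillis` and `minutes` are illustrative names. The Elasticsearch `TimeValue` class also has factory methods such as `TimeValue.timeValueMinutes`, though it is worth checking that against the version you are running.

```java
import java.util.concurrent.TimeUnit;

final class KeepAliveMillis {

    // Converts a keep-alive expressed in minutes to the millisecond count
    // that new TimeValue(long) expects.
    static long minutes(long m) {
        return TimeUnit.MINUTES.toMillis(m);
    }

    public static void main(String[] args) {
        System.out.println(minutes(1));   // 60000, the value used in the initial search
        System.out.println(minutes(10));  // 600000, the value used in the scroll loop
    }
}
```

Because each scroll request renews the timeout, the keep-alive only has to cover the time spent processing one batch, not the whole job.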