Python elasticsearch-py scan and scroll to return all documents

Question

Asked by drowningincode

I am using elasticsearch-py to connect to my ES database, which contains over 3 million documents. I want to return all the documents so I can extract data and write it to a CSV. I was able to accomplish this easily for 10 documents (the default return) using the following code.

from elasticsearch import Elasticsearch

es = Elasticsearch("glycerin")
query = {"query": {"match_all": {}}}
response = es.search(index="_all", doc_type="patent", body=query)

for hit in response["hits"]["hits"]:
    print hit

Unfortunately, when I attempted to implement the scan & scroll so I could get all the documents I ran into issues. I tried it two different ways with no success.

Method 1:

scanResp = es.search(index="_all", doc_type="patent", body=query, search_type="scan", scroll="10m")
scrollId = scanResp['_scroll_id']

response = es.scroll(scroll_id=scrollId, scroll="10m")
print response

In the error output, after scroll/ it gives the scroll id and then ends with ?scroll=10m (Caused by <class 'httplib.BadStatusLine'>: ''))

Method 2:

from elasticsearch import helpers

query = {"query": {"match_all": {}}}
scanResp = helpers.scan(client=es, query=query, scroll="10m", index="", doc_type="patent", timeout="10m")

for resp in scanResp:
    print "Hiya"

If I print out scanResp before the for loop I get <generator object scan at 0x108723dc0>. Because of this I'm relatively certain that I'm messing up my scroll somehow, but I'm not sure where or how to fix it.

Results: Again, after scroll/ it gives the scroll id and then ends with ?scroll=10m (Caused by <class 'httplib.BadStatusLine'>: ''))

I tried increasing the max retries for the transport class, but that didn't make a difference. I would very much appreciate any insight into how to fix this.

Note: My ES is located on a remote desktop on the same network.

Answer 1

Accepted answer by chrstahl89

The Python scan method generates a GET call to the REST API and tries to send your scroll_id over HTTP as part of that request. The most likely cause here is that your scroll_id is too large to be sent this way, so the server returns no response and you see this error.

Because the scroll_id grows with the number of shards you have, it is better to use a POST and send the scroll_id in JSON as part of the request. This gets around the limitation of it being too large for an HTTP call.
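As a rough illustration of the POST approach, the sketch below builds such a request with only the Python 3 standard library. It assumes an Elasticsearch version whose /_search/scroll endpoint accepts a JSON body with "scroll" and "scroll_id" keys (older releases may expect the raw scroll id as the body instead). The helper name build_scroll_request, the port 9200, and the placeholder scroll id are all assumptions made for the example; only the host name "glycerin" comes from the question.

```python
import json
import urllib.request


def build_scroll_request(base_url, scroll_id, scroll="10m"):
    """Build a POST request that carries the scroll_id in a JSON body.

    Hypothetical helper: the endpoint path and body keys are assumed to
    match the scroll API of the Elasticsearch version in use.
    """
    payload = json.dumps({"scroll": scroll, "scroll_id": scroll_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_search/scroll",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder id, repeated to mimic a very long scroll_id; it travels in the
# request body, so the URL stays short regardless of the id's length.
request = build_scroll_request("http://glycerin:9200", "c2Nhbjs1Oz" * 500)

# To actually send it (needs a reachable cluster):
#     response = json.load(urllib.request.urlopen(request))
```

The URL here is the same length however large the scroll_id becomes, which is the point of moving it out of the GET path.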

Answer 2

Answer by zhaochl

Did your issue get resolved?

I have one simple solution: you must update the scroll_id every time you call the scroll method, like below:

response_tmp = es.scroll(scroll_id=scrollId, scroll="1m")
scrollId = response_tmp['_scroll_id']
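Put into a full loop, that idea might look like the sketch below. It is not the asker's code: iter_all_hits is a hypothetical helper, and it assumes a plain scroll search (no search_type="scan"), where the first response already contains a page of hits, and a client whose search() accepts body= and whose scroll() accepts scroll_id= and scroll=.

```python
def iter_all_hits(es, index, query, scroll="10m", size=1000):
    """Yield every hit for `query`, re-reading _scroll_id after each call.

    `es` is assumed to be an elasticsearch-py style client exposing
    search() and scroll(); `size` is the page size per request.
    """
    response = es.search(index=index, body=query, scroll=scroll, size=size)
    scroll_id = response["_scroll_id"]
    hits = response["hits"]["hits"]

    # An empty page means the scroll is exhausted.
    while hits:
        for hit in hits:
            yield hit
        # Always pass the most recent scroll_id, then store the new one.
        response = es.scroll(scroll_id=scroll_id, scroll=scroll)
        scroll_id = response["_scroll_id"]
        hits = response["hits"]["hits"]


# Usage, assuming `es` is a connected client:
#     for hit in iter_all_hits(es, "_all", {"query": {"match_all": {}}}):
#         handle(hit)  # `handle` is a placeholder for your own processing
```

Because it is a generator, only one page of hits is held in memory at a time, which matters when the index holds millions of documents.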
