Python: IncompleteRead using httplib

Note: This page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/14149100/

IncompleteRead using httplib

Tags: python, feedparser, httplib

Asked by umeboshi

I have been having a persistent problem getting an rss feed from a particular website. I wound up writing a rather ugly procedure to perform this function, but I am curious why this happens and whether any higher level interfaces handle this problem properly. This problem isn't really a show stopper, since I don't need to retrieve the feed very often.

I have read about a solution that traps the exception and returns the partial content, yet since the incomplete reads differ in the number of bytes that are actually retrieved, I have no certainty that such a solution will actually work.

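For reference, the pattern I have read about looks roughly like the sketch below; it assumes the data received before the failure is exposed on the exception's partial attribute (the helper name is just for illustration):

import urllib2
from httplib import IncompleteRead

def fetch_with_partial(url):
    # Read the whole response, falling back to whatever was received
    # if the server triggers an IncompleteRead.
    response = urllib2.urlopen(url)
    try:
        return response.read()
    except IncompleteRead, e:
        # e.partial holds the bytes read before the exception was raised
        return e.partial

Whether that partial content is ever the complete document is exactly what I am unsure about. The full test script I have been using is below: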

#!/usr/bin/env python
import os
import sys
import feedparser
from mechanize import Browser
import requests
import urllib2
from httplib import IncompleteRead

url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'

content = feedparser.parse(url)
if 'bozo_exception' in content:
    print content['bozo_exception']
else:
    print "Success!!"
    sys.exit(0)

print "If you see this, please tell me what happened."

# try using mechanize
b = Browser()
r = b.open(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using mechanize", e

# try using urllib2
r = urllib2.urlopen(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using urllib2", e


# try using requests
try:
    r = requests.request('GET', url)
except IncompleteRead, e:
    print "IncompleteRead using requests", e

# this function is old and I categorized it as ...
# "at least it works darnnit!", but I would really like to 
# learn what's happening.  Please help me put this function into
# eternal rest.
def get_rss_feed(url):
    response = urllib2.urlopen(url)
    read_it = True
    content = ''
    while read_it:
        try:
            content += response.read(1)
        except IncompleteRead:
            read_it = False
    return content, response.info()


content, info = get_rss_feed(url)

feed = feedparser.parse(content)

As already stated, this isn't a mission-critical problem, just a curiosity: even though I can expect urllib2 to have this problem, I am surprised that this error is encountered in mechanize and requests as well. The feedparser module doesn't even throw an error, so checking for errors depends on the presence of a 'bozo_exception' key.

Edit: I just wanted to mention that both wget and curl perform the function flawlessly, retrieving the full payload correctly every time. I have yet to find a pure python method to work, excepting my ugly hack, and I am very curious to know what is happening on the backend of httplib. On a lark, I decided to also try this with twill the other day and got the same httplib error.

P.S. There is one thing that also strikes me as very odd. The IncompleteRead happens consistently at one of two breakpoints in the payload. It seems that feedparser and requests fail after reading 926 bytes, yet mechanize and urllib2 fail after reading 1854 bytes. This behavior is consistent, and I am left without an explanation or understanding.

Accepted answer by Blair

At the end of the day, all of the other modules (feedparser, mechanize, and urllib2) call httplib, which is where the exception is being thrown.

Now, first things first, I also downloaded this with wget and the resulting file was 1854 bytes. Next, I tried with urllib2:

>>> import urllib2
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> f = urllib2.urlopen(url)
>>> f.headers.headers
['Cache-Control: private\r\n',
 'Content-Type: text/xml; charset=utf-8\r\n',
 'Server: Microsoft-IIS/7.5\r\n',
 'X-AspNet-Version: 4.0.30319\r\n',
 'X-Powered-By: ASP.NET\r\n',
 'Date: Mon, 07 Jan 2013 23:21:51 GMT\r\n',
 'Via: 1.1 BC1-ACLD\r\n',
 'Transfer-Encoding: chunked\r\n',
 'Connection: close\r\n']
>>> f.read()
< Full traceback cut >
IncompleteRead: IncompleteRead(1854 bytes read)

So it is reading all 1854 bytes but then thinks there is more to come. If we explicitly tell it to read only 1854 bytes it works:

>>> f = urllib2.urlopen(url)
>>> f.read(1854)
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'

Obviously, this is only useful if we always know the exact length ahead of time. We can use the fact that the partial read is returned as an attribute on the exception to capture the entire contents:

>>> import httplib
>>> f = urllib2.urlopen(url)
>>> try:
...     contents = f.read()
... except httplib.IncompleteRead as e:
...     contents = e.partial
...
>>> contents
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'

This blog post suggests this is a fault of the server, and describes how to monkey-patch the httplib.HTTPResponse.read() method with the try..except block above to handle things behind the scenes:

import httplib

def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead, e:
            return e.partial

    return inner

httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)

I applied the patch and then feedparser worked:

>>> import feedparser
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> feedparser.parse(url)
{'bozo': 0,
 'encoding': 'utf-8',
 'entries': ...
 'status': 200,
 'version': 'rss20'}

This isn't the nicest way of doing things, but it seems to work. I'm not expert enough in the HTTP protocols to say for sure whether the server is doing things wrong, or whether httplib is mis-handling an edge case.

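For what it's worth, the headers shown above include Transfer-Encoding: chunked, and httplib raises IncompleteRead on a chunked response when the connection closes before the terminating zero-length chunk arrives. A rough way to see which side is at fault is to look at the raw bytes the server sends; the sketch below (assuming the server still responds the same way) just dumps the tail of the response:

import socket

host = 'hattiesburg.legistar.com'
path = ('/Feed.ashx?M=Calendar&ID=543375'
        '&GUID=83d4a09c-6b40-4300-a04b-f88884048d49'
        '&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)')

request = ('GET %s HTTP/1.1\r\n'
           'Host: %s\r\n'
           'Connection: close\r\n\r\n' % (path, host))

sock = socket.create_connection((host, 80))
sock.sendall(request)

# Read everything until the server closes the connection.
raw = ''
while True:
    data = sock.recv(4096)
    if not data:
        break
    raw += data
sock.close()

# A well-formed chunked body ends with a zero-length chunk: '0\r\n\r\n'.
# If that marker is missing, the server cut the response short and
# httplib's IncompleteRead is the expected reaction.
print repr(raw[-40:])

If the dump ends without the 0\r\n\r\n terminator, the server is the one misbehaving rather than httplib.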

Answered by Sérgio

I found that, in my case, sending an HTTP/1.0 request fixed the problem; I just added this to the code:

import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

after that I make the request:

req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()

afterwards I switch back to HTTP 1.1 (for connections that support 1.1) with:

httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
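
Putting those snippets together, a minimal sketch of this workaround applied to the feed from the question might look like the following (a plain GET, so the post body and headers from the snippet above are dropped):

import httplib
import urllib2

url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'

# Downgrade to HTTP/1.0 so the server should not use chunked
# transfer encoding (chunked framing is an HTTP/1.1 feature).
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

try:
    content = urllib2.urlopen(url).read()
finally:
    # Switch back to HTTP/1.1 for later connections that support it.
    httplib.HTTPConnection._http_vsn = 11
    httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'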

Answered by Abdul Majeed

I fixed the issue by using HTTPS instead of HTTP, and it works fine. No code change was required.

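In other words, only the URL scheme changes. Assuming the server also exposes the feed over HTTPS, the feedparser call from the question becomes:

import feedparser

# Same feed, requested over HTTPS instead of HTTP (this assumes the
# server serves the feed on both schemes).
url = 'https://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
content = feedparser.parse(url)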