python 如何在python中下载具有正确字符集的任何(!)网页?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1495627/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
How to download any(!) webpage with correct charset in python?
提问by Tarnay Kálmán
Problem
问题
When screen-scraping a webpage using Python one has to know the character encoding of the page. If you get the character encoding wrong, then your output will be messed up.
使用 python 抓取网页时,必须知道页面的字符编码。如果您的字符编码错误,那么您的输出就会混乱。
People usually use some rudimentary technique to detect the encoding. They either use the charset from the header, or the charset defined in the meta tag, or they use an encoding detector (which does not care about meta tags or headers). By using only one of these techniques, sometimes you will not get the same result as you would in a browser.
人们通常使用一些基本的技术来检测编码。他们要么使用 HTTP 标头中的字符集,要么使用元标记中定义的字符集,要么使用编码检测器(它不关心元标记或标头)。仅使用其中一种技术,有时您将无法获得与在浏览器中相同的结果。
Browsers do it this way:
浏览器这样做:
- Meta tags (or the XML declaration) always take precedence
- The encoding defined in the header is used when there is no charset defined in a meta tag
- If the encoding is not defined at all, then it is time for encoding detection.
- 元标记(或 XML 声明)始终优先
- 当元标记中没有定义字符集时,使用标头中定义的编码
- 如果根本没有定义编码,那么就该进行编码检测了。
(Well... at least that is the way I believe most browsers do it. Documentation is really scarce.)
(嗯……至少我相信大多数浏览器都是这样做的。文档真的很稀缺。)
What I'm looking for is a library that can decide the character set of a page the way a browser would. I'm sure I'm not the first who needs a proper solution to this problem.
我正在寻找的是一个可以像浏览器那样决定页面字符集的库。我敢肯定,我不是第一个需要适当解决此问题的人。
Solution (I have not tried it yet...)
解决方案(我还没试过...)
According to Beautiful Soup's documentation:
根据 Beautiful Soup 的文档:
Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:
Beautiful Soup 按优先级尝试以下编码,将您的文档转换为 Unicode:
- An encoding you pass in as the fromEncoding argument to the soup constructor.
- An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
- An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.
- UTF-8
- Windows-1252
- 您作为 fromEncoding 参数传递给 soup 构造函数的编码。
- 在文档本身中发现的一种编码:例如,在 XML 声明或(对于 HTML 文档)http-equiv META 标签中。如果 Beautiful Soup 在文档中发现这种编码,它会从头开始重新解析文档并尝试新的编码。唯一的例外是,如果您明确指定了一种编码,并且该编码确实有效:那么它将忽略它在文档中找到的任何编码。
- 通过查看文件的前几个字节来嗅探的编码。如果在此阶段检测到编码,它将是 UTF-* 编码、EBCDIC 或 ASCII 之一。
- 由 chardet 库嗅探的编码,如果你安装了它。
- UTF-8
- Windows-1252
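As a quick illustration of that priority list, here is a minimal sketch in the same Python 2 / BeautifulSoup 3 style as the answers below (in bs4 the argument is named from_encoding and the attribute original_encoding); raw_bytes is a hypothetical variable holding the undecoded page body:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

raw_bytes = open('page.html', 'rb').read()    # hypothetical: undecoded page bytes

# Let the priority list above run: document declaration, byte sniffing,
# chardet (if installed), then UTF-8 and finally Windows-1252.
soup = BeautifulSoup(raw_bytes)
print soup.originalEncoding                   # the encoding Beautiful Soup settled on

# Or short-circuit the list by passing an encoding explicitly:
soup = BeautifulSoup(raw_bytes, fromEncoding='windows-1252')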
回答by Martin v. Löwis
When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted:
当您使用 urllib 或 urllib2 下载文件时,您可以查明是否传输了字符集标头:
import urllib2  # Python 2; on Python 3 use urllib.request and fp.headers.get_content_charset()
fp = urllib2.urlopen(request)
charset = fp.headers.getparam('charset')  # None if no charset parameter was sent
You can use BeautifulSoup to locate a meta element in the HTML:
您可以使用 BeautifulSoup 在 HTML 中定位元元素:
import BeautifulSoup  # BeautifulSoup 3
soup = BeautifulSoup.BeautifulSoup(data)
meta = soup.findAll('meta', {'http-equiv': lambda v: v and v.lower() == 'content-type'})
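If such a meta element is found, the charset still has to be pulled out of its content attribute (typically something like "text/html; charset=iso-8859-1"). A small sketch of that step, where the regular expression and the variable names are the editor's assumptions rather than part of the original answer:

import re

charset_from_meta = None
if meta:
    content = meta[0].get('content', '')
    m = re.search(r'charset=([\w-]+)', content, re.I)
    if m:
        charset_from_meta = m.group(1)   # e.g. 'iso-8859-1'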
If neither is available, browsers typically fall back to user configuration, combined with auto-detection. As rajax proposes, you could use the chardet module. If you have user configuration available telling you that the page should be Chinese (say), you may be able to do better.
如果两者都不可用,浏览器通常会回退到用户配置,并结合自动检测。正如 rajax 所建议的,您可以使用 chardet 模块。如果您有可用的用户配置告诉您页面应该是中文(比如说),您可能会做得更好。
回答by rajax
Use the Universal Encoding Detector:
使用通用编码检测器:
>>> import urllib2
>>> import chardet
>>> chardet.detect(urllib2.urlopen("http://google.cn/").read())
{'encoding': 'GB2312', 'confidence': 0.99}
The other option would be to just use wget:
另一种选择是只使用 wget:
import os
h = os.popen('wget -q -O foo1.txt http://foo.html')
h.close()
s = open('foo1.txt').read()
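A slightly tidier variant of the same idea, shelling out with subprocess instead of os.popen so the temporary file is not needed; this is a sketch by the editor, not part of the original answer, and it assumes wget is on the PATH:

import subprocess

# '-O -' makes wget write the body to stdout, so we can capture it directly
s = subprocess.check_output(['wget', '-q', '-O', '-', 'http://example.com/'])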
回答by Gareth Simpson
It seems like you need a hybrid of the answers presented:
似乎您需要混合提供的答案:
- Fetch the page using urllib
- Find <meta> tags using Beautiful Soup or some other method
- If no meta tags exist, check the headers returned by urllib
- If that still doesn't give you an answer, use the universal encoding detector.
- 使用 urllib 获取页面
- 使用 Beautiful Soup 或其他方法查找 <meta> 标签
- 如果不存在元标记,请检查 urllib 返回的标头
- 如果这仍然没有给您答案,请使用通用编码检测器。
I honestly don't believe you're going to find anything better than that.
老实说,我不相信你会找到比这更好的东西。
In fact, if you read further into the FAQ you linked to in the comments on the other answer, that's what the author of the detector library advocates.
事实上,如果您进一步阅读在另一个答案的评论中链接到的常见问题解答,这就是检测器库的作者所提倡的。
If you believe the FAQ, this is what the browsers do (as requested in your original question), as the detector is a port of the Firefox sniffing code.
如果您相信常见问题解答,这就是浏览器所做的(按照您原始问题的要求),因为检测器是 Firefox 嗅探代码的端口。
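A minimal sketch of that hybrid approach, written in the same Python 2 / BeautifulSoup 3 style as the earlier answers; the helper name and the charset regular expression are the editor's assumptions, not part of the original answer:

import re
import urllib2
import chardet
from BeautifulSoup import BeautifulSoup

def guess_page_encoding(url):
    fp = urllib2.urlopen(url)
    data = fp.read()

    # 1. charset declared in the document itself via an http-equiv meta tag
    soup = BeautifulSoup(data)
    for meta in soup.findAll('meta', {'http-equiv': lambda v: v and v.lower() == 'content-type'}):
        m = re.search(r'charset=([\w-]+)', meta.get('content', ''), re.I)
        if m:
            return m.group(1)

    # 2. charset parameter of the HTTP Content-Type header
    charset = fp.headers.getparam('charset')
    if charset:
        return charset

    # 3. last resort: content-based detection with the universal encoding detector
    return chardet.detect(data).get('encoding')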
回答by Mikhail Korobov
Scrapy downloads a page and detects a correct encoding for it, unlike requests.get(url).text or urlopen. To do so it tries to follow browser-like rules - this is the best one can do, because website owners have an incentive to make their websites work in a browser. Scrapy needs to take HTTP headers, <meta> tags, BOM marks and differences in encoding names into account.
Scrapy 下载一个页面并检测它的正确编码,这与 requests.get(url).text 或 urlopen 不同。为此,它会尝试遵循类似浏览器的规则,这是能做到的最好办法,因为网站所有者有动力让他们的网站在浏览器中正常运行。Scrapy 需要考虑 HTTP 标头、<meta> 标签、BOM 标记和编码名称的差异。
Content-based guessing (chardet, UnicodeDammit) on its own is not a correct solution, as it may fail; it should only be used as a last resort when headers or <meta> tags or BOM marks are not available or provide no information.
基于内容的猜测(chardet、UnicodeDammit)本身并不是一个正确的解决方案,因为它可能会失败;仅当标头、<meta> 标签或 BOM 标记不可用或不提供任何信息时,才应将其用作最后的手段。
You don't have to use Scrapy to get its encoding detection functions; they are released (along with some other stuff) in a separate library called w3lib: https://github.com/scrapy/w3lib.
您不必使用 Scrapy 来获取其编码检测功能;它们(以及其他一些东西)在一个名为 w3lib 的单独库中发布:https://github.com/scrapy/w3lib。
To get the page encoding and unicode body, use the w3lib.encoding.html_to_unicode function, with a content-based guessing fallback:
要获取页面编码和 unicode 正文,请使用w3lib.encoding.html_to_unicode函数,并带有基于内容的猜测回退:
import chardet
from w3lib.encoding import html_to_unicode

def _guess_encoding(data):
    return chardet.detect(data).get('encoding')

detected_encoding, html_content_unicode = html_to_unicode(
    content_type_header,
    html_content_bytes,
    default_encoding='utf8',
    auto_detect_fun=_guess_encoding,
)
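The content_type_header and html_content_bytes names are left undefined in the answer; they are the raw Content-Type header value and the undecoded response body. One way to obtain them, sketched here with urllib2 (a requests user would use response.headers.get('Content-Type') and response.content instead):

import urllib2

fp = urllib2.urlopen('http://example.com/')
content_type_header = fp.headers.get('Content-Type')   # e.g. 'text/html; charset=utf-8'
html_content_bytes = fp.read()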
回答by AlexCV
BeautifulSoup does this with UnicodeDammit: Unicode, Dammit
BeautifulSoup 使用 UnicodeDammit 来处理这个:Unicode,Dammit
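For reference, a minimal sketch of what that looks like with the old BeautifulSoup 3 package used elsewhere on this page (in bs4 the class is bs4.UnicodeDammit and the attributes are unicode_markup and original_encoding); raw_bytes is a hypothetical variable holding the undecoded response body:

from BeautifulSoup import UnicodeDammit  # BeautifulSoup 3

raw_bytes = open('page.html', 'rb').read()   # hypothetical: undecoded page bytes

dammit = UnicodeDammit(raw_bytes)
print dammit.originalEncoding                # the encoding UnicodeDammit settled on
text = dammit.unicode                        # the decoded document, or None if detection failed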
回答by Ravi
Instead of trying to get a page and then figuring out the charset the browser would use, why not just use a browser to fetch the page and check what charset it uses?
与其尝试获取页面然后找出浏览器将使用的字符集,为什么不直接使用浏览器来获取页面并检查它使用的字符集呢?
from win32com.client import DispatchWithEvents
import threading
import pythoncom  # needed for PumpWaitingMessages below

stopEvent = threading.Event()

class EventHandler(object):
    def OnDownloadBegin(self):
        pass

def waitUntilReady(ie):
    """
    copypasted from
    http://mail.python.org/pipermail/python-win32/2004-June/002040.html
    """
    if ie.ReadyState != 4:
        while 1:
            print "waiting"
            pythoncom.PumpWaitingMessages()
            stopEvent.wait(.2)
            if stopEvent.isSet() or ie.ReadyState == 4:
                stopEvent.clear()
                break

ie = DispatchWithEvents("InternetExplorer.Application", EventHandler)
ie.Visible = 0
ie.Navigate('http://kskky.info')
waitUntilReady(ie)
d = ie.Document
print d.CharSet