How to fetch a non-ascii url with Python urlopen?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, cite the original address, and attribute it to the original authors (not me) on StackOverflow.
Original URL: http://stackoverflow.com/questions/4389572/
Asked by onurmatik
I need to fetch data from a URL with non-ascii characters but urllib2.urlopen refuses to open the resource and raises:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 26: ordinal not in range(128)
I know the URL is not standards compliant but I have no chance to change it.
What is the way to access a resource pointed by a URL containing non-ascii characters using Python?
edit: In other words, can urlopen open a URL like the following, and if so, how:
http://example.org/???-?????/
Accepted answer by bobince
Strictly speaking URIs can't contain non-ASCII characters; what you have there is an IRI.
To convert an IRI to a plain ASCII URI:
- non-ASCII characters in the hostname part of the address have to be encoded using the Punycode-based IDNA algorithm;
- non-ASCII characters in the path, and most of the other parts of the address, have to be encoded using UTF-8 and %-encoding, as per Ignacio's answer.
So:
# Python 2 (urlparse and byte strings)
import re, urlparse

def urlEncodeNonAscii(b):
    # Percent-encode every non-ASCII byte of a UTF-8 byte string.
    return re.sub('[\x80-\xFF]', lambda c: '%%%02x' % ord(c.group(0)), b)

def iriToUri(iri):
    parts = urlparse.urlparse(iri)
    return urlparse.urlunparse(
        # The hostname (index 1) gets IDNA; every other part gets UTF-8 + %-encoding.
        part.encode('idna') if parti == 1 else urlEncodeNonAscii(part.encode('utf-8'))
        for parti, part in enumerate(parts)
    )
>>> iriToUri(u'http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%c4%b1b'
(Technically this still isn't quite good enough in the general case because urlparse doesn't split away any user:pass@ prefix or :port suffix on the hostname. Only the hostname part should be IDNA encoded. It's easier to encode using normal urllib.quote and .encode('idna') at the time you're constructing a URL than to have to pull an IRI apart.)
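For instance, a rough Python 2 sketch of that construct-then-encode approach, using a hypothetical host and path (not part of the original answer):

import urllib

# Hypothetical values; in practice these are whatever parts you assemble the URL from.
host = u'www.a\u0131b.com'.encode('idna')           # IDNA applies to the hostname only
path = urllib.quote(u'/a\u0131b'.encode('utf-8'))   # UTF-8 bytes, then percent-encoding
url = 'http://%s%s' % (host, path)                  # 'http://www.xn--ab-hpa.com/a%C4%B1b'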
Answered by Ignacio Vazquez-Abrams
Encode the unicode to UTF-8, then URL-encode.
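A minimal Python 2 sketch of that idea, assuming the hostname itself is ASCII so only the path needs escaping (the URL is hypothetical):

import urllib
import urllib2

path = u'/unicod\u00e8'                        # non-ASCII path as a unicode string
quoted = urllib.quote(path.encode('utf-8'))    # '/unicod%C3%A8' - plain ASCII now
response = urllib2.urlopen('http://example.com' + quoted)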
Answered by eviltnan
Use the iri2uri method of httplib2. It does the same thing as bobince's answer (is he/she the author of that?)
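A hypothetical usage sketch, assuming httplib2 is installed (the URL is just an example, not from the original answer):

from httplib2.iri2uri import iri2uri

# Should produce an all-ASCII URI along the lines of
# 'http://xn--bcher-kva.ch/unicod%C3%A8'.
print(iri2uri(u'http://b\u00fccher.ch/unicod\u00e8'))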
Answered by darkfeline
Python 3 has libraries to handle this situation. Use urllib.parse.urlsplit to split the URL into its components, urllib.parse.quote to properly quote/escape the unicode characters, and urllib.parse.urlunsplit to join it back together.
>>> import urllib.parse
>>> url = 'http://example.com/unicodè'
>>> url = urllib.parse.urlsplit(url)
>>> url = list(url)
>>> url[2] = urllib.parse.quote(url[2])
>>> url = urllib.parse.urlunsplit(url)
>>> print(url)
http://example.com/unicod%C3%A8
Answered by Perry
In Python 3, use the urllib.parse.quote function on the non-ASCII string:
>>> from urllib.request import urlopen
>>> from urllib.parse import quote
>>> chinese_wikipedia = 'http://zh.wikipedia.org/wiki/Wikipedia:' + quote('首页')
>>> urlopen(chinese_wikipedia)
Answered by h7r
For those not depending strictly on urllib, one practical alternative is requests, which handles IRIs "out of the box".
For example, with http://bücher.ch:
>>> import requests
>>> r = requests.get(u'http://b\u00DCcher.ch')
>>> r.status_code
200
Answered by Mikhail Korobov
It is more complex than the accepted answer by @bobince suggests:
- netloc should be encoded using IDNA;
- non-ascii URL path should be encoded to UTF-8 and then percent-escaped;
- non-ascii query parameters should be encoded to the encoding of the page the URL was extracted from (or to the encoding the server uses), then percent-escaped.
This is how all browsers work; it is specified in https://url.spec.whatwg.org/ - see this example. A Python implementation can be found in w3lib (this is the library Scrapy is using); see w3lib.url.safe_url_string:
from w3lib.url import safe_url_string
url = safe_url_string(u'http://example.org/???-?????/', encoding="<page encoding>")
An easy way to check whether a URL-escaping implementation is incorrect/incomplete is to check whether it provides a 'page encoding' argument or not.
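For illustration only, a rough standard-library sketch of those three steps (this is a simplification, not w3lib's implementation; the function name is made up, the page encoding is assumed, and the netloc is assumed to carry no port or user:pass@ part):

from urllib.parse import urlsplit, urlunsplit, quote

def encode_iri_like_a_browser(iri, page_encoding='utf-8'):
    scheme, netloc, path, query, fragment = urlsplit(iri)
    netloc = netloc.encode('idna').decode('ascii')           # 1. IDNA for the host
    path = quote(path.encode('utf-8'), safe='/%')            # 2. UTF-8, then percent-escape
    query = quote(query.encode(page_encoding), safe='=&%')   # 3. page/server encoding, then percent-escape
    return urlunsplit((scheme, netloc, path, query, fragment))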
Answered by Ukr
Based on @darkfeline's answer:
from urllib.parse import urlsplit, urlunsplit, quote

def iri2uri(iri):
    """
    Convert an IRI to a URI (Python 3).
    """
    uri = ''
    if isinstance(iri, str):
        (scheme, netloc, path, query, fragment) = urlsplit(iri)
        scheme = quote(scheme)
        netloc = netloc.encode('idna').decode('utf-8')
        path = quote(path)
        query = quote(query)
        fragment = quote(fragment)
        uri = urlunsplit((scheme, netloc, path, query, fragment))
    return uri
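A quick usage check, reusing the bücher.ch host from the requests answer above (the expected result follows from the IDNA and quote examples earlier on this page):

>>> iri2uri('http://bücher.ch/unicodè')
'http://xn--bcher-kva.ch/unicod%C3%A8'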
Answered by Giovanni G. PY
It works, finally!
I could not avoid these strange characters, but in the end I got through it.
import urllib.request
import os

url = "http://www.fourtourismblog.it/le-nuove-tendenze-del-marketing-tenere-docchio/"
with urllib.request.urlopen(url) as file:
    html = file.read()
with open("marketingturismo.html", "w", encoding='utf-8') as file:
    file.write(str(html.decode('utf-8')))
os.system("marketingturismo.html")

