Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/3563126/

Time: 2020-08-18 11:44:21  Source: igfitidea

URL encoding/decoding with Python

python url-encoding

Asked by Joey

I am trying to encode, store, and decode arguments in Python, and I am getting lost somewhere along the way. Here are my steps:

1) I use Google toolkit's gtm_stringByEscapingForURLArgument to convert an NSString properly for passing into HTTP arguments.

2) On my server (Python), I store these string arguments as something like u'1234567890-/:;()$&@".,?!\'[]{}#%^*+=_\\|~<>\u20ac\xa3\xa5\u2022.,?!\'' (note that these are the standard keys on an iPhone keypad in the "123" view and the "#+=" view, the \u and \x chars in there being some monetary prefixes like pound, yen, etc.)

3) I call urllib.quote(myString, '') on that stored value, presumably to %-escape it for transport to the client so that the client can unescape it.

The result is that I am getting an exception when I try to log the result of % escaping. Is there some crucial step I am overlooking that needs to be applied to the stored value with the \u and \x format in order to properly convert it for sending over http?

Update: The suggestion marked as the answer below worked for me. I am providing some updates to address the comments below to be complete, though.

The exception I received cited an issue with \u20ac. I don't know whether the problem was with that character specifically or simply that it was the first unicode character in the string.

That \u20ac char is the unicode for the 'euro' symbol. I basically found I'd have issues with it unless I used the urllib2 quote method.

Accepted answer by pycruft

URL-encoding a "raw" unicode object doesn't really make sense. What you need to do is .encode("utf8") first, so that you have a known byte encoding, and then .quote() that.

The output isn't very pretty but it should be a correct uri encoding.

>>> import urllib2  # Python 2 session; quote() is re-exported from urllib
>>> s = u'1234567890-/:;()$&@".,?!\'[]{}#%^*+=_\\|~<>\u20ac\xa3\xa5\u2022.,?!\''
>>> urllib2.quote(s.encode("utf8"))
'1234567890-/%3A%3B%28%29%24%26%40%22.%2C%3F%21%27%5B%5D%7B%7D%23%25%5E%2A%2B%3D_%5C%7C%7E%3C%3E%E2%82%AC%C2%A3%C2%A5%E2%80%A2.%2C%3F%21%27'

Remember that you will need to both unquote() and decode() this to print it out properly if you're debugging or whatever.

>>> print urllib2.unquote(urllib2.quote(s.encode("utf8")))
1234567890-/:;()$&@".,?!'[]{}#%^*+=_\|~<>â‚¬Â£Â¥â€¢.,?!'
>>> # oops, nasty Â means we've got a utf8 byte stream being treated as an ascii stream
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8"))).decode("utf8")
1234567890-/:;()$&@".,?!'[]{}#%^*+=_\|~<>€£¥•.,?!'
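
For readers on Python 3, where quote() and unquote() live in urllib.parse, the same encode-then-quote round trip might look roughly like the sketch below. It assumes Python 3.7 or later and is only an illustration of the steps above, not part of the original answer.

```python
from urllib.parse import quote, unquote

# The sample string from the question: ASCII punctuation plus the
# euro, pound, yen and bullet characters.
s = '1234567890-/:;()$&@".,?!\'[]{}#%^*+=_\\|~<>\u20ac\xa3\xa5\u2022.,?!\''

# Step 1: choose a byte encoding explicitly.
# Step 2: percent-escape those bytes.
encoded = quote(s.encode("utf8"))

# unquote() reverses both steps; it decodes the escaped bytes as
# UTF-8 unless told otherwise.
decoded = unquote(encoded)

print(encoded)
print(decoded == s)
```

The escaped text is pure ASCII, so it can be logged or sent without any further encoding concerns.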

This is, in fact, what the Django functions mentioned in another answer do.

The functions django.utils.http.urlquote() and django.utils.http.urlquote_plus() are versions of Python's standard urllib.quote() and urllib.quote_plus() that work with non-ASCII characters. (The data is converted to UTF-8 prior to encoding.)
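
The behaviour described in that quote (convert to UTF-8, then quote) can be approximated without Django. The sketch below is written for Python 3; urlquote_utf8 and urlquote_plus_utf8 are made-up names for illustration, and this is not Django's actual implementation.

```python
from urllib.parse import quote, quote_plus


def urlquote_utf8(value, safe="/"):
    """Hypothetical helper: encode text as UTF-8, then percent-escape it."""
    if isinstance(value, str):
        value = value.encode("utf-8")
    return quote(value, safe=safe)


def urlquote_plus_utf8(value, safe=""):
    """Hypothetical helper: same idea, but spaces become '+'."""
    if isinstance(value, str):
        value = value.encode("utf-8")
    return quote_plus(value, safe=safe)


print(urlquote_utf8("caf\u00e9 \u20ac"))       # caf%C3%A9%20%E2%82%AC
print(urlquote_plus_utf8("caf\u00e9 \u20ac"))  # caf%C3%A9+%E2%82%AC
```

Both helpers accept either text or bytes, so the caller does not have to remember whether the value has already been encoded.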

If you apply any further quoting or encoding, be careful not to mangle things.
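
One common way to mangle things is to quote a value that has already been quoted. A small Python 3 illustration (an example of the pitfall, not something from the original answer):

```python
from urllib.parse import quote, unquote

once = quote("\u20ac".encode("utf8"))  # the euro sign, escaped once
twice = quote(once)                     # each '%' is itself escaped to '%25'

print(once)   # %E2%82%AC
print(twice)  # %25E2%2582%25AC

# A single unquote() only undoes the outer layer.
print(unquote(twice) == once)           # True
print(unquote(unquote(twice)) == "\u20ac")  # True
```

The receiver has to unquote exactly as many times as the sender quoted, so it is usually simplest to quote once, at the last moment before sending.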

Answer by almir karic

You are out of luck with the stdlib: urllib.quote doesn't work with unicode. If you are using Django, you can use django.utils.http.urlquote, which works properly with unicode.

Answer by flow

i want to second pycruft's remark. web protocols have evolved over decades, and dealing with the various sets of conventions can be cumbersome. now URLs happen to be explicitly not defined for characters, but only for bytes (octets). as a historical coincidence, URLs are one of the places where you can only assume, but not enforce or safely expect an encoding to be present. however, there is a convention to prefer latin-1 and utf-8 over other encodings here. for a while, it looked like 'unicode percent escapes' would be the future, but they never caught on.
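
to make the "bytes, not characters" point concrete, here is a small Python 3 sketch (my own illustration, using urllib.parse) of how the same character produces different escapes depending on the encoding the sender assumed:

```python
from urllib.parse import quote, unquote

pound = "\xa3"  # the pound sign

as_utf8 = quote(pound, encoding="utf-8")      # two octets
as_latin1 = quote(pound, encoding="latin-1")  # one octet

print(as_utf8)    # %C2%A3
print(as_latin1)  # %A3

# The escape itself carries no hint about which encoding was used.
# Decoding the latin-1 form as UTF-8 (the default) yields U+FFFD.
print(unquote(as_latin1) == "\ufffd")                 # True
print(unquote(as_latin1, encoding="latin-1") == pound)  # True
```

both escaped forms are valid URLs; only an agreement between sender and receiver says which character was meant.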

it is of paramount importance to be pedantically picky in this area about the difference between unicode objects and octet strings (str in Python < 3.0; that's, confusingly, str unicode objects and bytes/bytearray objects in Python >= 3.0). unfortunately, in my experience it is, for a number of reasons, pretty difficult to cleanly separate the two concepts in Python 2.x.
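
in Python 3 the separation is easier to see. a short sketch (assuming urllib.parse from the Python 3 standard library) of which functions deal in text and which deal in octets:

```python
from urllib.parse import quote, unquote, unquote_to_bytes

text = "\u20ac"                # str: Unicode text (the euro sign)
octets = text.encode("utf-8")  # bytes: b'\xe2\x82\xac'

# quote() accepts either type; text is encoded as UTF-8 by default,
# so both calls produce the same escapes.
print(quote(text))    # %E2%82%AC
print(quote(octets))  # %E2%82%AC

# Going back, the caller chooses which type to receive.
print(unquote_to_bytes("%E2%82%AC") == octets)  # True
print(unquote("%E2%82%AC") == text)             # True
```

keeping values as bytes until the encoding is known, and only then decoding to text, avoids most of the confusion described above.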

even more OT, when you want to receive third-party HTTP requests, you cannot absolutely rely on URLs being sent as percent-escaped, utf-8-encoded octets: there may be the occasional %uxxxx escape in there, and at least firefox 2.x used to encode URLs as latin-1 where possible, and as utf-8 only where necessary.
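
a defensive decoder for such input could be sketched as below (Python 3). decode_url_component and _decode_octets are hypothetical names, and the "try utf-8, fall back to latin-1" rule is only a heuristic i am assuming here, not a standard.

```python
import re
from urllib.parse import unquote_to_bytes

# Matches the non-standard %uXXXX escape (four hex digits).
_PERCENT_U = re.compile(r"%u([0-9a-fA-F]{4})")


def _decode_octets(escaped):
    """Unescape %XX sequences, preferring UTF-8 and falling back to latin-1."""
    data = unquote_to_bytes(escaped)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this cannot fail.
        return data.decode("latin-1")


def decode_url_component(raw):
    """Hypothetical helper: best-effort decoding of one URL component."""
    parts = []
    pos = 0
    for match in _PERCENT_U.finditer(raw):
        # Ordinary %XX escapes before this %uXXXX escape.
        parts.append(_decode_octets(raw[pos:match.start()]))
        # The %uXXXX escape names a code point directly.
        parts.append(chr(int(match.group(1), 16)))
        pos = match.end()
    parts.append(_decode_octets(raw[pos:]))
    return "".join(parts)


print(decode_url_component("%E2%82%AC") == "\u20ac")  # utf-8 octets
print(decode_url_component("%A3") == "\xa3")          # latin-1 octet
print(decode_url_component("%u20AC") == "\u20ac")     # %uxxxx escape
```

the fallback can guess wrong for byte sequences that happen to be valid in both encodings, so it is best treated as a last resort for input you do not control.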
