Unicode URL decoding
Date: 2020-03-06 14:57:26 Source: igfitidea
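The usual method of URL-encoding a unicode character is to split it into two %HH codes. (\u4161 => %41%61)
But how is unicode distinguished when decoding? How do you know that %41%61 is \u4161 rather than \x41\x61 ("Aa")?
Are 8-bit characters that require encoding preceded by %00?
Or is the point that unicode characters are supposed to be lost/split?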
Solution
According to Wikipedia:
Current standard

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.

Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary data or as character data when being mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do.

Non-standard implementations

There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C. The third edition of ECMA-262 still includes an escape(string) function that uses this syntax, but also an encodeURI(uri) function that converts to UTF-8 and percent-encodes each octet.
So it looks like it is entirely up to the person writing the unencode method... Isn't that fun?
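To make the two forms from the quote concrete, here is a minimal Python sketch. It uses the standard library's urllib.parse.quote for the RFC 3986 style (UTF-8 bytes, then one %HH per byte). The helper unescape_percent_u is a hypothetical name introduced for this example only; it handles the non-standard %uxxxx form for a single four-digit value and makes no attempt to combine surrogate pairs.

```python
import re
from urllib.parse import quote, unquote


def unescape_percent_u(text: str) -> str:
    """Replace each non-standard %uXXXX escape with the character it names.

    Hypothetical helper for illustration; surrogate pairs are not combined.
    """
    return re.sub(
        r"%u([0-9A-Fa-f]{4})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


# Standard form: U+4161 becomes three UTF-8 bytes, each percent-encoded.
standard = quote("\u4161")
print(standard)  # %E4%85%A1

# Non-standard form: one escape that carries the code unit directly.
legacy = unescape_percent_u("%u4161")
print(legacy == unquote(standard))  # True
```

Under the standard form, U+4161 never produces %41%61, so the ambiguity in the question does not arise there.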
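What I've always done is first UTF-8 encode the unicode string, turning it into a series of 8-bit characters, and only then escape those with %HH.
P.S. I can only hope that the non-standard implementations (%uxxxx) are few and far between.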
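The approach described in this answer (UTF-8 encode first, then escape with %HH) could be sketched in Python roughly as follows. The names UNRESERVED and percent_encode are made up for this example; the unreserved set is the one RFC 3986 lists (letters, digits, and - . _ ~). In practice urllib.parse.quote does a similar job; the manual loop is only here to show the two steps.

```python
import string

# Bytes that RFC 3986 allows to appear without escaping.
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + "-._~").encode("ascii")
)


def percent_encode(text: str) -> str:
    """UTF-8 encode the text, then escape every byte outside UNRESERVED."""
    pieces = []
    for byte in text.encode("utf-8"):  # step 1: characters to 8-bit values
        if byte in UNRESERVED:
            pieces.append(chr(byte))
        else:
            pieces.append(f"%{byte:02X}")  # step 2: %HH per remaining byte
    return "".join(pieces)


print(percent_encode("Aa"))      # Aa
print(percent_encode("\u4161"))  # %E4%85%A1
```

Because "Aa" and U+4161 encode to different byte sequences, the decoder never has to guess which one was meant.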
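Since URIs were introduced before unicode existed, or at least before it was widely used, I'd imagine this is a very implementation-specific question. UTF-8 encoding the text and then escaping it as normal sounds like the best idea, since that is completely backwards compatible with any existing ASCII/ANSI systems, though you may get one or two odd characters.
On the other end, to decode, you unescape the text and get a UTF-8 string. If someone on an older system tries to send you data in ASCII/ANSI, no harm is done: it is (almost) already UTF-8 encoded.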
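The decoding side described above might look like this in Python. The function percent_decode and its fallback parameter are assumptions made for this sketch, not part of any standard: pure ASCII input is already valid UTF-8, and the cp1252 fallback is just one guess at what an older "ANSI" sender might have used for bytes above 0x7F, which is where the "(almost)" comes in.

```python
from urllib.parse import unquote_to_bytes


def percent_decode(text: str, fallback: str = "cp1252") -> str:
    """Unescape %HH to raw bytes, then interpret the bytes as UTF-8.

    If the bytes are not valid UTF-8 (for example, legacy 8-bit data),
    fall back to the assumed legacy code page instead of failing.
    """
    raw = unquote_to_bytes(text)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")


print(percent_decode("%41%61"))                 # Aa
print(percent_decode("%E4%85%A1") == "\u4161")  # True
print(percent_decode("%E9"))                    # é (via the fallback)
```

Note that %41%61 always decodes to "Aa" here; U+4161 arrives as %E4%85%A1.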