Unicode URL decoding

Date: 2020-03-06 14:57:26  Source: igfitidea

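The usual way to URL-encode a Unicode character is to split it into two %HH codes (\u4161 => %41%61).

But how is Unicode distinguished when decoding? How do we know whether %41%61 means \u4161 or \x41\x61 ("Aa")?

Are 8-bit characters that need encoding supposed to be prefixed with %00?

Or is the intent that Unicode characters simply get lost/split?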

Solution

According to Wikipedia:

Current standard
  
  The generic URI syntax mandates that new URI schemes
  that provide for the representation of
  character data in a URI must, in
  effect, represent characters from the
  unreserved set without translation,
  and should convert all other
  characters to bytes according to
  UTF-8, and then percent-encode those
  values. This requirement was
  introduced in January 2005 with the
  publication of RFC 3986. URI schemes
  introduced before this date are not
  affected.
  
  Not addressed by the current
  specification is what to do with
  encoded character data. For example,
  in computers, character data manifests
  in encoded form, at some level, and
  thus could be treated as either binary
  data or as character data when being
  mapped to URI characters. Presumably,
  it is up to the URI scheme
  specifications to account for this
  possibility and require one or the
  other, but in practice, few, if any,
  actually do.
  
  Non-standard implementations
  
  There exists a non-standard encoding
  for Unicode characters: %uxxxx, where
  xxxx is a Unicode value represented as
  four hexadecimal digits. This behavior
  is not specified by any RFC and has
  been rejected by the W3C. The third
  edition of ECMA-262 still includes an
  escape(string) function that uses this
  syntax, but also an encodeURI(uri)
  function that converts to UTF-8 and
  percent-encodes each octet.

So it looks like it is entirely up to whoever writes the unencode method... isn't that fun?

What I have always done is UTF-8 encode the Unicode string first, turning it into a sequence of 8-bit bytes, and only then escape those bytes with %HH.
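A minimal Python sketch of that approach, for illustration only. The names `percent_encode` and `UNRESERVED` are made up here, and the set of characters left unescaped is assumed to be the RFC 3986 unreserved set mentioned in the quote above.

```python
# Characters that RFC 3986 calls "unreserved"; they are passed through as-is.
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def percent_encode(text: str) -> str:
    """UTF-8 encode `text`, then escape every byte outside UNRESERVED as %HH."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # Two uppercase hex digits per byte.
            out.append(f"%{byte:02X}")
    return "".join(out)


print(percent_encode("\u4161"))  # %E4%85%A1
print(percent_encode("Aa"))      # Aa
```

With this scheme U+4161 becomes the three bytes E4 85 A1, so it can never collide with %41%61, which only ever stands for "Aa". In real code the standard library's `urllib.parse.quote` covers the same ground.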

P.S. I can only hope that the non-standard implementations (%uxxxx) are few and far between.
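If input in the non-standard %uxxxx form does turn up, it could be handled along these lines. This is a rough sketch under stated assumptions: `replace_u_escapes` is a hypothetical helper, it touches only %uXXXX escapes and leaves ordinary %HH escapes alone, and it assumes the four hex digits are UTF-16 code units (as produced by the ECMAScript `escape` function), so surrogate pairs are recombined.

```python
import re

# Matches the non-standard form: "%u" followed by exactly four hex digits.
_U_ESCAPE = re.compile(r"%u([0-9A-Fa-f]{4})")


def replace_u_escapes(text: str) -> str:
    """Replace each %uXXXX escape with the character it names."""
    replaced = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Characters outside the BMP arrive as two %uXXXX surrogates;
    # a round trip through UTF-16 joins each pair into one code point.
    return replaced.encode("utf-16", "surrogatepass").decode("utf-16")


print(replace_u_escapes("%u4161") == "\u4161")  # True
print(replace_u_escapes("%41%61"))              # %41%61
```

An unpaired surrogate in the input makes the final decode raise `UnicodeDecodeError`; whether to reject or replace such input is left open here.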

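Since URIs were introduced before Unicode existed, or at least before it was widely used, I think this is a very implementation-specific question. UTF-8 encoding the text and then escaping it as usual sounds like the best idea, because it is fully backward compatible with any existing ASCII/ANSI system, although such a system may end up showing an odd character or two.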

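Going the other way, to decode we unescape the text and end up with a UTF-8 string. If someone on an older system tries to send data in ASCII/ANSI format, no harm is done, since that is already (almost) UTF-8 encoded.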
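The decoding direction might look roughly like this in Python. It is a sketch, not a definitive implementation: `percent_decode` is a made-up name, and falling back to `cp1252` for bytes that are not valid UTF-8 is just one assumed reading of "ANSI". `urllib.parse.unquote_to_bytes` is the standard-library call that undoes the %HH escapes.

```python
from urllib.parse import unquote_to_bytes


def percent_decode(text: str, legacy_encoding: str = "cp1252") -> str:
    """Undo %HH escapes, then interpret the resulting bytes as UTF-8.

    If the bytes are not valid UTF-8 (for example data from an older
    system), fall back to `legacy_encoding`, which is an assumption here.
    """
    raw = unquote_to_bytes(text)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding, errors="replace")


print(percent_decode("%41%61"))                    # Aa
print(percent_decode("%E4%85%A1") == "\u4161")     # True
print(percent_decode("caf%E9"))                    # café
```

Pure ASCII input decodes identically either way, which is the backward compatibility described above; only bytes above 0x7F can trigger the fallback.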