What character set should I assume the encoded characters in a URL to be in?

RFC 1738 specifies the syntax of URLs, and mentions that

URLs are written only with the graphic printable characters of the
  US-ASCII coded character set. The octets 80-FF hexadecimal are not
  used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
  control characters; these must be encoded.

However, it does not say which character set these encoded octets represent.
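As a rough illustration of the quoted rule only, the octet ranges it names can be checked as below. The function name `must_be_encoded` is made up for this sketch, and it deliberately ignores the other characters RFC 1738 treats as unsafe or reserved.

```python
def must_be_encoded(octet: int) -> bool:
    """Return True if the quoted RFC 1738 passage says this octet must be encoded.

    Covers only the ranges named in the quote: control characters
    (00-1F and 7F) and octets outside US-ASCII (80-FF).
    """
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0-255")
    return octet <= 0x1F or octet == 0x7F or octet >= 0x80


newline_needs_encoding = must_be_encoded(0x0A)     # control character
high_octet_needs_encoding = must_be_encoded(0xE9)  # outside US-ASCII
letter_needs_encoding = must_be_encoded(ord("a"))  # printable US-ASCII
```

Note that the rule works on octets, not characters, which is exactly why the character set behind an octet such as E9 is left open.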

RFC 2396 seems to try to improve the situation, but:

For original character sequences that
  contain non-ASCII characters, however, the situation is more
  difficult. Internet protocols that transmit octet sequences intended to
  represent character sequences are expected to provide some way of
  identifying the charset used, if there might be more than one
  [RFC2277].  However, there is currently no provision within the
  generic URI syntax to accomplish this identification. An individual URI
  scheme may require a single charset, define a default charset, or
  provide a way to indicate the charset used.
  
  It is expected that a systematic treatment of character encoding within URI will be
  developed as a future modification of this specification.

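Is there any way for a client to determine which character set the encoded octets will be interpreted in, or for a server to determine which character set the client used to encode them?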

It seems to me that most servers default to UTF-8, but that is a choice they make rather than something any specification mandates.
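To make the ambiguity concrete, here is a small sketch using Python's standard `urllib.parse`. The sample string `caf%C3%A9` is an assumption chosen for illustration; the point is that the same octets give different text depending on the charset the receiver assumes.

```python
from urllib.parse import unquote, unquote_to_bytes

# A percent-encoded path segment; nothing in it names a charset.
encoded = "caf%C3%A9"

# The octets themselves are unambiguous.
raw = unquote_to_bytes(encoded)

# The text they stand for depends on what the receiver assumes.
as_utf8 = unquote(encoded, encoding="utf-8")      # one character after "caf"
as_latin1 = unquote(encoded, encoding="latin-1")  # two characters after "caf"

print(raw, as_utf8, as_latin1)
```

Both decodings succeed without error, so a server that guesses wrong has no way to notice from the URL alone.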

Solution

As per the quote, URLs are ASCII. That's it.

URIs, OTOH, allow for a bigger character set; usually UTF-8, as you said yourself.

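One thing to keep in mind is that URLs are a subset of URIs. So the real question is: which of these do we type into the browser? I guess you can write a URI, and the browser should do its best to convert it to a URL (HTTP/1.1 supports this, AFAICR). Each non-ASCII character is represented by the hex codes of its octets, usually those of its UTF-8 encoding.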
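A minimal sketch of that conversion, assuming Python's `urllib.parse.quote`. The helper name `percent_encode` and the sample path are made up for illustration; the second call shows that a different charset yields different octets for the same character.

```python
from urllib.parse import quote


def percent_encode(text: str, charset: str = "utf-8") -> str:
    """Percent-encode non-ASCII and unsafe characters, keeping '/' as-is."""
    return quote(text, safe="/", encoding=charset)


utf8_form = percent_encode("/wiki/caf\u00e9")               # UTF-8 octets C3 A9
latin1_form = percent_encode("/wiki/caf\u00e9", "latin-1")  # Latin-1 octet E9

print(utf8_form)
print(latin1_form)
```

UTF-8 is the default here only because it is the common choice, not because the URL syntax requires it.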

I believe the specification we are looking for is RFC 3987, which describes IRIs: Internationalized Resource Identifiers.
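RFC 3987 defines a mapping from IRIs to URIs. The following is a simplified sketch of that idea rather than a conforming implementation: it assumes the host can be handled by Python's built-in `idna` codec, percent-encodes everything else as UTF-8, and drops any userinfo. The function name `iri_to_uri` is hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched; '%' is kept so existing escapes are not re-encoded.
_SAFE = "/?%:@!$&'()*+,;=~"


def iri_to_uri(iri: str) -> str:
    """Roughly map an IRI to an ASCII-only URI (simplified sketch)."""
    parts = urlsplit(iri)

    # Host: convert internationalized labels to their ASCII form.
    host = parts.hostname or ""
    if host:
        host = host.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # Everything else: percent-encode the UTF-8 octets of non-ASCII characters.
    path = quote(parts.path, safe=_SAFE, encoding="utf-8")
    query = quote(parts.query, safe=_SAFE, encoding="utf-8")
    fragment = quote(parts.fragment, safe=_SAFE, encoding="utf-8")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


uri = iri_to_uri("http://example.com/caf\u00e9?q=\u00fc")
print(uri)
```

A browser's real conversion handles more cases (userinfo, normalization, scheme-specific rules), so treat this only as a picture of the general approach.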