Note: this page is a Chinese-English side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/1233076/

Date: 2020-10-29 15:45:09  Source: igfitidea

Handling Character Encoding in URI on Tomcat

Tags: java, tomcat, encoding, servlets, internationalization

Asked by ZZ Coder

On the web site I am trying to help with, users can type a URL into the browser that contains, for example, the following Chinese characters:

  http://localhost:8080?a=测试

On the server, we get

  GET /a=%E6%B5%8B%E8%AF%95 HTTP/1.1

As you can see, it's UTF-8 encoded, then URL encoded. We can handle this correctly by setting encoding to UTF-8 in Tomcat.

However, sometimes we get Latin1 encoding on certain browsers,

  http://localhost:8080?a=ß

turns into

  GET /a=%DF HTTP/1.1

Is there any way to handle this correctly in Tomcat? It looks like the server has to do some intelligent guessing. We don't expect to handle Latin1 correctly 100% of the time, but anything is better than what we are doing now, which is assuming everything is UTF-8.

The server is Tomcat 5.5. The supported browsers are IE 6+, Firefox 2+ and Safari on iPhone.
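
To make the mismatch in the question concrete, here is a minimal sketch that decodes the two percent-encoded values from the requests above with the standard `java.net.URLDecoder`. The class name `DecodeDemo` and its `decode` helper are made up for illustration and are not part of the original post; the sketch runs outside Tomcat and only shows why a UTF-8 assumption fails for the Latin1 case.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; not from the original question.
class DecodeDemo {
    // Wraps URLDecoder.decode so callers need not handle the checked exception.
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String utf8Value = "%E6%B5%8B%E8%AF%95"; // from the first request
        String latin1Value = "%DF";              // from the second request

        // Decodes to the two characters U+6D4B U+8BD5.
        System.out.println(decode(utf8Value, "UTF-8"));
        // 0xDF alone is not valid UTF-8, so this yields U+FFFD (replacement character).
        System.out.println(decode(latin1Value, "UTF-8"));
        // The same byte read as ISO-8859-1 is U+00DF (sharp s).
        System.out.println(decode(latin1Value, "ISO-8859-1"));
    }
}
```

The middle case is what a server that assumes UTF-8 everywhere ends up with for the Latin1 request: a replacement character rather than the letter the user typed.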

Accepted answer by kdgregory

Unfortunately, UTF-8 encoding is a "should" in the URI specification, which seems to assume that the origin server will generate all URLs in such a way that they will be meaningful to the destination server.

There are a couple of techniques that I would consider; all involve parsing the query string yourself (although you may know better than I whether setting the request encoding affects the query string to parameter mapping or just the body).

First, examine the query string for single "high bytes": in valid UTF-8, every non-ASCII character is encoded as a sequence of two or more bytes, so a lone byte of 0x80 or above cannot be UTF-8 (the Wikipedia entry has a nice table of valid and invalid bytes).
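
One way to sketch this check in Java is to percent-decode the raw value into bytes, try a strict UTF-8 decode, and fall back to ISO-8859-1 only when the bytes are not valid UTF-8. The class and method names below (`QueryValueDecoder`, `percentDecode`, `decode`) are hypothetical, and the sketch assumes the raw query string is ASCII-only, as it normally is on the wire. It is a heuristic, not a definitive implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative, not from the original answer.
class QueryValueDecoder {
    // Turns %XX escapes and '+' into raw bytes without committing to a charset.
    static byte[] percentDecode(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '%' && i + 2 < raw.length()) {
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2;
                    continue;
                }
            }
            if (c == '+') {
                out.write(' ');          // form-encoding convention for spaces
            } else if (c < 0x80) {
                out.write(c);            // plain ASCII, including a malformed '%'
            } else {
                throw new IllegalArgumentException("Expected an ASCII-only raw query string");
            }
        }
        return out.toByteArray();
    }

    // Strict UTF-8 first; ISO-8859-1 only if the bytes are not valid UTF-8.
    static String decode(String raw) {
        byte[] bytes = percentDecode(raw);
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("%E6%B5%8B%E8%AF%95")); // valid UTF-8 path
        System.out.println(decode("%DF"));                // falls back to ISO-8859-1
    }
}
```

The guess can still be wrong: some Latin1 byte pairs, such as 0xC3 0xA9, happen to form valid UTF-8 and would be read as a single character. In a servlet, the raw input for such a helper would come from `HttpServletRequest.getQueryString()` rather than `getParameter()`, since the latter has already been decoded by the container.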

Less reliable would be to look at the "Accept-Charset" header in the request. I don't think this header is required (haven't looked at the HTTP spec to verify), and I know that Firefox, at least, will send a whole list of acceptable values. Picking the first value in the list might work, or it might not.
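
As a rough sketch of the "pick the first value" idea, the helper below takes the raw header text and returns the first listed charset that the JVM supports, skipping the `*` wildcard and ignoring q-values. The class name `AcceptCharsetHint`, the method `firstListed`, and the sample header string are assumptions made for illustration; in a servlet the header would come from `request.getHeader("Accept-Charset")`, which may return null.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the original answer.
class AcceptCharsetHint {
    // Returns the first supported charset named in the header, or the fallback.
    static Charset firstListed(String header, Charset fallback) {
        if (header == null) {
            return fallback;
        }
        for (String part : header.split(",")) {
            // Drop any parameters such as ";q=0.7".
            String name = part.split(";", 2)[0].trim();
            if (name.isEmpty() || name.equals("*")) {
                continue;
            }
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: ignore it and keep looking.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Sample header value, assumed for illustration.
        String header = "ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        System.out.println(firstListed(header, StandardCharsets.UTF_8).name());
    }
}
```

Even when the header is present, it describes what the browser accepts in responses, not how it encoded the URL, so the result is a hint to weigh against the byte-level check rather than an answer.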

Finally, have you done any analysis on the logs, to see if a particular user-agent will consistently use this encoding?
