Java: How to correctly decode unicode parameters passed to a servlet

Question

Asked by Grant Wagner

Suppose I have:

<a href="http://www.yahoo.com/" target="_yahoo" 
    title="Yahoo!&#8482;" onclick="return gateway(this);">Yahoo!</a>
<script type="text/javascript">
function gateway(lnk) {
    window.open(SERVLET +
        '?external_link=' + encodeURIComponent(lnk.href) +
        '&external_target=' + encodeURIComponent(lnk.target) +
        '&external_title=' + encodeURIComponent(lnk.title));
    return false;
}
</script>

I have confirmed external_title gets encoded as Yahoo!%E2%84%A2 and passed to SERVLET. If in SERVLET I do:

Writer writer = response.getWriter();
writer.write(request.getParameter("external_title"));

I get Yahoo!â„¢ in the browser. If I manually switch the browser character encoding to UTF-8, it changes to Yahoo!™ (which is what I want).

So I figured the encoding I was sending to the browser was wrong (it was Content-type: text/html; charset=ISO-8859-1). I changed SERVLET to:

response.setContentType("text/html; charset=utf-8");
Writer writer = response.getWriter();
writer.write(request.getParameter("external_title"));

Now the browser character encoding is UTF-8, but it outputs Yahoo!â¢ and I can't get the browser to render the correct character at all.

My question is: is there some combination of Content-type and/or new String(request.getParameter("external_title").getBytes(), "UTF-8"); and/or something else that will result in Yahoo!™ appearing in the SERVLET output?

Answer 1

Accepted answer by bobince

You are nearly there. encodeURIComponent correctly encodes to UTF-8, which is what you should always use in a URL today.

The problem is that the submitted query string is getting mutilated on the way into your server-side script, because getParameter() uses ISO-8859-1 instead of UTF-8. This stems from Ancient Times before the web settled on UTF-8 for URI/IRI, but it's rather pathetic that the Servlet spec hasn't been updated to match reality, or at least provide a reliable, supported option for it.
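To make the mismatch concrete, here is a minimal sketch that percent-decodes the value from the question twice, once per charset. It only imitates what a container does internally; the class and method names are made up for illustration, and the Charset overload of URLDecoder.decode assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeCharsetDemo {
    // Hypothetical helper: percent-decodes a value, interpreting the
    // resulting bytes in the given charset.
    static String decode(String raw, Charset charset) {
        return URLDecoder.decode(raw, charset); // Charset overload: Java 10+
    }

    public static void main(String[] args) {
        String raw = "Yahoo!%E2%84%A2"; // the encoded value confirmed in the question

        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);
        String asUtf8 = decode(raw, StandardCharsets.UTF_8);

        // ISO-8859-1 turns each of the three bytes into its own char.
        System.out.println(asLatin1.length()); // 9
        // UTF-8 combines the three bytes into the single char U+2122.
        System.out.println(asUtf8.length());   // 7
    }
}
```

The three stray characters in the ISO-8859-1 result are what later shows up as garbage in the browser.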

(There is request.setCharacterEncoding in Servlet 2.3, but it doesn't affect query string parsing, and if a single parameter has been read before, possibly by some other framework element, it won't work at all.)

So you need to futz around with container-specific methods to get proper UTF-8, often involving stuff in server.xml. This totally sucks for distributing web apps that should work anywhere. For Tomcat see http://wiki.apache.org/tomcat/FAQ/CharacterEncoding and also What's the difference between "URIEncoding" of Tomcat, Encoding Filter and request.setCharacterEncoding.

Answer 2

Answered by jacobangel

You could always use javascript to manipulate the text further.

<div id="test">a</div>
<script>
var a = document.getElementById('test');
alert(a.innerHTML);
a.innerHTML = decodeURI("Yahoo!%E2%84%A2");
alert(a.innerHTML);
</script>

Answer 3

Answered by Michael Borgwardt

I suspect that the data mutilation happens in the request, i.e. the declared encoding of the request does not match the one that is actually used for the data.

What does request.getCharacterEncoding() return?

I don't really know how JavaScript handles encodings or how to make it use a specific one.

You need to make sure that encodings are used correctly at all stages. Do NOT try to "fix" the data by using new String() and getBytes() at a point where it has already been encoded incorrectly.

Edit: It may help to have the origin page (the one with the JavaScript) also encoded in UTF-8 and declared as such in its Content-Type. Then I believe JavaScript may default to using UTF-8 for its request, but this is not definite knowledge, just guesswork.

Answer 4

Answered by Grant Wagner

I think I can get the following to work:

encodeURIComponent(escape(lnk.title))

That gives me %25u2122 (for &#8482;) or %25AE (for &#174;), which will decode to %u2122 and %AE respectively in the servlet.

I should then be able to turn %u2122 into '\u2122' and %AE into '\u00AE' relatively easily using (char) (base-10 integer value of %uXXXX or %XX) in a match and replace loop using regular expressions.

i.e. - match /%u([0-9a-f]{4})/i, extract the matching subexpression, convert it to base-10, turn it into a char and append it to the output, then do the same with /%([0-9a-f]{2})/i
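The match-and-replace loop described here could be sketched as follows. This is only one possible reading of the idea: it handles both escape forms in a single pass (rather than two separate passes) so that output from the first pattern is never re-scanned by the second, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnescapeDemo {
    // Matches either %uXXXX (four hex digits) or %XX (two hex digits),
    // the two forms produced by JavaScript's escape().
    private static final Pattern ESCAPE =
            Pattern.compile("%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})");

    static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the literal text between the previous match and this one.
            out.append(s, last, m.start());
            // Exactly one of the two groups is non-null for any match.
            String hex = m.group(1) != null ? m.group(1) : m.group(2);
            // Parse the hex digits and append the corresponding char.
            out.append((char) Integer.parseInt(hex, 16));
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("Yahoo!%u2122").equals("Yahoo!\u2122")); // true
        System.out.println(unescape("%AE").equals("\u00AE"));                 // true
    }
}
```

Treating %XX as a single code point works here because escape() emits that form only for values up to U+00FF; it would not be correct for UTF-8 percent-encoding as produced by encodeURIComponent alone.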

Answer 5

Answered by Modi

I got the same problem and solved it by decoding request.getQueryString() using URLDecoder.decode(), and then extracting my parameters.

// Caveat: decoding before splitting misparses values that contain an encoded '&' (%26).
String[] parameters = URLDecoder.decode(request.getQueryString(), "UTF-8")
                       .split("&");

Answer 6

Answered by Mr_and_Mrs_D

There is a way to do it in Java (no fiddling with server.xml).

These do not work:

protected static final String CHARSET_FOR_URL_ENCODING = "UTF-8";

String uname = request.getParameter("name");
System.out.println(uname);
// ??·?3?????·
uname = request.getQueryString();
System.out.println(uname);
// name=%CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7
uname = URLDecoder.decode(request.getParameter("name"),
        CHARSET_FOR_URL_ENCODING);
System.out.println(uname);
// ??·?3?????· // !!!!!!!!!!!!!!!!!!!!!!!!!!!
uname = URLDecoder.decode(
        "name=%CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7",
        CHARSET_FOR_URL_ENCODING);
System.out.println("query string decoded : " + uname);
// query string decoded : name=τηγρτσςη
uname = URLDecoder.decode(new String(request.getParameter("name")
        .getBytes()), CHARSET_FOR_URL_ENCODING);
System.out.println(uname);
// ??·?3?????· // !!!!!!!!!!!!!!!!!!!!!!!!!!!

~~Works~~:

final String name = URLDecoder
        .decode(new String(request.getParameter("name").getBytes(
                "iso-8859-1")), CHARSET_FOR_URL_ENCODING);
System.out.println(name);
// τηγρτσςη

This worked, but it will break if the default encoding != utf-8. Try this instead (omit the call to decode(); it's not needed):

final String name = new String(request.getParameter("name").getBytes("iso-8859-1"),
        CHARSET_FOR_URL_ENCODING);
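Why this round trip works can be checked without a servlet container. The sketch below simulates a container that decoded UTF-8 bytes as ISO-8859-1 and then applies the same re-encoding; the class and method names are invented for the example, and it assumes the container really did use ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

public class ReencodeDemo {
    // ISO-8859-1 maps every byte to exactly one char and back, so
    // getBytes(ISO_8859_1) recovers the bytes that were on the wire.
    // Those bytes are then decoded as what they really are: UTF-8.
    static String latin1ToUtf8(String misdecoded) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate the container: UTF-8 bytes wrongly read as ISO-8859-1.
        byte[] wire = "Yahoo!\u2122".getBytes(StandardCharsets.UTF_8);
        String fromContainer = new String(wire, StandardCharsets.ISO_8859_1);

        System.out.println(latin1ToUtf8(fromContainer).equals("Yahoo!\u2122")); // true
    }
}
```

Passing explicit charsets on both sides is what removes the dependence on the platform default encoding.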

As I said above, if the server.xml is messed with, as in:

<Connector connectionTimeout="20000" port="8080" protocol="HTTP/1.1"
                     redirectPort="8443"  URIEncoding="UTF-8"/>

(notice the URIEncoding="UTF-8") the code above will break (because the getBytes("iso-8859-1") should read getBytes("UTF-8")). So for a bulletproof solution you have to get the value of the URIEncoding attribute. Unfortunately this seems to be container specific and, even worse, container version specific. For Tomcat 7 you'd need something like:

import javax.management.AttributeNotFoundException;
import javax.management.InstanceNotFoundException;
import javax.management.MBeanException;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.ReflectionException;
import javax.servlet.http.HttpServlet;

import org.apache.catalina.Server;
import org.apache.catalina.Service;
import org.apache.catalina.connector.Connector;

public class Controller extends HttpServlet {

    // ...
    static String CHARSET_FOR_URI_ENCODING; // the `URIEncoding` attribute
    static {
        MBeanServer mBeanServer = MBeanServerFactory.findMBeanServer(null).get(
            0);
        ObjectName name = null;
        try {
            name = new ObjectName("Catalina", "type", "Server");
        } catch (MalformedObjectNameException e1) {
            e1.printStackTrace();
        }
        Server server = null;
        try {
            server = (Server) mBeanServer.getAttribute(name, "managedResource");
        } catch (AttributeNotFoundException | InstanceNotFoundException
                | MBeanException | ReflectionException e) {
            e.printStackTrace();
        }
        Service[] services = server.findServices();
        for (Service service : services) {
            for (Connector connector : service.findConnectors()) {
                System.out.println(connector);
                String uriEncoding = connector.getURIEncoding();
                System.out.println("URIEncoding : " + uriEncoding);
                boolean use = connector.getUseBodyEncodingForURI();
                // TODO : if(use && connector.get uri enc...)
                CHARSET_FOR_URI_ENCODING = uriEncoding;
                // ProtocolHandler protocolHandler = connector
                // .getProtocolHandler();
                // if (protocolHandler instanceof Http11Protocol
                // || protocolHandler instanceof Http11AprProtocol
                // || protocolHandler instanceof Http11NioProtocol) {
                // int serverPort = connector.getPort();
                // System.out.println("HTTP Port: " + connector.getPort());
                // }
            }
        }
    }
}

And you still need to tweak this for multiple connectors (check the commented-out parts). Then you would use something like:

new String(parameter.getBytes(CHARSET_FOR_URI_ENCODING), CHARSET_FOR_URL_ENCODING);

Still, this may fail (IIUC) if parameter = request.getParameter("name"); decoded with CHARSET_FOR_URI_ENCODING was corrupted, so that the bytes I get with getBytes() are not the original ones (that's why "iso-8859-1" is used by default: it will preserve the bytes). You can get rid of it all by manually parsing the query string along the lines of:

URLDecoder.decode(request.getQueryString().split("=")[1],
        CHARSET_FOR_URL_ENCODING);
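The one-liner above only handles a query string with a single parameter. A slightly fuller sketch of manual parsing might look like the following; it takes the raw string that request.getQueryString() would return, splits before decoding so that encoded separators inside values survive, and keeps the last value when a name repeats. The class and method names are hypothetical, and the Charset overload of URLDecoder.decode assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryStringParser {
    // Parses a raw (still percent-encoded) query string into name/value pairs.
    static Map<String, String> parse(String rawQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        // Split on the literal separators first; decode each piece afterwards.
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> p =
                parse("name=%CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7&x=1%262");
        System.out.println(p.get("name")); // the Greek text from the answer
        System.out.println(p.get("x"));    // 1&2
    }
}
```

Note that URLDecoder also turns '+' into a space, which matches form encoding but may not be wanted for every URL.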

I am still looking for the place in the docs where it is mentioned that request.getParameter("name") does call URLDecoder.decode() instead of returning the %CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7 string. A link to the source would be much appreciated.
Also, how can I pass as the parameter's value the string, say, %CE? => see comment: parameter=%25CE

Answer 7

Answered by Ben B

There is a bug in certain versions of Jetty that makes it parse higher-numbered UTF-8 characters incorrectly. If your server accepts Arabic letters correctly but not emoji, that's a sign you have a version with this problem, since Arabic is not in ISO-8859-1 but is in the lower range of UTF-8 characters ("lower" meaning Java will represent it in a single char).

I updated from version 7.2.0.v20101020 to version 7.5.4.v20111024 and this fixed the problem; I can now use the getParameter(String) method instead of having to parse it myself.

If you're really curious, you can dig into your version of org.eclipse.jetty.util.Utf8StringBuilder.append(byte) and see whether it correctly adds multiple chars to the string when the utf-8 code is high enough or if, as in 7.2.0, it simply casts an int to a char and appends.
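The difference between the two behaviours can be illustrated in isolation. The sketch below is not Jetty's code; it only contrasts appending a code point properly with casting it to char, using hypothetical method names.

```java
public class SupplementaryDemo {
    // Correct: a code point above U+FFFF is appended as a surrogate pair,
    // i.e. two chars.
    static String appendCorrectly(int codePoint) {
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    // Lossy: the cast keeps only the low 16 bits of the code point.
    static String appendByCast(int codePoint) {
        return new StringBuilder().append((char) codePoint).toString();
    }

    public static void main(String[] args) {
        int arabicBeh = 0x0628;  // fits in a single char
        int emoji = 0x1F600;     // needs two chars

        // Both approaches agree for the Arabic letter.
        System.out.println(appendCorrectly(arabicBeh).equals(appendByCast(arabicBeh))); // true
        // They disagree for the emoji.
        System.out.println(appendCorrectly(emoji).length()); // 2
        System.out.println(appendByCast(emoji).length());    // 1
    }
}
```

This matches the symptom described: text that fits in a single char survives either way, while emoji are silently turned into a different character by the cast.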

Answer 8

Answered by Aung Aung

Thanks to everyone; I now understand the default character set that Tomcat and Jetty use for encoding and decoding. I solved my problem with this method, using Google Guava:

        String str = URLDecoder.decode(request.getQueryString(), StandardCharsets.UTF_8.name());
        final Map<String, String> map = Splitter.on('&').trimResults().withKeyValueSeparator("=").split(str);
        System.out.println(map);
        System.out.println(map.get("aung"));
        System.out.println(map.get("aa"));
