How to determine if a String contains invalid encoded characters (Java)
Original question: http://stackoverflow.com/questions/887148/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by Daniel Hiller
Usage scenario
We have implemented a web service that our web frontend developers use internally (via a PHP API) to display product data. On the website the user enters something (e.g. a query string), and the website then calls the service via the API.
Note: We use Restlet, not Tomcat.
Original Problem
Firefox 3.0.10 seems to respect the encoding selected in the browser and encodes URLs accordingly. This results in different query strings for ISO-8859-1 and UTF-8.
Our web site forwards the user's input without converting it (which it should do), so it may call the service via the API with a query string that contains German umlauts.
For example, for a query part that looks like
...v=abcädef
if "ISO-8859-1" is selected, the sent query part looks like
...v=abc%E4def
but if "UTF-8" is selected, the sent query part looks like
...v=abc%C3%A4def
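The two query strings above can be reproduced with java.net.URLEncoder. This is only a sketch to illustrate the difference; the class and method names (UmlautEncodingDemo, encode) are made up for the example and are not part of the original question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UmlautEncodingDemo {
    // Percent-encodes the value using the named charset (hypothetical helper).
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            // Both charsets used below are required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "abcädef", written with an escape to avoid source-encoding issues
        String value = "abc\u00E4def";
        System.out.println(encode(value, "ISO-8859-1")); // abc%E4def
        System.out.println(encode(value, "UTF-8"));      // abc%C3%A4def
    }
}
```

The same character becomes one byte (E4) in ISO-8859-1 and two bytes (C3 A4) in UTF-8, which is why the server sees two different query strings.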
Desired Solution
As we control the service (we implemented it ourselves), we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.
Current Solution In Detail
Check each character ( == string.substring(i,i+1) ):
- if character.getBytes()[0] equals 63 (the byte value of '?')
- if Character.getType(character.charAt(0)) returns OTHER_SYMBOL
Code
protected List< String > getNonUnicodeCharacters( String s ) {
    final List< String > result = new ArrayList< String >();
    for ( int i = 0 , n = s.length() ; i < n ; i++ ) {
        final String character = s.substring( i , i + 1 );
        final boolean isOtherSymbol =
            ( int ) Character.OTHER_SYMBOL
                == Character.getType( character.charAt( 0 ) );
        final boolean isNonUnicode = isOtherSymbol
            && character.getBytes()[ 0 ] == ( byte ) 63;
        if ( isNonUnicode )
            result.add( character );
    }
    return result;
}
Question
Will this catch all invalid (non-UTF-8 encoded) characters? Do any of you have a better (easier) solution?
Note: I checked URLDecoder with the following code:
final String[] test = new String[]{
    "v=abc%E4def",
    "v=abc%C3%A4def"
};
for ( int i = 0 , n = test.length ; i < n ; i++ ) {
    System.out.println( java.net.URLDecoder.decode(test[i],"UTF-8") );
    System.out.println( java.net.URLDecoder.decode(test[i],"ISO-8859-1") );
}
This prints:
v=abc?def
v=abcädef
v=abcädef
v=abcÃ¤def
and it does not throw an IllegalArgumentException (sigh).
Accepted answer by ZZ Coder
I asked the same question:
Handling Character Encoding in URI on Tomcat
I recently found a solution and it works pretty well for me. You might want to give it a try. Here is what you need to do:
- Leave your URI encoding as Latin-1. On Tomcat, add URIEncoding="ISO-8859-1" to the Connector in server.xml.
- If you have to URL-decode manually, use Latin-1 as the charset as well.
- Use the fixEncoding() function to fix up encodings.
For example, to get a parameter from the query string:
String name = fixEncoding(request.getParameter("name"));
You can always do this; a string whose encoding is already correct is not changed.
The code is attached. Good luck!
// Requires: import java.io.UnsupportedEncodingException;

public static String fixEncoding(String latin1) {
    try {
        // Recover the raw bytes; Latin-1 maps every byte to exactly one char.
        byte[] bytes = latin1.getBytes("ISO-8859-1");
        if (!validUTF8(bytes))
            return latin1;
        return new String(bytes, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        // Impossible, throw unchecked
        throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
    }
}

public static boolean validUTF8(byte[] input) {
    int i = 0;
    // Check for BOM
    if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
            && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
        i = 3;
    }

    int end;
    for (int j = input.length; i < j; ++i) {
        int octet = input[i];
        if ((octet & 0x80) == 0) {
            continue; // ASCII
        }

        // Check for UTF-8 leading byte
        if ((octet & 0xE0) == 0xC0) {
            end = i + 1;
        } else if ((octet & 0xF0) == 0xE0) {
            end = i + 2;
        } else if ((octet & 0xF8) == 0xF0) {
            end = i + 3;
        } else {
            // Not a valid leading byte (a stray trailing byte, or 0xF8-0xFF)
            return false;
        }

        if (end >= j) {
            // The sequence is truncated at the end of the input
            return false;
        }
        while (i < end) {
            i++;
            octet = input[i];
            if ((octet & 0xC0) != 0x80) {
                // Not a valid trailing byte
                return false;
            }
        }
    }
    return true;
}
EDIT: Your approach doesn't work for various reasons. When there are encoding errors, you can't count on what you get from Tomcat. Sometimes you get � or ?. Other times you get nothing at all: getParameter() returns null. And even if you could check for "?", what happens when your query string contains a valid "?"?
Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, the browser may encode the URL in either UTF-8 or Latin-1. The user has no control over this, so you need to accept both. Changing your servlet to Latin-1 will preserve all the characters, even if they are wrong, which gives us a chance to fix them up or throw them away.
The solution I posted here is not perfect, but it's the best one we have found so far.
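The answer above relies on Latin-1 decoding preserving every byte, so that the original bytes can be recovered and re-read as UTF-8 later. A minimal sketch of that property follows; the class and method names are invented for illustration and do not come from the answer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
    // Decodes all 256 byte values as Latin-1 and encodes them again;
    // returns true if the bytes come back unchanged.
    static boolean latin1IsLossless() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        return Arrays.equals(all, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    // Re-reads a Latin-1-decoded string as UTF-8 (the core step of fixEncoding,
    // without the validity check).
    static String reinterpretAsUtf8(String latin1) {
        byte[] raw = latin1.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(latin1IsLossless());
        // "abcÃ¤def" is what UTF-8 bytes for "abcädef" look like when decoded as Latin-1.
        System.out.println(reinterpretAsUtf8("abc\u00C3\u00A4def").equals("abc\u00E4def"));
    }
}
```

Decoding as UTF-8 first would not allow this, because malformed bytes are replaced during decoding and the original bytes are lost.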
Answer by Brian Agnew
URLDecoder will decode using a given encoding. This should flag errors appropriately. However, the documentation states:
There are two possible ways in which this decoder could deal with illegal strings. It could either leave illegal characters alone or it could throw an IllegalArgumentException. Which approach the decoder takes is left to the implementation.
So you should probably try it. Note also (from the decode() method documentation):
The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.
so there's something else to think about!
EDIT: Apache Commons URLDecode claims to throw appropriate exceptions for bad encodings.
Answer by daniel
You need to set up the character encoding from the start. Try sending the proper Content-Type header, for example Content-Type: text/html; charset=utf-8, to fix the right encoding. The standard conformance refers to utf-8 and utf-16 as the proper encoding for Web Services. Examine your response headers.
Also, on the server side, in case the browser does not properly handle the encoding sent by the server, force the encoding by allocating a new String. You can also check each byte in the encoded utf-8 string by doing a single each_byte & 0x80 and verifying that the result is non-zero.
boolean utfEncoded = true;
byte[] strBytes = queryString.getBytes();
// Note: a set high bit only shows that a byte is non-ASCII;
// it does not by itself prove that the input is valid UTF-8.
for (int i = 0; i < strBytes.length; i++) {
    if ((strBytes[i] & 0x80) != 0) {
        continue;
    } else {
        /* treat the string as non utf encoded */
        utfEncoded = false;
        break;
    }
}

String realQueryString = utfEncoded ?
    queryString : new String(queryString.getBytes(), "iso-8859-1");
Also, take a look at this article; I hope it helps.
Answer by Adrian McCarthy
I've been working on a similar "guess the encoding" problem. The best solution involves knowing the encoding. Barring that, you can make educated guesses to distinguish between UTF-8 and ISO-8859-1.
To answer the general question of how to detect if a string is properly encoded UTF-8, you can verify the following things:
- No byte is 0x00, 0xC0, 0xC1, or in the range 0xF5-0xFF.
- Tail bytes (0x80-0xBF) are always preceded by a head byte 0xC2-0xF4 or another tail byte.
- Head bytes should correctly predict the number of tail bytes (e.g., any byte in 0xC2-0xDF should be followed by exactly one byte in the range 0x80-0xBF).
If a string passes all those tests, then it's interpretable as valid UTF-8. That doesn't guarantee that it is UTF-8, but it's a good predictor.
Legal input in ISO-8859-1 will likely have no control characters (0x00-0x1F and 0x80-0x9F) other than line separators. Looks like 0x7F isn't defined in ISO-8859-1 either.
(I'm basing this off of Wikipedia pages for UTF-8 and ISO-8859-1.)
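The three byte-level checks listed in this answer could be sketched in Java roughly as follows. The class and method names (Utf8Rules, looksLikeUtf8) are hypothetical. The sketch implements only those three rules, so, like the list itself, it does not reject every overlong form or surrogate encoding.

```java
class Utf8Rules {
    // Returns true if the bytes pass the three structural checks from the list.
    static boolean looksLikeUtf8(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int b = bytes[i] & 0xFF;
            int tails;
            if (b == 0x00 || b == 0xC0 || b == 0xC1 || b >= 0xF5) {
                return false;   // rule 1: byte values the answer rules out
            } else if (b < 0x80) {
                tails = 0;      // plain ASCII
            } else if (b < 0xC0) {
                return false;   // rule 2: tail byte without a preceding head byte
            } else if (b < 0xE0) {
                tails = 1;      // head byte 0xC2-0xDF
            } else if (b < 0xF0) {
                tails = 2;      // head byte 0xE0-0xEF
            } else {
                tails = 3;      // head byte 0xF0-0xF4
            }
            // rule 3: the head byte must be followed by exactly 'tails' tail bytes
            for (int k = 1; k <= tails; k++) {
                if (i + k >= bytes.length) {
                    return false; // sequence cut off at the end of the input
                }
                int t = bytes[i + k] & 0xFF;
                if (t < 0x80 || t > 0xBF) {
                    return false;
                }
            }
            i += tails + 1;
        }
        return true;
    }
}
```

For the bytes of the question's two examples, "abc" E4 "def" fails (0xE4 announces two tail bytes, but 'd' follows), while "abc" C3 A4 "def" passes.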
Answer by dimus
The following regular expression might be of interest to you:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/185624
I use it in Ruby as follows:
module Encoding
  UTF8RGX = /\A(
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
  )*\z/x unless defined? UTF8RGX

  def self.utf8_file?(fileName)
    count = 0
    File.open("#{fileName}").each do |l|
      count += 1
      unless utf8_string?(l)
        puts count.to_s + ": " + l
      end
    end
    return true
  end

  def self.utf8_string?(a_string)
    UTF8RGX === a_string
  end
end
Answer by ante
You can use a CharsetDecoder configured to throw an exception if invalid chars are found:
CharsetDecoder UTF8Decoder =
Charset.forName("UTF8").newDecoder().onMalformedInput(CodingErrorAction.REPORT);
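To show how such a decoder might be applied to the question's query strings, here is a minimal sketch. The percentDecode helper and the class name StrictUtf8 are assumptions made for this example; the helper assumes an ASCII-only query string and does not treat '+' as a space.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Hypothetical helper: turns %XX escapes into raw bytes and copies
    // every other (ASCII) character as a single byte.
    static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '%') {
                out.write((byte) c);
                continue;
            }
            if (i + 2 >= s.length()) {
                throw new IllegalArgumentException("Incomplete escape at index " + i);
            }
            int hi = Character.digit(s.charAt(i + 1), 16);
            int lo = Character.digit(s.charAt(i + 2), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("Bad escape at index " + i);
            }
            out.write((hi << 4) | lo);
            i += 2; // skip the two hex digits
        }
        return out.toByteArray();
    }

    // Returns true only if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

With this sketch, isValidUtf8(percentDecode("v=abc%C3%A4def")) is true and isValidUtf8(percentDecode("v=abc%E4def")) is false, which a service could map to a 4xx response or to a Latin-1 fallback.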
Answer by Dennis C
Always try to use UTF-8 as the default everywhere you can (database, memory, and UI).
Using one single charset encoding avoids a lot of problems, and it can actually speed up your web server. A lot of processing power and memory is wasted on encoding and decoding.
Answer by mfx
You might want to include a known parameter in your requests, e.g. "...&encTest=ä", to safely differentiate between the different encodings.
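One way to read this suggestion, as a rough sketch: inspect the raw (still percent-encoded) query string and see how the marker character arrived. The parameter name encTest and the marker "ä" are illustrative choices, and EncodingProbe and guessCharset are hypothetical names.

```java
import java.util.Locale;

class EncodingProbe {
    // Looks at the raw query string and reports which encoding the
    // marker parameter was sent in, or null if the marker is absent.
    static String guessCharset(String rawQuery) {
        // Upper-case so that %e4 and %E4 are treated alike.
        String q = rawQuery.toUpperCase(Locale.ROOT);
        if (q.contains("ENCTEST=%C3%A4")) {
            return "UTF-8";
        }
        if (q.contains("ENCTEST=%E4")) {
            return "ISO-8859-1";
        }
        return null;
    }
}
```

The detected charset can then be used to decode the remaining parameters of the same request.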
Answer by Zhile Zou
Replace all control characters with an empty string:
value = value.replaceAll("\\p{Cntrl}", "");
Answer by luca
This is what I used to check the encoding:
// Requires java.nio.ByteBuffer, java.nio.CharBuffer and java.nio.charset.*
CharsetDecoder ebcdicDecoder = Charset.forName("IBM1047").newDecoder();
ebcdicDecoder.onMalformedInput(CodingErrorAction.REPORT);
ebcdicDecoder.onUnmappableCharacter(CodingErrorAction.REPORT);

CharBuffer out = CharBuffer.wrap(new char[3200]);
CoderResult result = ebcdicDecoder.decode(ByteBuffer.wrap(bytes), out, true);
// UNDERFLOW here means all input was consumed, i.e. success;
// OVERFLOW means the output buffer was too small.
if (result.isError() || result.isOverflow())
{
    System.out.println("Cannot decode EBCDIC");
}
else
{
    CoderResult flushResult = ebcdicDecoder.flush(out);
    if (flushResult.isOverflow())
        System.out.println("Cannot decode EBCDIC");
    if (flushResult.isUnderflow())
        System.out.println("EBCDIC decoded successfully");
}
Edit: updated with Vouze's suggestion