Java: How to determine if a String contains invalid encoded characters

Disclaimer: This page is a translation of a popular StackOverflow question and answer thread, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not the translator). Original: http://stackoverflow.com/questions/887148/

How to determine if a String contains invalid encoded characters

Tags: java, string, unicode, encoding

Asked by Daniel Hiller

Usage scenario

We have implemented a webservice that our web frontend developers use (via a php api) internally to display product data. On the website the user enters something (i.e. a query string). Internally the web site makes a call to the service via the api.

Note: We use Restlet, not Tomcat.

Original Problem

Firefox 3.0.10 seems to respect the selected encoding in the browser and encode a url according to the selected encoding. This does result in different query strings for ISO-8859-1 and UTF-8.

Our web site forwards the user's input without converting it (which it should), so it may end up calling the webservice through the api with a query string that contains German umlauts.

I.e. for a query part looking like

    ...v=abcädef

if "ISO-8859-1" is selected, the sent query part looks like

...v=abc%E4def

but if "UTF-8" is selected, the sent query part looks like

...v=abc%C3%A4def
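
For illustration, a minimal sketch of how the same input produces those two query strings, assuming the character in question is the umlaut 'ä' (0xE4 in ISO-8859-1, 0xC3 0xA4 in UTF-8):

import java.net.URLEncoder;

public class UmlautEncodingDemo {
    public static void main(String[] args) throws Exception {
        String value = "abc\u00E4def"; // "abcädef"
        // ISO-8859-1 encodes 'ä' as the single byte 0xE4
        System.out.println("v=" + URLEncoder.encode(value, "ISO-8859-1")); // v=abc%E4def
        // UTF-8 encodes 'ä' as the two bytes 0xC3 0xA4
        System.out.println("v=" + URLEncoder.encode(value, "UTF-8"));      // v=abc%C3%A4def
    }
}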

Desired Solution

Since we control the service (we implemented it ourselves), we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.

Current Solution In Detail

Check for each character ( == string.substring(i,i+1) )

  1. if character.getBytes()[0] equals 63 for '?'
  2. if Character.getType(character.charAt(0)) returns OTHER_SYMBOL

Code

protected List< String > getNonUnicodeCharacters( String s ) {
  final List< String > result = new ArrayList< String >();
  for ( int i = 0 , n = s.length() ; i < n ; i++ ) {
    final String character = s.substring( i , i + 1 );
    final boolean isOtherSymbol = 
      ( int ) Character.OTHER_SYMBOL
       == Character.getType( character.charAt( 0 ) );
    final boolean isNonUnicode = isOtherSymbol 
      && character.getBytes()[ 0 ] == ( byte ) 63;
    if ( isNonUnicode )
      result.add( character );
  }
  return result;
}

Question

Will this catch all invalid (non-UTF-8 encoded) characters? Do any of you have a better (easier) solution?

Note: I checked URLDecoder with the following code:

final String[] test = new String[]{
  "v=abc%E4def",
  "v=abc%C3%A4def"
};
for ( int i = 0 , n = test.length ; i < n ; i++ ) {
    System.out.println( java.net.URLDecoder.decode(test[i],"UTF-8") );
    System.out.println( java.net.URLDecoder.decode(test[i],"ISO-8859-1") );
}

This prints:

v=abc?def
v=abcädef
v=abcädef
v=abcÃ¤def

and it does not throw an IllegalArgumentException, sigh.
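
One rough check, assuming legitimate input never contains the replacement character itself: when decoding as UTF-8, URLDecoder substitutes U+FFFD for undecodable bytes (which is what prints as '?' above), so its presence in the result signals a query string that was not valid UTF-8.

String decoded = java.net.URLDecoder.decode("v=abc%E4def", "UTF-8");
if (decoded.indexOf('\uFFFD') >= 0) {
    // the percent-encoded bytes were not valid UTF-8 -- e.g. answer with a 4xx status
}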

Accepted answer by ZZ Coder

I asked the same question,

Handling Character Encoding in URI on Tomcat

I recently found a solution and it works pretty well for me. You might want to give it a try. Here is what you need to do:

  1. Leave your URI encoding as Latin-1. On Tomcat, add URIEncoding="ISO-8859-1" to the Connector in server.xml.
  2. If you have to manually URL decode, use Latin1 as charset also.
  3. Use the fixEncoding() function to fix up encodings.

For example, to get a parameter from the query string:

  String name = fixEncoding(request.getParameter("name"));

You can always do this; a string that already has the correct encoding is not changed.

The code is attached. Good luck!

 public static String fixEncoding(String latin1) {
  try {
   byte[] bytes = latin1.getBytes("ISO-8859-1");
   if (!validUTF8(bytes))
    return latin1;   
   return new String(bytes, "UTF-8");  
  } catch (UnsupportedEncodingException e) {
   // Impossible, throw unchecked
   throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
  }

 }

 public static boolean validUTF8(byte[] input) {
  int i = 0;
  // Check for BOM
  if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
    && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
   i = 3;
  }

  int end;
  for (int j = input.length; i < j; ++i) {
   int octet = input[i];
   if ((octet & 0x80) == 0) {
    continue; // ASCII
   }

   // Check for UTF-8 leading byte
   if ((octet & 0xE0) == 0xC0) {
    end = i + 1;
   } else if ((octet & 0xF0) == 0xE0) {
    end = i + 2;
   } else if ((octet & 0xF8) == 0xF0) {
    end = i + 3;
   } else {
    // Not a valid UTF-8 leading byte (this also rejects the obsolete 5- and 6-byte forms)
    return false;
   }

   // Guard against a multi-byte sequence truncated at the end of the input
   if (end >= input.length) {
    return false;
   }

   while (i < end) {
    i++;
    octet = input[i];
    if ((octet & 0xC0) != 0x80) {
     // Not a valid trailing byte
     return false;
    }
   }
  }
  return true;
 }
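
A quick sanity check of the two methods, assuming the common case where UTF-8 bytes were mistakenly decoded as Latin-1:

 public static void main(String[] args) throws Exception {
  // "abcädef" sent as UTF-8 but read by the container as Latin-1 arrives as "abcÃ¤def"
  String misdecoded = new String("abc\u00E4def".getBytes("UTF-8"), "ISO-8859-1");
  System.out.println(fixEncoding(misdecoded)); // abcädef
  System.out.println(fixEncoding("abcdef"));   // unchanged: abcdef

  System.out.println(validUTF8("abc\u00E4def".getBytes("UTF-8")));      // true
  System.out.println(validUTF8("abc\u00E4def".getBytes("ISO-8859-1"))); // false: 0xE4 is not followed by valid trail bytes
 }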

EDIT: Your approach doesn't work, for various reasons. When there are encoding errors, you can't count on what you get from Tomcat. Sometimes you get ? or ?. Other times you don't get anything at all and getParameter() returns null. And even if you could check for "?", what happens when your query string contains a legitimate "?"?

Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, the browser may encode the URL in either UTF-8 or Latin-1, and the user has no control over that. You need to accept both. Switching your servlet to Latin-1 preserves all the characters, even if they are wrong, which gives us a chance to fix them up or throw them away.

The solution I posted here is not perfect but it's the best one we found so far.

Answered by Brian Agnew

URLDecoder will decode to a given encoding. This should flag errors appropriately. However, the documentation states:

There are two possible ways in which this decoder could deal with illegal strings. It could either leave illegal characters alone or it could throw an IllegalArgumentException. Which approach the decoder takes is left to the implementation.

So you should probably try it. Note also (from the decode() method documentation):

The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.

so there's something else to think about!

EDIT: Apache Commons URLDecode claims to throw appropriate exceptions for bad encodings.
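
A sketch using commons-codec, assuming it is on the classpath; note that URLCodec's DecoderException covers malformed percent-escapes such as "%G1", while bytes that are invalid in the chosen charset are still left to the normal String conversion:

import java.io.UnsupportedEncodingException;
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.net.URLCodec;

// ...
try {
    String decoded = new URLCodec().decode("v=abc%E4def", "UTF-8");
    // work with the decoded value
} catch (DecoderException e) {
    // malformed escape sequence -- reject the request with a 4xx status
} catch (UnsupportedEncodingException e) {
    // "UTF-8" is always available, so this should not happen
}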

Answered by daniel

You need to set up the character encoding from the start. Try sending the proper Content-Type header, for example Content-Type: text/html; charset=utf-8, to establish the right encoding. The standards refer to UTF-8 and UTF-16 as the proper encodings for web services. Examine your response headers.
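
A minimal servlet-style sketch of that idea (the question uses Restlet, so the exact call differs, but the principle is the same):

// response is a javax.servlet.http.HttpServletResponse
response.setContentType("text/html; charset=utf-8");
// or, to set only the encoding of the response body:
response.setCharacterEncoding("UTF-8");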

Also, on the server side, in case the browser does not properly handle the encoding sent by the server, force the encoding by allocating a new String. You can also check each byte of the encoded UTF-8 string with a single each_byte & 0x80, verifying that the result is non-zero.


boolean utfEncoded = true;
byte[] strBytes = queryString.getBytes();
for (int i = 0; i < strBytes.length; i++) {
    if ((strBytes[i] & 0x80) != 0) {
        continue;
    } else {
        /* treat the string as non utf encoded */
        utfEncoded = false;
        break;
    }
}

String realQueryString = utfEncoded ?
    queryString : new String(queryString.getBytes(), "iso-8859-1");

Also, take a look at this article; I hope it helps you.

Answered by Adrian McCarthy

I've been working on a similar "guess the encoding" problem. The best solution involves knowing the encoding. Barring that, you can make educated guesses to distinguish between UTF-8 and ISO-8859-1.

To answer the general question of how to detect if a string is properly encoded UTF-8, you can verify the following things:

  1. No byte is 0x00, 0xC0, 0xC1, or in the range 0xF5-0xFF.
  2. Tail bytes (0x80-0xBF) are always preceded by a head byte 0xC2-0xF4 or another tail byte.
  3. Head bytes should correctly predict the number of tail bytes (e.g., any byte in 0xC2-0xDF should be followed by exactly one byte in the range 0x80-0xBF).

If a string passes all those tests, then it's interpretable as valid UTF-8. That doesn't guarantee that it is UTF-8, but it's a good predictor.
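
A minimal sketch of exactly those three checks (it validates structure only and does not reject overlongs or surrogates beyond the rules above):

public static boolean looksLikeUtf8(byte[] b) {
    int i = 0;
    while (i < b.length) {
        int head = b[i] & 0xFF;
        // Rule 1: these byte values never appear in valid UTF-8
        if (head == 0x00 || head == 0xC0 || head == 0xC1 || head >= 0xF5) return false;
        // Rule 2: a tail byte (0x80-0xBF) must not appear in head position
        if (head >= 0x80 && head <= 0xBF) return false;
        // Rule 3: the head byte determines how many tail bytes must follow
        int tails = 0;
        if (head >= 0xC2 && head <= 0xDF) tails = 1;
        else if (head >= 0xE0 && head <= 0xEF) tails = 2;
        else if (head >= 0xF0 && head <= 0xF4) tails = 3;
        if (i + tails >= b.length) return false; // sequence truncated
        for (int k = 1; k <= tails; k++) {
            int tail = b[i + k] & 0xFF;
            if (tail < 0x80 || tail > 0xBF) return false;
        }
        i += tails + 1;
    }
    return true;
}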

Legal input in ISO-8859-1 will likely have no control characters (0x00-0x1F and 0x80-0x9F) other than line separators. Looks like 0x7F isn't defined in ISO-8859-1 either.

(I'm basing this off of Wikipedia pages for UTF-8 and ISO-8859-1.)

Answered by dimus

The following regular expression might be of interest to you:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/185624

I use it in Ruby as follows:

module Encoding
    UTF8RGX = /\A(
        [\x09\x0A\x0D\x20-\x7E]            # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
      |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
      |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*\z/x unless defined? UTF8RGX

    def self.utf8_file?(fileName)
      count = 0
      File.open("#{fileName}").each do |l|
        count += 1
        unless utf8_string?(l)
          puts count.to_s + ": " + l
        end
      end
      return true
    end

    def self.utf8_string?(a_string)
      UTF8RGX === a_string
    end

end

Answered by ante

You can use a CharsetDecoder configured to throw an exception if invalid chars are found:

 CharsetDecoder UTF8Decoder =
      Charset.forName("UTF8").newDecoder().onMalformedInput(CodingErrorAction.REPORT);

See CodingErrorAction.REPORT

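A usage sketch, assuming the raw query string was read as Latin-1 (as in the accepted answer) so the original bytes can be recovered; with REPORT configured, the convenience decode() call throws a CharacterCodingException on the first bad byte, which can then be mapped to a 4xx response:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// ...
byte[] raw = queryString.getBytes(StandardCharsets.ISO_8859_1); // back to the bytes on the wire
try {
    // Note: CharsetDecoder instances are not thread-safe, so don't share UTF8Decoder across requests
    String utf8 = UTF8Decoder.decode(ByteBuffer.wrap(raw)).toString();
    // the query string is well-formed UTF-8
} catch (CharacterCodingException e) {
    // malformed UTF-8 -- respond with a 4xx status
}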

Answered by Dennis C

Try to use UTF-8 as the default everywhere you can touch (database, memory, and UI).

Using one single charset encoding avoids a lot of problems, and it can actually speed up your web server: a lot of processing power and memory is wasted on encoding/decoding.

Answered by mfx

You might want to include a known parameter in your requests, e.g. "...&encTest=?", to safely differentiate between the different encodings.

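A hypothetical sketch of that trick, assuming the marker is sent as encTest=ä and the container decodes the query string as Latin-1 (as in the accepted answer):

// Look at what the known marker parameter decoded to (servlet-style access assumed)
String marker = request.getParameter("encTest");
boolean urlWasUtf8   = "\u00C3\u00A4".equals(marker); // "Ã¤": the UTF-8 bytes 0xC3 0xA4 read as Latin-1
boolean urlWasLatin1 = "\u00E4".equals(marker);       // "ä": the single Latin-1 byte 0xE4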

Answered by Zhile Zou

Replace all control characters with an empty string:

value = value.replaceAll("\\p{Cntrl}", "");

Answered by luca

This is what I used to check the encoding:

CharsetDecoder ebcdicDecoder = Charset.forName("IBM1047").newDecoder();
ebcdicDecoder.onMalformedInput(CodingErrorAction.REPORT);
ebcdicDecoder.onUnmappableCharacter(CodingErrorAction.REPORT);

CharBuffer out = CharBuffer.wrap(new char[3200]);
CoderResult result = ebcdicDecoder.decode(ByteBuffer.wrap(bytes), out, true);
// Note: decode(..., true) returns UNDERFLOW on success, so underflow must not be treated as a failure
if (result.isError() || result.isOverflow() ||
    result.isMalformed() || result.isUnmappable())
{
    System.out.println("Cannot decode EBCDIC");
}
else
{
    result = ebcdicDecoder.flush(out);
    if (result.isOverflow())
       System.out.println("Cannot decode EBCDIC");
    if (result.isUnderflow())
        System.out.println("EBCDIC decoded successfully");
}

Edit: updated with Vouze's suggestion.