Note: this page reproduces a popular Stack Overflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/13116685/

Regexp to check if code contains non-UTF-8 characters?

Tags: java, regex, utf-8, sonarqube

Asked by user1340582

I am using PMD, Checkstyle, FindBugs, etc. in Sonar. I would like a rule that verifies that Java code contains no characters that are not valid UTF-8.

E.g. the character � (U+FFFD) should not be allowed.

I could not find a rule for this in the above plugins, but I guess a custom rule can be made in Sonar.

Answered by kshepherd

Here is a regular expression that matches only valid UTF-8 byte sequences:

/^([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|[\xEE-\xEF][\x80-\xBF]{2}|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$/

I derived it from RFC 3629, "UTF-8, a transformation format of ISO 10646", section 4, "Syntax of UTF-8 Byte Sequences".
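Since the question is about Java, here is a minimal sketch of how the pattern above might be applied there. Java regexes match chars rather than bytes, so this sketch assumes the raw bytes are first decoded as ISO-8859-1, which maps every byte 0x00-0xFF to the char with the same value. The class name `Utf8RegexCheck` and the method `isValidUtf8` are made up for illustration; this is not a ready-made Sonar rule.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PMD, Checkstyle, FindBugs or Sonar.
class Utf8RegexCheck {
    // Same alternatives as the expression above, with Java string escaping.
    // matches() already anchors at both ends, so ^ and $ are omitted.
    private static final Pattern VALID_UTF8 = Pattern.compile(
        "(?:[\\x00-\\x7F]"
        + "|[\\xC2-\\xDF][\\x80-\\xBF]"
        + "|\\xE0[\\xA0-\\xBF][\\x80-\\xBF]"
        + "|[\\xE1-\\xEC][\\x80-\\xBF]{2}"
        + "|\\xED[\\x80-\\x9F][\\x80-\\xBF]"
        + "|[\\xEE-\\xEF][\\x80-\\xBF]{2}"
        + "|\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}"
        + "|[\\xF1-\\xF3][\\x80-\\xBF]{3}"
        + "|\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2})*");

    static boolean isValidUtf8(byte[] bytes) {
        // One char per byte, so the byte ranges in the pattern apply directly.
        String oneCharPerByte = new String(bytes, StandardCharsets.ISO_8859_1);
        return VALID_UTF8.matcher(oneCharPerByte).matches();
    }

    public static void main(String[] args) {
        byte[] replacementChar = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD}; // U+FFFD encoded as UTF-8
        byte[] strayByte = {(byte) 0xFF};                                 // 0xFF never occurs in UTF-8
        System.out.println(isValidUtf8(replacementChar)); // true
        System.out.println(isValidUtf8(strayByte));       // false
    }
}
```

One caveat: Java's backtracking regex engine can recurse deeply on patterns of the form `(a|b)*` when the input is long, so for large files it may be safer to check the content in pieces, for example line by line.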

Factoring out the common trailing bytes gives a slightly shorter expression:

/^([\x00-\x7F]|([\xC2-\xDF]|\xE0[\xA0-\xBF]|\xED[\x80-\x9F]|([\xE1-\xEC]|[\xEE-\xEF]|\xF0[\x90-\xBF]|\xF4[\x80-\x8F]|[\xF1-\xF3][\x80-\xBF])[\x80-\xBF])[\x80-\xBF])*$/

This simple Perl script demonstrates usage:

#!/usr/bin/perl
use strict;
use warnings;

# Matches only strings made up entirely of well-formed UTF-8 byte sequences.
my $valid_utf8 = qr/^([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|[\xEE-\xEF][\x80-\xBF]{2}|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$/;

# U+FFFD written as its three UTF-8 bytes EF BF BD.
my $passstring = "This string \xEF\xBF\xBD == U+FFFD is valid UTF-8";
# U+FFFD as a single character above 0xFF, which is not a byte sequence at all.
my $failstring = "This string \x{FFFD} == U+FFFD is not valid UTF-8";

print 'Passstring ', ($passstring =~ $valid_utf8 ? 'passed' : 'did not pass'), "\n";
print 'Failstring ', ($failstring =~ $valid_utf8 ? 'passed' : 'did not pass'), "\n";
exit;

It produces the following output:

Passstring passed
Failstring did not pass
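
For the Java setting in the question, a comparable check can be sketched without a regex by asking the JDK's own UTF-8 decoder to report malformed input instead of silently replacing it. This is a different technique from the regular expression in the answer, shown only as an alternative; the class name `StrictUtf8Decode` and the method `isValidUtf8` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any Sonar plugin.
class StrictUtf8Decode {
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] pass = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD}; // U+FFFD encoded as UTF-8
        byte[] fail = {(byte) 0xFF};                           // not a legal UTF-8 byte
        System.out.println("pass: " + isValidUtf8(pass)); // pass: true
        System.out.println("fail: " + isValidUtf8(fail)); // fail: false
    }
}
```

As with the Perl pass string, a file that literally contains U+FFFD encoded as EF BF BD is valid UTF-8 at the byte level, so a byte-level check alone will not flag it.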