Java: Unicode Replacement with ASCII

Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/24215063/

Time: 2020-08-14 10:51:24  Source: igfitidea

Unicode Replacement with ASCII

java string encoding character-encoding

Asked by saurav

I created a text file on a Windows system, where I think the default encoding is ANSI. The contents of the file look like this:

This is\u2019 a sample text file \u2014and it can ....

I saved this file using the default Windows encoding, although other encodings such as UTF-8 and UTF-16 were also available.

Now I want to write a simple Java function that takes an input string and replaces all of the Unicode characters with the corresponding ASCII values.

e.g. \u2019 should be replaced with "'", \u2014 should be replaced with "-", and so on.

Observation: When I create a string literal like this

  String s = "This is\u2019 a sample text file \u2014and it can ....";

My code works fine, but when I read the text from the file it does not work. I am aware that a Java String uses UTF-16 encoding.

Below is the code that I am using to read the input file.

FileReader fileReader = new FileReader(new File("C:\\input.txt"));
BufferedReader bufferedReader = new BufferedReader(fileReader);
String record = bufferedReader.readLine();

I also tried using an InputStream and setting the Charset to UTF-8, but got the same result.

Replacement code :

public static String removeUTFCharacters(String data) {
    for (Entry<String, String> entry : utfChars.entrySet()) {
        data = data.replaceAll(entry.getKey(), entry.getValue());
    }
    return data;
}

Map :

    utfChars.put("\u2019","'");
    utfChars.put("\u2018","'");
    utfChars.put("\u201c","\"");
    utfChars.put("\u201d","\"");
    utfChars.put("\u2013","-");
    utfChars.put("\u2014","-");
    utfChars.put("\u2212","-");
    utfChars.put("\u2022","*");

Can anybody help me understand the concept behind this problem and how to solve it?

Accepted answer by erickson

Match the escape sequence \uXXXX with a regular expression. Then use a replacement loop to replace each occurrence of that escape sequence with the decoded value of the character.

Because Java string literals use \ to introduce escapes, the sequence \\ is used to represent \. Also, the Java regex syntax treats the sequence \u specially (to represent a Unicode escape). So the \ has to be escaped again, with an additional \\. So, in the pattern, "\\\\u" really means, "match \u in the input."
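
The two layers of escaping can be checked with a small sketch. The class and method names here (EscapeLevels, containsLiteralBackslashU) are made up for illustration and are not part of the answer; the point is only that a string read from the file holds the six literal characters of the escape, while a string literal in source code has already been converted by the compiler.

```java
import java.util.regex.Pattern;

public class EscapeLevels {
    // Returns true if the text contains a literal backslash followed by the letter u.
    static boolean containsLiteralBackslashU(String text) {
        // Four backslashes in source become two in the regex, which match one in the input.
        return Pattern.compile("\\\\u").matcher(text).find();
    }

    public static void main(String[] args) {
        // Doubled backslash: the string holds the escape as six literal characters,
        // which is what a line read from the file looks like.
        String fromFile = "This is\\u2019 a sample";
        // Single backslash: the compiler has already turned the escape into one character.
        String literal = "This is\u2019 a sample";

        System.out.println(containsLiteralBackslashU(fromFile)); // true
        System.out.println(containsLiteralBackslashU(literal));  // false
    }
}
```

This is also why the asker's map lookup works for a literal but not for file input: the file never contains the single character, only the text that spells out its escape.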

To match the numeric portion, four hexadecimal characters, use the pattern \p{XDigit}, escaping the \ with an extra \. We want to easily extract the hex number as a group, so it is enclosed in parentheses to create a capturing group. Thus, "(\\p{XDigit}{4})" in the pattern means, "match 4 hexadecimal characters in the input, and capture them."
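
As a rough illustration of the capturing group, the sketch below collects the hex digits captured from each escape. HexGroupDemo and hexGroups are hypothetical names chosen for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HexGroupDemo {
    // A literal backslash and u, followed by exactly four captured hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u(\\p{XDigit}{4})");

    // Collects the four hex digits captured from every escape found in the text.
    static List<String> hexGroups(String text) {
        List<String> groups = new ArrayList<>();
        Matcher m = ESCAPE.matcher(text);
        while (m.find()) {
            // group(1) is the parenthesized part only, without the backslash and u.
            groups.add(m.group(1));
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(hexGroups("This is\\u2019 a sample text file \\u2014and it can ..."));
        // prints: [2019, 2014]
    }
}
```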

In a loop, we search for occurrences of the pattern, replacing each occurrence with the decoded character value. The character value is decoded by parsing the hexadecimal number. Integer.parseInt(m.group(1), 16) means, "parse the group captured in the previous match as a base-16 number." Then a replacement string is created with that character. The replacement string must be escaped, or quoted, in case it is $, which has special meaning in replacement text.

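import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Doubled backslashes: the string holds the escapes as literal text, like a line read from the file.
String data = "This is\\u2019 a sample text file \\u2014and it can ...";
// Matches a literal backslash and u, then captures the four hex digits that follow.
Pattern p = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher m = p.matcher(data);
StringBuffer buf = new StringBuffer(data.length());
while (m.find()) {
  // Decode the captured hex digits into the character they name.
  String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
  m.appendReplacement(buf, Matcher.quoteReplacement(ch));
}
m.appendTail(buf);
System.out.println(buf);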
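
To get all the way to the ASCII output the question asks for, the decoding loop can be followed by the asker's character map. The following is only a sketch under that assumption: AsciiCleaner, decodeEscapes, and toAscii are hypothetical names, the map is a subset of the one in the question, and String.replace is used in place of replaceAll because the keys are plain characters rather than regular expressions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AsciiCleaner {
    private static final Pattern ESCAPE = Pattern.compile("\\\\u(\\p{XDigit}{4})");

    // Subset of the map from the question: typographic characters and ASCII stand-ins.
    private static final Map<String, String> ASCII = new LinkedHashMap<>();
    static {
        ASCII.put("\u2019", "'");
        ASCII.put("\u2018", "'");
        ASCII.put("\u201c", "\"");
        ASCII.put("\u201d", "\"");
        ASCII.put("\u2013", "-");
        ASCII.put("\u2014", "-");
    }

    // Step 1: turn literal escape text into the characters it names.
    static String decodeEscapes(String data) {
        Matcher m = ESCAPE.matcher(data);
        StringBuffer buf = new StringBuffer(data.length());
        while (m.find()) {
            String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
            // quoteReplacement keeps a decoded $ or backslash from being treated specially.
            m.appendReplacement(buf, Matcher.quoteReplacement(ch));
        }
        m.appendTail(buf);
        return buf.toString();
    }

    // Step 2: swap the decoded characters for their ASCII stand-ins.
    static String toAscii(String data) {
        String result = decodeEscapes(data);
        for (Map.Entry<String, String> e : ASCII.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toAscii("This is\\u2019 a sample text file \\u2014and it can ..."));
        // prints: This is' a sample text file -and it can ...
    }
}
```

Characters that are not in the map pass through unchanged after decoding, so the map would need to be extended for any other characters the input may contain.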

Answer by Stanislas Klukowski

If you can use another library, you can use Apache Commons Text: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html

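// Requires: import org.apache.commons.text.StringEscapeUtils;
// Doubled backslash so the string holds the escape as literal text, as file input would.
String dirtyString = "Colocaci\\u00F3n";
String cleanString = StringEscapeUtils.unescapeJava(dirtyString);
// cleanString = "Colocación"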