Convert Unicode to ASCII without changing the string length (in Java)

Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow address: http://stackoverflow.com/questions/2096667/

Tags: java, string, unicode, ascii

Asked by Zardoz

What is the best way to convert a string from Unicode to ASCII without changing its length (that is very important in my case)? Also, the characters that convert without problems must stay at the same positions as in the original string. So an "Ä" must be converted to "A" and not to something cryptic that has more characters.

Edit:
@novalis - Such symbols (for example, those of Asian languages) should just be converted to some placeholder. I am not too interested in those words or what they mean.

@MtnViewMark - I must preserve the total number of characters and the positions of the characters that are available in ASCII under all circumstances.

Here is some more info: I have some text mining tools that can only process ASCII strings. Most of the text to be processed is in English, but some of it contains non-ASCII characters. I am not interested in those words, but I must be sure that the words I am interested in (those that contain only ASCII characters) are at the same positions after the string conversion.

Accepted answer by Denis Tulskiy

As stated in this answer, the following code should work:

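    // requires: import java.text.Normalizer;
    String s = "口水雞 hello Ä";

    // Decompose accented characters into a base letter plus combining marks.
    String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
    // Backslashes must be doubled inside a Java string literal.
    String regex = "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+";

    // Encoding to ASCII replaces every unmappable character with '?'.
    String s2 = new String(s1.replaceAll(regex, "").getBytes("ascii"), "ascii");

    System.out.println(s2);
    System.out.println(s.length() == s2.length());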

Output is

    ??? hello A
    true

So you first remove the diacritical marks, then convert to ASCII. Non-ASCII characters will become question marks.
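
The whole-string approach above keeps the length for input like the example, but NFKD can also expand one character into several ASCII characters (a ligature such as U+FB01 becomes "fi"), and a surrogate pair may collapse into a single question mark. As a minimal sketch, assuming that "length" means `String.length()` (UTF-16 code units) and that `?` is an acceptable placeholder, the conversion can instead be done one `char` at a time so that every input position yields exactly one output character. The class name `AsciiFold` and the method `toAsciiSameLength` are made up for this illustration.

```java
import java.text.Normalizer;

final class AsciiFold {
    // Placeholder for anything without a single-character ASCII form (an assumption).
    private static final char PLACEHOLDER = '?';

    // Returns an ASCII-only string with the same length() as the input.
    static String toAsciiSameLength(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 128) {
                out.append(c); // already ASCII: keep it at the same position
                continue;
            }
            if (Character.isSurrogate(c)) {
                out.append(PLACEHOLDER); // one placeholder per UTF-16 code unit
                continue;
            }
            // Decompose this single char, then drop everything that is not ASCII.
            String decomposed = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKD);
            String ascii = decomposed.replaceAll("[^\\x00-\\x7F]", "");
            // Use the result only if exactly one ASCII char is left.
            out.append(ascii.length() == 1 ? ascii.charAt(0) : PLACEHOLDER);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two CJK characters, " hello ", and U+00C4 (A with diaeresis).
        String s = "\u4E2D\u6587 hello \u00C4";
        String converted = toAsciiSameLength(s);
        System.out.println(converted);                          // ?? hello A
        System.out.println(s.length() == converted.length());   // true
    }
}
```

With this sketch a ligature or a fraction character becomes a single placeholder rather than two or three letters, which trades some readability for the guarantee that positions never shift.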

Answer by Ignacio Vazquez-Abrams

Use java.text.Normalizer.normalize() with Normalizer.Form.NFD, then filter out the non-ASCII characters.
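
A minimal sketch of that suggestion might look like the following. The class name `NfdFilter` and the method `stripToAscii` are made up for this illustration, and "filter out" is read here as deleting every character outside the ASCII range after decomposition.

```java
import java.text.Normalizer;

final class NfdFilter {
    // Decomposes with NFD, then deletes every character that is not ASCII.
    static String stripToAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // "deja vu" written with U+00E9 and U+00E0
        System.out.println(stripToAscii("d\u00E9j\u00E0 vu")); // deja vu
    }
}
```

Note that deleting works for accented Latin letters, whose base letter survives, but a character with no ASCII base (a CJK character, for example) disappears entirely, so the string gets shorter in that case.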

Answer by Pekka

Caveat: I don't know Java, just a bit about character sets.

You are not stating which character set you are using exactly.

But no matter which you use, it's impossible to convert a Unicode string to ASCII and retain the original length and character positions, simply because a Unicode character set will use multiple bytes for some characters (obviously).

The only exception I know of would be a UTF-8 string that contains only ASCII characters: This string will already be identical in both UTF-8 and ASCII, because UTF-8 uses multibyte characters only when necessary. (I don't know about the other Unicode flavours, there may be other dynamic ones).
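
To make the byte-versus-character distinction concrete, here is a small sketch that compares encoded byte counts with Java's `String.length()`, which counts UTF-16 code units rather than bytes. The class name `EncodingLengths` and its helper methods are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EncodingLengths {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the US-ASCII and UTF-8 encodings of the string are byte-for-byte equal.
    static boolean sameBytesInAsciiAndUtf8(String s) {
        return Arrays.equals(
                s.getBytes(StandardCharsets.US_ASCII),
                s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String city = "G\u00F6teborg"; // o with diaeresis as the second letter
        System.out.println(city.length());                    // 8
        System.out.println(utf8ByteCount(city));              // 9
        System.out.println(sameBytesInAsciiAndUtf8("hello")); // true
        System.out.println(sameBytesInAsciiAndUtf8(city));    // false
    }
}
```

So the growth described above shows up in the encoded bytes, while the count reported by `length()` is independent of the encoding used later.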

The only workaround I can see is adding a space to any special character that was replaced by an ASCII one, but that will screw up the string (Göteborg in UTF-8 would have to become Go teborg to keep the length).

Maybe you could elaborate on what you want or need to achieve, so people here can suggest workarounds.

Answer by Paul Taylor

One issue with Normalizer is that before Java 1.6 it is in the sun.text package, whereas in 1.6 it is in the java.text package and its method signature has changed. So if your application needs to run on both platforms, you'll have to use reflection.
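
As a rough sketch of the reflection idea, the code below looks up `java.text.Normalizer` by name at run time, so the class still compiles and loads on a JVM that lacks it. The class name `ReflectiveNormalizer` and the method `decompose` are made up for this illustration. The pre-1.6 `sun.text.Normalizer` branch is deliberately left out, since its signature is not given here; the sketch simply returns the input unchanged when the lookup fails.

```java
import java.lang.reflect.Method;

final class ReflectiveNormalizer {
    // Applies NFD via java.text.Normalizer if it can be found; otherwise returns the input.
    static String decompose(String input) {
        try {
            Class<?> normalizer = Class.forName("java.text.Normalizer");
            Class<?> formClass = Class.forName("java.text.Normalizer$Form");

            // Find the NFD constant without referring to the enum type at compile time.
            Object nfd = null;
            for (Object constant : formClass.getEnumConstants()) {
                if ("NFD".equals(((Enum<?>) constant).name())) {
                    nfd = constant;
                }
            }

            // public static String normalize(CharSequence src, Normalizer.Form form)
            Method normalize = normalizer.getMethod("normalize", CharSequence.class, formClass);
            return (String) normalize.invoke(null, input, nfd);
        } catch (Exception e) {
            // Class or method not available (or the call failed): fall back to the input.
            // A real pre-1.6 fallback would call sun.text.Normalizer here instead.
            return input;
        }
    }

    public static void main(String[] args) {
        String decomposed = decompose("\u00C4"); // A with diaeresis
        System.out.println(decomposed.length()); // 2 when java.text.Normalizer is present
    }
}
```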

An alternative custom solution is described as technique 3 here

Answer by sporak

As Paul Taylor mentioned, there is an issue with using Normalizer if you need the project to compile and run on pre-1.6 Java as well as on 1.6 and higher. You will run into trouble because Normalizer is in different packages (java.text.Normalizer for 1.6 instead of sun.text.Normalizer for pre-1.6) and has a different method signature.

Usually it is recommended to use reflection to invoke the appropriate Normalizer.normalize() method. (An example can be found here.)
But if you don't want to put reflection mess in your code, you can use the icu4j library. It contains the com.ibm.icu.text.Normalizer class with a normalize() method that performs the same job as java.text.Normalizer/sun.text.Normalizer. The ICU library has (or should have) its own implementation of Normalizer, so you can ship your project with the library and it should be independent of the Java version.
The disadvantage is that the ICU library is quite big.

If you are using the Normalizer class just to remove accents/diacritics from strings, there is another way. You can use the Apache Commons Lang library (version 3), which contains StringUtils with the method stripAccents():

    String noAccentsString = org.apache.commons.lang3.StringUtils.stripAccents(s);

The Lang3 library probably uses reflection to invoke the appropriate Normalizer for the running Java version. So the advantage is that you don't have the reflection mess in your own code.
