How do I detect Unicode characters in a Java string?

Question

Asked by Geo

Suppose I have a string that contains ü. How would I find all those unicode characters? Should I test for their code? How would I do that?

For example, given the string "AüXü", I'd like to transform it to "AYXY". I'd like to do the same for other unicode characters, and I would hate to have to store them in a translation map of some sort.

Answer 1

Accepted answer by BalusC

The definition of "unicode characters" is vague, but I'll take it to mean characters not covered by the standard ISO 8859 charset. If this is true in your case, then loop through all characters in the String and test each one's code point to determine whether it is within the given character set.
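
The charset test described above could be sketched with CharsetEncoder.canEncode. This is only an illustration under assumptions: ISO-8859-1 is picked here because the answer does not say which ISO 8859 part is meant, and the class and method names (Latin1Check, charsOutsideLatin1) are hypothetical. Note that ü is itself part of ISO-8859-1, so this particular check would not flag it.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class Latin1Check {
    // Returns the characters of the input that ISO-8859-1 cannot encode.
    // ISO-8859-1 is an assumed choice; swap in another charset as needed.
    static String charsOutsideLatin1(String input) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder outside = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!encoder.canEncode(c)) {
                outside.append(c);
            }
        }
        return outside.toString();
    }

    public static void main(String[] args) {
        // \u00FC is ü (inside ISO-8859-1), \u20AC is the euro sign (outside it).
        System.out.println(charsOutsideLatin1("A\u00FC\u20ACX"));
    }
}
```

Unicode escapes are used in the literal so the result does not depend on the source file's encoding.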

Alternatively, use a Map<Character, Character> and replace the characters in the string that match the map's keys. For example:

import java.util.HashMap;
import java.util.Map;

// Maps each character to be replaced onto its replacement.
Map<Character, Character> charReplacementMap = new HashMap<>();
charReplacementMap.put('ü', 'Y');
// Put more here.

String originalString = "AüAü";
StringBuilder builder = new StringBuilder();

for (char currentChar : originalString.toCharArray()) {
    // Keep the original character when the map has no entry for it.
    Character replacementChar = charReplacementMap.get(currentChar);
    builder.append(replacementChar != null ? replacementChar : currentChar);
}

String newString = builder.toString();

Or, do you mean "all characters with diacritics"? If so, then use java.text.Normalizer to remove diacritical marks:

import java.text.Normalizer;
import java.text.Normalizer.Form;

/**
 * Remove any diacritical marks (accents like ç, ñ, é, etc) from
 * the given string (so that it returns plain c, n, e, etc).
 * @param string The string to remove diacritical marks from.
 * @return The string with removed diacritical marks, if any.
 */
public static String removeDiacriticalMarks(String string) {
    // NFD splits each accented character into a base letter plus combining marks.
    return Normalizer.normalize(string, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

One pitfall: ü would become u, not Y. Not sure if that's what you're after. If you want to replace characters by how they are pronounced, you'll really need to create a mapping. Sure, it's tedious work, but it takes less time than you needed to follow this topic.

Answer 2

Answered by msp

You could go the other way round and ask if the character is an ascii character.

public static boolean isAscii(char ch) {
    return ch < 128;
}

You'd have to analyse the string char by char then of course.

(The method is from commons-lang CharUtils, which contains loads of useful Character methods.)

Answer 3

Answered by Dominic Rodger

I'm not sure from your example what you're trying to do. If you're just trying to replace all non-ASCII values with Y, then you could loop through the string looking for code points outside the range 0 to 127, and replace those code points with Y.
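
A minimal sketch of that loop, assuming Java 8 or later for String.codePoints(). The class and method names (NonAsciiReplacer, replaceNonAscii) are made up for illustration. Iterating over code points rather than chars means a character outside the Basic Multilingual Plane is replaced by a single Y instead of two.

```java
class NonAsciiReplacer {
    // Replaces every code point outside 0..127 with the given replacement char.
    static String replaceNonAscii(String input, char replacement) {
        StringBuilder result = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (cp <= 127) {
                result.append((char) cp);   // plain ASCII, keep as-is
            } else {
                result.append(replacement); // anything else becomes the replacement
            }
        });
        return result.toString();
    }

    public static void main(String[] args) {
        // \u00FC is ü; the literal is the question's example string.
        System.out.println(replaceNonAscii("A\u00FCX\u00FC", 'Y')); // prints AYXY
    }
}
```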

Answer 4

Answered by jitter

You could loop through your string and, for every character c, run this check:

if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
    // replace with Y
}
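
The same block test can also be used purely for detection, for instance to see which Unicode blocks the non-Basic-Latin characters of a string come from. The sketch below is one hedged way to do that. The class and method names (BlockScanner, blocksOutsideBasicLatin) are hypothetical and not part of any library.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class BlockScanner {
    // Collects the Unicode blocks, other than BASIC_LATIN, that occur in the input.
    static Set<Character.UnicodeBlock> blocksOutsideBasicLatin(String input) {
        Set<Character.UnicodeBlock> blocks = new LinkedHashSet<>();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            // of() returns null for code points that belong to no defined block.
            if (block != null && block != Character.UnicodeBlock.BASIC_LATIN) {
                blocks.add(block);
            }
            i += Character.charCount(cp); // advance past surrogate pairs correctly
        }
        return blocks;
    }

    public static void main(String[] args) {
        // \u00FC is ü, \u03A9 is the Greek capital omega.
        System.out.println(blocksOutsideBasicLatin("A\u00FCX\u03A9"));
    }
}
```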

Answer 5

Answered by McDowell

It isn't clear to me exactly what is gained by transforming "AüXü" to "AYXY". Is this because ü is pronounced like Y in a particular language? What language? And what other rules might apply?

In terms of terminology...

"a"

The above is a Unicode string. It contains a single UTF-16 encoded character.

If you wish to limit the range of characters to the English alphabet, have a look at the Normalization performed in this answer.

Answer 6

Answered by Bhanu PS Kushwah

The class Character also offers some interesting methods. Take a look at it.

Character.UnicodeBlock.of('a') == Character.UnicodeBlock.BASIC_LATIN; //true

Character.UnicodeBlock.of('ü') == Character.UnicodeBlock.BASIC_LATIN; //false
