Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/285228/

Date: 2020-08-11 12:31:23  Source: igfitidea

How to convert UTF-8 to US-Ascii in Java

Tags: java, utf-8, ascii

Asked by Ulf Lindback

We have a system where customers, mainly European, enter text (in UTF-8) that has to be distributed to different systems. Most of them accept UTF-8, but now we must also distribute the text to a US system that only accepts 7-bit US-ASCII.

So now we need to translate all European characters to the nearest US-ASCII equivalent. Are there any Java libraries to help with this task?

Right now we've just started adding to a translation table, where Å (Swedish AA) -> A and so on. Where we don't find a match for an entered character, we log it, replace it with a question mark, and try to fix that in the next release. This seems very inefficient, and somebody else must have done something similar before.

Accepted answer by Jouni K. Seppänen

The uni2ascii program is written in C, but you could probably convert it to Java with little effort. It contains a large table of approximations (implicitly, in the switch-case statements).

Be aware that there are no universally accepted approximations: Germans want you to replace Ä with AE, while Finns and Swedes prefer just A. Your example of Å isn't obvious either: Swedes would probably just drop the ring and use A, but Danes and Norwegians might prefer the historically more correct AA.
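
One way to accommodate such differences is a default table with per-language overrides. The sketch below is only an illustration of that idea: the class name LocaleFoldSketch, the method fold, and the tiny tables are all hypothetical, and a real table would need many more entries.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: default approximations plus per-language overrides.
final class LocaleFoldSketch {
    private static final Map<Character, String> DEFAULTS = new HashMap<>();
    private static final Map<String, Map<Character, String>> OVERRIDES = new HashMap<>();

    static {
        DEFAULTS.put('\u00c4', "A"); // A with diaeresis: Finnish/Swedish preference
        DEFAULTS.put('\u00c5', "A"); // A with ring: Swedish preference

        Map<Character, String> german = new HashMap<>();
        german.put('\u00c4', "AE");  // German preference
        OVERRIDES.put("de", german);

        Map<Character, String> danish = new HashMap<>();
        danish.put('\u00c5', "AA");  // Danish preference
        OVERRIDES.put("da", danish);
    }

    static String fold(String s, Locale locale) {
        Map<Character, String> local = OVERRIDES.getOrDefault(
                locale.getLanguage(), Collections.<Character, String>emptyMap());
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 128) {          // already 7-bit ASCII
                sb.append(ch);
                continue;
            }
            String replacement = local.get(ch);
            if (replacement == null) {
                replacement = DEFAULTS.get(ch);
            }
            // Unknown characters become a question mark, as in the question.
            sb.append(replacement != null ? replacement : "?");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00c4", Locale.GERMAN));
        System.out.println(fold("\u00c5", Locale.forLanguageTag("da")));
        System.out.println(fold("\u00c5", Locale.forLanguageTag("sv")));
    }
}
```

Whether the language should come from the customer, the text, or the receiving system is a design decision this sketch leaves open.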

Answer by sblundy

There are some built-in functions to do this. The main class involved is CharsetEncoder, which is part of the nio package. A simpler way is String.getBytes(Charset), whose result can be sent to a ByteArrayOutputStream.
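
As a rough illustration of both calls, here is a minimal sketch; the class and method names are made up for the example. Note that neither call approximates characters: CharsetEncoder.canEncode only reports whether the text is already pure ASCII, and String.getBytes(Charset) substitutes the charset's replacement byte (a question mark for US-ASCII) for anything it cannot map.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class AsciiEncodeSketch {

    // True if every character of s can be represented in US-ASCII.
    static boolean isPureAscii(String s) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        return encoder.canEncode(s);
    }

    // Lossy conversion: unmappable characters are replaced, not approximated.
    static String toAsciiLossy(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(isPureAscii("Malmo"));        // true
        System.out.println(toAsciiLossy("Malm\u00f6"));  // Malm?
    }
}
```

A check like isPureAscii could be used to decide when to log an input, as the question describes, before falling back to a replacement.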

Answer by CesarB

Instead of creating your own table, you could convert the text to normalization form D (NFD), where characters are represented as a base character plus diacritics (for instance, "á" becomes "a" followed by a combining acute accent). You can then strip everything that is not an ASCII letter.

The tables still exist, but are now the ones from the Unicode standard.

You could also try NFKD instead of NFD, to catch even more cases.
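
The difference between the two forms can be sketched with java.text.Normalizer. This is an illustration, not a complete solution: the class and method names are hypothetical, and the sketch strips every non-ASCII character rather than only non-letters, so ASCII digits and punctuation survive.

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
final class NormalizeSketch {

    // Decompose with the given form, then drop whatever is still non-ASCII.
    static String stripToAscii(String s, Normalizer.Form form) {
        String decomposed = Normalizer.normalize(s, form);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // e with acute accent decomposes to "e" plus a combining mark
        System.out.println(stripToAscii("caf\u00e9", Normalizer.Form.NFD));   // cafe
        // the "fi" ligature only decomposes under the compatibility form
        System.out.println(stripToAscii("\ufb01n", Normalizer.Form.NFD));     // n
        System.out.println(stripToAscii("\ufb01n", Normalizer.Form.NFKD));    // fin
    }
}
```

Characters with no decomposition at all, such as "ø" or "ß", are simply dropped by this approach, so a small hand-written table may still be needed for them.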

Answer by Joe Liversedge

This is typically useful in search applications. See the corresponding Lucene ISOLatin1AccentFilter implementation. This isn't really designed for plugging into a random local implementation, but it does the trick.

Answer by Rob

This is what seems to work:

private static String utftoasci(String s) {
  final StringBuilder sb = new StringBuilder(s.length() * 2);

  for (int i = 0; i < s.length(); i++) {
    final char ch = s.charAt(i);

    if (ch < 128) {
      // Already 7-bit ASCII: copy through unchanged.
      sb.append(ch);
      continue;
    }

    // Hand-maintained approximation table; add further characters as needed.
    switch (ch) {
      case 'È': case 'É': case 'Ê': sb.append('E'); break;
      case 'è': case 'é':           sb.append('e'); break;
      case 'á':                     sb.append('a'); break;
      case 'Ó':                     sb.append('O'); break;
      case 'Ü':                     sb.append('U'); break;
      case 'ü':                     sb.append('u'); break;
      case 'Ç':                     sb.append('C'); break;
      case 'Ñ':                     sb.append('N'); break;
      case 'ß':                     sb.append("ss"); break;
      default:
        // No known approximation: substitute a question mark.
        sb.append('?');
    }
  }
  return sb.toString();
}

Answer by Simon Lieschke

You can do this with the following (from the NFD example in this Core Java Technology Tech Tip):

public static String decompose(String s) {
    return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

Answer by code_monk

This is what I use:

<?php
function remove_accent($str)  {
#   http://www.php.net/manual/en/function.preg-replace.php#96586
$a = array('À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Ĉ', 'ĉ', 'Ċ', 'ċ', 'Č', 'č', 'Ď', 'ď', 'Đ', 'đ', 'Ē', 'ē', 'Ĕ', 'ĕ', 'Ė', 'ė', 'Ę', 'ę', 'Ě', 'ě', 'Ĝ', 'ĝ', 'Ğ', 'ğ', 'Ġ', 'ġ', 'Ģ', 'ģ', 'Ĥ', 'ĥ', 'Ħ', 'ħ', 'Ĩ', 'ĩ', 'Ī', 'ī', 'Ĭ', 'ĭ', 'Į', 'į', 'İ', 'ı', 'Ĳ', 'ĳ', 'Ĵ', 'ĵ', 'Ķ', 'ķ', 'Ĺ', 'ĺ', 'Ļ', 'ļ', 'Ľ', 'ľ', 'Ŀ', 'ŀ', 'Ł', 'ł', 'Ń', 'ń', 'Ņ', 'ņ', 'Ň', 'ň', 'ŉ', 'Ō', 'ō', 'Ŏ', 'ŏ', 'Ő', 'ő', 'Œ', 'œ', 'Ŕ', 'ŕ', 'Ŗ', 'ŗ', 'Ř', 'ř', 'Ś', 'ś', 'Ŝ', 'ŝ', 'Ş', 'ş', 'Š', 'š', 'Ţ', 'ţ', 'Ť', 'ť', 'Ŧ', 'ŧ', 'Ũ', 'ũ', 'Ū', 'ū', 'Ŭ', 'ŭ', 'Ů', 'ů', 'Ű', 'ű', 'Ų', 'ų', 'Ŵ', 'ŵ', 'Ŷ', 'ŷ', 'Ÿ', 'Ź', 'ź', 'Ż', 'ż', 'Ž', 'ž', 'ſ', 'ƒ', 'Ơ', 'ơ', 'Ư', 'ư', 'Ǎ', 'ǎ', 'Ǐ', 'ǐ', 'Ǒ', 'ǒ', 'Ǔ', 'ǔ', 'Ǖ', 'ǖ', 'Ǘ', 'ǘ', 'Ǚ', 'ǚ', 'Ǜ', 'ǜ', 'Ǻ', 'ǻ', 'Ǽ', 'ǽ', 'Ǿ', 'ǿ');
$b = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'l', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o'); 
return str_replace($a, $b, $str); 
}

function SEOify($i){
#   http://php.ca/manual/en/function.preg-replace.php#90316
$o          = $i;
$o          = html_entity_decode($o,ENT_COMPAT,'UTF-8');
$o          = remove_accent(trim($o)); 
$patterns   = array( "([ ])" , "([^a-zA-Z0-9_-])", "(-{2,})" ); # spaces, disallowed characters, repeated hyphens
$replacers  = array("-", "", "-"); 
$o          = preg_replace($patterns, $replacers, $o);
return $o;
}
?>

Answer by Terra Caines

new String("?".getBytes("US-ASCII"))

Answer by Matt Storer

In response to the answer given by Joe Liversedge, the referenced Lucene ISOLatin1AccentFilter no longer exists:

It has been replaced by org.apache.lucene.analysis.ASCIIFoldingFilter:

This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. Characters from the following Unicode blocks are converted; however, only those characters with reasonable ASCII alternatives are converted.

FYI -
