Java: replace Unicode punctuation with ASCII approximations

Question

Asked by schmmd

I am reading some text files in a Java program and would like to replace some Unicode characters with ASCII approximations. These files will eventually be broken into sentences that are fed to OpenNLP. OpenNLP does not recognize Unicode characters and gives improper results on a number of symbols (it tokenizes "girl's" as "girl" and "'s", but if the apostrophe is a Unicode quote, the word is treated as a single token).

For example, the source sentence may contain the Unicode directional quotation mark U+2018 (‘), and I would like to convert that to U+0027 ('). Eventually I will be stripping the remaining Unicode.

I understand that I am losing information, and I know that I could write regular expressions to convert each of these symbols, but I am asking if there is code I can reuse to convert some of these symbols.

This is what I could come up with, but I'm sure I will make mistakes, miss things, etc.:

    // double quotation (")
    replacements.add(new Replacement(Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));

    // single quotation (')
    replacements.add(new Replacement(Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));

replacements is a list of instances of a custom Replacement class, which I later iterate over to apply each substitution.

    for (Replacement replacement : replacements) {
        text = replacement.pattern.matcher(text).replaceAll(replacement.replacement);
    }
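
For readers who want to try this approach, here is a minimal, self-contained sketch. The Replacement class is not shown in the question, so the version below (a holder with pattern and replacement fields) and the QuoteReplacements wrapper are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteReplacements {
    /** Hypothetical holder pairing a compiled pattern with its ASCII substitute. */
    static final class Replacement {
        final Pattern pattern;
        final String replacement;

        Replacement(Pattern pattern, String replacement) {
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    private static final List<Replacement> REPLACEMENTS = new ArrayList<>();
    static {
        // double quotation (")
        REPLACEMENTS.add(new Replacement(
                Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));
        // single quotation (')
        REPLACEMENTS.add(new Replacement(
                Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));
    }

    /** Applies every replacement in order and returns the cleaned text. */
    static String apply(String text) {
        for (Replacement r : REPLACEMENTS) {
            // quoteReplacement keeps the substitute literal even if it
            // ever contains a dollar sign or a backslash
            text = r.pattern.matcher(text)
                    .replaceAll(Matcher.quoteReplacement(r.replacement));
        }
        return text;
    }
}
```

Calling QuoteReplacements.apply on a string with curly quotes returns the same string with straight ASCII quotes.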

As you can see, I had to find:

LEFT SINGLE QUOTATION MARK
RIGHT SINGLE QUOTATION MARK
SINGLE LOW-9 QUOTATION MARK (what is this/should I replace this?)
SINGLE HIGH-REVERSED-9 QUOTATION MARK (what is this/should I replace this?)

Answer 1

Accepted answer by Michael Konietzka

Each Unicode character is assigned a category. There exist two separate categories for quotes:

Punctuation, Initial quote (Pi)
Punctuation, Final quote (Pf)

With these lists, you should be able to handle all quotes appropriately, if you would like to code the regex manually.

Java's Character.getType gives you the category of a character, for example FINAL_QUOTE_PUNCTUATION.

Now you can get the category of each (punctuation) character and replace it with an appropriate substitute in ASCII.

You can use the other punctuation categories accordingly. In 'Punctuation, Other' there are some characters, for example PRIME ′, which you may also want to substitute with an apostrophe.
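
As a rough illustration of the category-based idea, the sketch below uses Character.getType to find initial and final quote punctuation and replaces each with an ASCII quote. The QuoteCategories class, the normalizeQuotes method, and the short list deciding which marks count as "single" are assumptions for this example, not part of the answer. Note that some quote-like characters, such as U+201A and U+201E, are in a different category (open punctuation) and would need separate handling.

```java
final class QuoteCategories {
    /** Replaces initial/final quote punctuation with ASCII quotes; other characters pass through. */
    static String normalizeQuotes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int type = Character.getType(c);
            if (type == Character.INITIAL_QUOTE_PUNCTUATION
                    || type == Character.FINAL_QUOTE_PUNCTUATION) {
                out.append(isSingle(c) ? '\'' : '"');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Assumption: a hand-picked list of single-quote-like marks; extend as needed.
    private static boolean isSingle(char c) {
        return "\u2018\u2019\u201b\u2039\u203a".indexOf(c) >= 0;
    }
}
```

The same loop could be extended with further branches for other categories, for example mapping selected OTHER_PUNCTUATION characters such as PRIME to an apostrophe.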

Answer 2

Answered by Marek Stój

I found a pretty extensive table that maps Unicode punctuation to their closest ASCII equivalents.

Here's more info: Map Symbols & Punctuation to ASCII.

Answer 3

Answered by schmmd

I followed @marek-stoj's link and created a Scala application that cleans Unicode out of strings while maintaining the string length. It removes diacritics (accents) and uses the map suggested by @marek-stoj to convert non-ASCII Unicode characters to their ASCII approximations.

import java.text.Normalizer

object Asciifier {
  def apply(string: String) = {
    var cleaned = string
    for ((unicode, ascii) <- substitutions) {
      // literal replacement; replaceAll would interpret its arguments
      // as regex and replacement syntax
      cleaned = cleaned.replace(unicode, ascii)
    }

    // convert diacritics to a two-character form (NFD)
    // http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html
    cleaned = Normalizer.normalize(cleaned, Normalizer.Form.NFD)

    // remove all characters that combine with the previous character
    // to form a diacritic.  Also remove control characters.
    // http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
    cleaned = cleaned.replaceAll("[\\p{InCombiningDiacriticalMarks}\\p{Cntrl}]", "")

    // size must not change (note: this fails if the input contains control
    // characters such as newlines, since they are removed above)
    require(cleaned.size == string.size)

    cleaned
  }

  val substitutions = Set(
      (0x00AB, '"'),
      (0x00AD, '-'),
      (0x00B4, '\''),
      (0x00BB, '"'),
      (0x00F7, '/'),
      (0x01C0, '|'),
      (0x01C3, '!'),
      (0x02B9, '\''),
      (0x02BA, '"'),
      (0x02BC, '\''),
      (0x02C4, '^'),
      (0x02C6, '^'),
      (0x02C8, '\''),
      (0x02CB, '`'),
      (0x02CD, '_'),
      (0x02DC, '~'),
      (0x0300, '`'),
      (0x0301, '\''),
      (0x0302, '^'),
      (0x0303, '~'),
      (0x030B, '"'),
      (0x030E, '"'),
      (0x0331, '_'),
      (0x0332, '_'),
      (0x0338, '/'),
      (0x0589, ':'),
      (0x05C0, '|'),
      (0x05C3, ':'),
      (0x066A, '%'),
      (0x066D, '*'),
      (0x200B, ' '),
      (0x2010, '-'),
      (0x2011, '-'),
      (0x2012, '-'),
      (0x2013, '-'),
      (0x2014, '-'),
      (0x2015, '-'),
      (0x2016, '|'),
      (0x2017, '_'),
      (0x2018, '\''),
      (0x2019, '\''),
      (0x201A, ','),
      (0x201B, '\''),
      (0x201C, '"'),
      (0x201D, '"'),
      (0x201E, '"'),
      (0x201F, '"'),
      (0x2032, '\''),
      (0x2033, '"'),
      (0x2034, '\''),
      (0x2035, '`'),
      (0x2036, '"'),
      (0x2037, '\''),
      (0x2038, '^'),
      (0x2039, '<'),
      (0x203A, '>'),
      (0x203D, '?'),
      (0x2044, '/'),
      (0x204E, '*'),
      (0x2052, '%'),
      (0x2053, '~'),
      (0x2060, ' '),
      (0x20E5, '\\'),
      (0x2212, '-'),
      (0x2215, '/'),
      (0x2216, '\\'),
      (0x2217, '*'),
      (0x2223, '|'),
      (0x2236, ':'),
      (0x223C, '~'),
      (0x2264, '<'),
      (0x2265, '>'),
      (0x2266, '<'),
      (0x2267, '>'),
      (0x2303, '^'),
      (0x2329, '<'),
      (0x232A, '>'),
      (0x266F, '#'),
      (0x2731, '*'),
      (0x2758, '|'),
      (0x2762, '!'),
      (0x27E6, '['),
      (0x27E8, '<'),
      (0x27E9, '>'),
      (0x2983, '{'),
      (0x2984, '}'),
      (0x3003, '"'),
      (0x3008, '<'),
      (0x3009, '>'),
      (0x301B, ']'),
      (0x301C, '~'),
      (0x301D, '"'),
      (0x301E, '"'),
      (0xFEFF, ' ')).map { case (unicode, ascii) => (unicode.toChar.toString, ascii.toString) }
}
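
Since the question is about Java, the diacritic-removal step above can also be sketched directly in Java. This is only an illustration of the NFD-then-strip technique using java.text.Normalizer; the DiacriticStripper class and strip method are names invented for this example, and the punctuation table is left out.

```java
import java.text.Normalizer;

final class DiacriticStripper {
    /** Removes combining diacritical marks, leaving the base letters. */
    static String strip(String text) {
        // Decompose accented letters into base letter + combining mark (NFD)
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Drop the combining marks that NFD split off
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}
```

For example, DiacriticStripper.strip("aáeéiíoóuú") returns "aaeeiioouu".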

Answer 4

Answered by vz0

While this does not exactly answer your question, you can convert your Unicode text to US-ASCII, replacing non-ASCII characters with '?' symbols.

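import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

String input = "aáeéiíoóuú"; // 10 chars.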

Charset ch = Charset.forName("US-ASCII");
CharsetEncoder enc = ch.newEncoder();
enc.onUnmappableCharacter(CodingErrorAction.REPLACE);
enc.replaceWith(new byte[]{'?'});

ByteBuffer out = null;

try {
    out = enc.encode(CharBuffer.wrap(input));
} catch (CharacterCodingException e) { 
    /* ignored, shouldn't happen */ 
}

String outStr = ch.decode(out).toString();

// Prints "a?e?i?o?u?"
System.out.println(outStr);

Answer 5

Answered by Stephen P

What I've done for similar substitutions is create a Map (usually a HashMap) with the Unicode characters as the keys and their substitutes as the values.

A sketch in Java; the for loop depends on what sort of character container you're using as a parameter to the method that does this, e.g. String, CharSequence, etc. Here inputString is assumed to be a String and xlateMap a Map<Character, Character>.

StringBuilder output = new StringBuilder();
for (int i = 0; i < inputString.length(); i++)
{
    char c = inputString.charAt( i );
    // look up the substitute; null means the character has no mapping
    Character replacement = xlateMap.get( c );
    output.append( replacement != null ? replacement : c );
}
return output.toString();

Anything in the Map is replaced, anything not in the Map is unchanged and copied to output.
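
To make the behaviour concrete, here is a small self-contained version of the map-based approach. The MapTranslator class, the translate method, and the handful of map entries are assumptions chosen for illustration; a real table would be much larger.

```java
import java.util.HashMap;
import java.util.Map;

final class MapTranslator {
    // Hypothetical translation table: Unicode punctuation -> ASCII substitute
    private static final Map<Character, Character> XLATE = new HashMap<>();
    static {
        XLATE.put('\u2018', '\''); // left single quotation mark
        XLATE.put('\u2019', '\''); // right single quotation mark
        XLATE.put('\u201c', '"');  // left double quotation mark
        XLATE.put('\u201d', '"');  // right double quotation mark
        XLATE.put('\u2013', '-');  // en dash
    }

    /** Replaces mapped characters and copies everything else unchanged. */
    static String translate(CharSequence input) {
        StringBuilder output = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Character replacement = XLATE.get(c);
            output.append(replacement != null ? replacement.charValue() : c);
        }
        return output.toString();
    }
}
```

Because each character maps to exactly one character, the output always has the same length as the input, which is convenient when token offsets must be preserved.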

Answer 6

Answered by JohnMudd

Here's a Python package that does a good job. It's based on a Perl module Text::Unidecode. I assume this could be ported to Java.

http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/

http://pypi.python.org/pypi/Unidecode
