Converting UTF-8 to ISO-8859-1 in Java

Question

Asked by Chocula

I am reading an XML document (UTF-8) and ultimately displaying the content on a Web page using ISO-8859-1. As expected, there are a few characters that are not displayed correctly, such as “, – and ' (they display as ?).

Is it possible to convert these characters from UTF-8 to ISO-8859-1?

Here is a snippet of code I have written to attempt this:

BufferedReader br = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
StringBuilder sb = new StringBuilder();

String line = null;
while ((line = br.readLine()) != null) {
  sb.append(line);
}
br.close();

byte[] latin1 = sb.toString().getBytes("ISO-8859-1");

return new String(latin1);

I'm not quite sure what's going awry, but I believe it's readLine() that's causing the grief (since the strings would be Java/UTF-16 encoded?). Another variation I tried was to replace latin1 with

byte[] latin1 = new String(sb.toString().getBytes("UTF-8")).getBytes("ISO-8859-1");

I have read previous posts on the subject and I'm learning as I go. Thanks in advance for your help.

Answer 1

Accepted answer by McDowell

I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines, but don't quote me.

The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them using escape sequences as shown here:

public final class HtmlEncoder {
  private HtmlEncoder() {}

  public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
      T out) throws java.io.IOException {
    for (int i = 0; i < sequence.length(); i++) {
      char ch = sequence.charAt(i);
      if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
        out.append(ch);
      } else {
        int codepoint = Character.codePointAt(sequence, i);
        // handle supplementary range chars
        i += Character.charCount(codepoint) - 1;
        // emit entity
        out.append("&#x");
        out.append(Integer.toHexString(codepoint));
        out.append(";");
      }
    }
    return out;
  }
}

Example usage:

String foo = "This is Cyrillic Ya: \u044F\n"
    + "This is fraktur G: \uD835\uDD0A\n" + "This is a smart quote: \u201C";

StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());

Above, the character LEFT DOUBLE QUOTATION MARK (U+201C “) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.

Care needs to be taken with this approach. If your text also needs to be escaped for HTML, do that before running the above code; otherwise the ampersands in the generated entities end up being escaped too.
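
The ordering issue can be seen in a small sketch. Everything here is hypothetical and for illustration only: escapeHtml is a deliberately minimal escaper (a real application would more likely use a library one), and escapeNonAscii is a condensed stand-in for escapeNonLatin above.

```java
public final class EscapeOrderDemo {
    private EscapeOrderDemo() {}

    /** Hypothetical minimal HTML escaper: handles only &, <, > and ". */
    public static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Condensed stand-in for escapeNonLatin: code points above U+007F become &#x...; */
    public static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "a < b \u2013 c";
        // HTML escaping first, numeric entities second: the entity survives.
        System.out.println(escapeNonAscii(escapeHtml(text)));
        // Reversed order: the '&' of the generated entity is escaped again.
        System.out.println(escapeHtml(escapeNonAscii(text)));
    }
}
```

With the first ordering the en dash comes out as &#x2013;, while the reversed ordering turns it into &amp;#x2013;, which a browser would show as literal text.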

Answer 2

Answered by ZZ Coder

Depending on your default encoding, the following lines could cause a problem:

byte[] latin1 = sb.toString().getBytes("ISO-8859-1");

return new String(latin1);

In Java, String/char data is always UTF-16. A different encoding is only involved when you convert characters to bytes or bytes back to characters. Say your default encoding is UTF-8: new String(latin1) then treats the latin1 buffer as UTF-8, and because some Latin-1 byte sequences are invalid as UTF-8, you will get ?.
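
A minimal sketch of that effect, using only the JDK. The class and method names (CharsetDemo, toLatin1, decodeLatin1, decodeUtf8) are made up for illustration, and decodeUtf8 stands in for what new String(latin1) does when the platform default happens to be UTF-8.

```java
import java.nio.charset.StandardCharsets;

public final class CharsetDemo {
    private CharsetDemo() {}

    /** Encodes text as ISO-8859-1; characters it cannot represent become '?'. */
    public static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Decodes with the charset the bytes were actually written in. */
    public static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Decodes the same bytes as UTF-8, as a UTF-8 platform default would. */
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = toLatin1("caf\u00E9!");
        // Matching charset: the e-acute (byte 0xE9) round-trips intact.
        System.out.println(decodeLatin1(latin1));
        // Mismatched charset: 0xE9 is not valid UTF-8 on its own, so the
        // decoder substitutes U+FFFD (the replacement character).
        System.out.println(decodeUtf8(latin1));
    }
}
```

Passing the charset explicitly in both directions removes the dependence on the platform default, though it does not help with characters that ISO-8859-1 cannot represent at all.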

Answer 3

Answered by fbaligand

When you instantiate your String object, you need to indicate which encoding to use.

So replace:

return new String(latin1);

with

return new String(latin1, "ISO-8859-1");
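
Note that an explicit charset makes the round trip consistent but cannot rescue characters such as “ or – that ISO-8859-1 simply has no byte for. The sketch below shows one possible way to check this up front; the Latin1Check class and its method names are hypothetical, and only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

public final class Latin1Check {
    private Latin1Check() {}

    /** Returns true if every character in text has an ISO-8859-1 encoding. */
    public static boolean fitsLatin1(String text) {
        // A fresh encoder per call, since CharsetEncoder is not thread-safe.
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    /** Encodes to ISO-8859-1 and decodes again with the same charset. */
    public static String roundTrip(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("caf\u00E9"));       // e-acute is in ISO-8859-1
        System.out.println(fitsLatin1("\u201Cquoted\u201D")); // smart quotes are not
        System.out.println(roundTrip("\u201Cquoted\u201D"));  // quotes come back as '?'
    }
}
```

When fitsLatin1 reports false, something like the numeric character reference approach from the accepted answer is still needed to keep those characters.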

Answer 4

Answered by robinst

With Java 8, McDowell's answer can be simplified like this (while preserving correct handling of surrogate pairs):

import java.util.PrimitiveIterator;

public final class HtmlEncoder {
    private HtmlEncoder() {
    }

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
                                                          T out) throws java.io.IOException {
        for (PrimitiveIterator.OfInt iterator = sequence.codePoints().iterator(); iterator.hasNext(); ) {
            int codePoint = iterator.nextInt();
            if (Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append((char) codePoint);
            } else {
                out.append("&#x");
                out.append(Integer.toHexString(codePoint));
                out.append(";");
            }
        }
        return out;
    }
}
