How to find the default charset/encoding in Java?

Question

Asked by ZZ Coder

The obvious answer is to use Charset.defaultCharset(), but we recently found out that this might not be the right answer. I was told that on several occasions its result differed from the real default charset used by the java.io classes. It looks like Java keeps two separate default charsets. Does anyone have any insight into this issue?

We were able to reproduce one failure case. It is a kind of user error, but it may still expose the root cause of all the other problems. Here is the code:

import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class CharSetTest {

    public static void main(String[] args) {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        // Changing file.encoding at runtime is unsupported; done here only to reproduce the issue.
        System.setProperty("file.encoding", "Latin-1");
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("Default Charset in Use=" + getDefaultCharSet());
    }

    // Returns the encoding name an OutputStreamWriter picks when none is specified.
    private static String getDefaultCharSet() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        return writer.getEncoding();
    }
}

Our server requires Latin-1 as the default charset to deal with some mixed encodings (ANSI/Latin-1/UTF-8) in a legacy protocol. So all our servers run with this JVM parameter:

-Dfile.encoding=ISO-8859-1

Here is the result on Java 5:

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=ISO8859_1

Someone tried to change the encoding at runtime by setting file.encoding in the code. We all know that doesn't work. However, it apparently throws off defaultCharset(), even though it doesn't affect the real default charset used by OutputStreamWriter.

Is this a bug or a feature?

EDIT: The accepted answer shows the root cause of the issue. Basically, you can't trust defaultCharset() in Java 5, because its result is not necessarily the default encoding used by the I/O classes. It looks like Java 6 corrects this issue.

Answer 1

Accepted answer by bruno conde

This is really strange... Once set, the default Charset is cached and it isn't changed while the class is in memory. Setting the "file.encoding" property with System.setProperty("file.encoding", "Latin-1"); does nothing. Every time Charset.defaultCharset() is called, it returns the cached charset.

Here are my results:

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=ISO-8859-1
Default Charset in Use=ISO8859_1

I'm using JVM 1.6 though.

(update)

Ok. I did reproduce your bug with JVM 1.5.

Looking at the source code of 1.5, the cached default charset is never set. I don't know whether this is a bug or not, but 1.6 changes this implementation so that the cached charset is set and then reused:

JVM 1.5:

public static Charset defaultCharset() {
    synchronized (Charset.class) {
        if (defaultCharset == null) {
            java.security.PrivilegedAction pa =
                    new GetPropertyAction("file.encoding");
            String csn = (String) AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                return cs;
            return forName("UTF-8");
        }
        return defaultCharset;
    }
}

JVM 1.6:

public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            java.security.PrivilegedAction pa =
                    new GetPropertyAction("file.encoding");
            String csn = (String) AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}

When you set file.encoding=Latin-1 and then call Charset.defaultCharset() again, the cached default charset has not been set, so the method tries to look up a charset named Latin-1. That name is not found, because it is not a charset name Java recognizes, so the method returns the fallback, UTF-8.
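The lookup-with-fallback step described above can be sketched roughly as follows. This is only an illustration, not the JDK's code: the class CharsetLookupSketch and the helper resolveOrFallback are hypothetical names, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

final class CharsetLookupSketch {

    // Hypothetical helper: resolve a charset name, or return the fallback
    // when the name is null, syntactically illegal, or not supported by this JVM.
    static Charset resolveOrFallback(String name, Charset fallback) {
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalCharsetNameException e) {
            // Illegal name (for example, it contains a space): use the fallback.
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Print how each candidate name resolves on the running JVM.
        for (String name : new String[] {"ISO-8859-1", "latin1", "Latin-1"}) {
            System.out.println(name + " -> " + resolveOrFallback(name, StandardCharsets.UTF_8));
        }
    }
}
```

Running main on your own JVM shows which of the candidate names it actually recognizes; canonical names and registered aliases resolve, anything else ends up at the fallback.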

As for why I/O classes such as OutputStreamWriter return an unexpected result: the implementation of sun.nio.cs.StreamEncoder (which is used by these I/O classes) also differs between JVM 1.5 and JVM 1.6. The JVM 1.6 implementation relies on Charset.defaultCharset() to get the default encoding when none is provided to the I/O classes. The JVM 1.5 implementation uses a different method, Converters.getDefaultEncodingName(), to get the default charset. That method uses its own cache of the default charset, which is set during JVM initialization:

JVM 1.6:

public static StreamEncoder forOutputStreamWriter(OutputStream out,
        Object lock,
        String charsetName)
        throws UnsupportedEncodingException
{
    String csn = charsetName;
    if (csn == null)
        csn = Charset.defaultCharset().name();
    try {
        if (Charset.isSupported(csn))
            return new StreamEncoder(out, lock, Charset.forName(csn));
    } catch (IllegalCharsetNameException x) { }
    throw new UnsupportedEncodingException (csn);
}

JVM 1.5:

public static StreamEncoder forOutputStreamWriter(OutputStream out,
        Object lock,
        String charsetName)
        throws UnsupportedEncodingException
{
    String csn = charsetName;
    if (csn == null)
        csn = Converters.getDefaultEncodingName();
    if (!Converters.isCached(Converters.CHAR_TO_BYTE, csn)) {
        try {
            if (Charset.isSupported(csn))
                return new CharsetSE(out, lock, Charset.forName(csn));
        } catch (IllegalCharsetNameException x) { }
    }
    return new ConverterSE(out, lock, csn);
}

But I agree with the comments. You shouldn't rely on this property. It's an implementation detail.

Answer 2

Answered by McDowell

Is this a bug or feature?

Looks like undefined behaviour. I know that, in practice, you can change the default encoding using a command-line property, but I don't think it is defined what happens when you do so.

Bug ID: 4153515 on problems setting this property:

This is not a bug. The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.
The preferred way to change the default encoding used by the VM and the runtime system is to change the locale of the underlying platform before starting your Java program.

I cringe when I see people setting the encoding on the command line: you don't know what code that is going to affect.

If you do not want to use the default encoding, set the encoding you do want explicitly via the appropriate method/constructor.
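As a minimal sketch of what passing the encoding explicitly can look like, the example below names ISO-8859-1 in both the writer constructor and the String constructor, so the result does not depend on the platform default. The class and method names (ExplicitCharsetSketch, encodeLatin1, decodeLatin1) are made up for illustration, and StandardCharsets and UncheckedIOException assume Java 8 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetSketch {

    // Encode text as ISO-8859-1 regardless of the platform default.
    static byte[] encodeLatin1(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The charset is passed to the constructor instead of being left to the default.
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.ISO_8859_1)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) before the bytes are read.
        return out.toByteArray();
    }

    // Decode bytes as ISO-8859-1, again naming the charset explicitly.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeLatin1("caf\u00e9");
        System.out.println(bytes.length + " bytes, round trip: " + decodeLatin1(bytes));
    }
}
```

The same pattern applies to InputStreamReader, String.getBytes, and the other APIs that offer an overload taking a Charset or charset name.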

Answer 3

Answered by Sean Owen

First, Latin-1 is the same as ISO-8859-1, so, the default was already OK for you. Right?

You successfully set the encoding to ISO-8859-1 with your command line parameter. You also set it programmatically to "Latin-1", but, that's not a recognized value of a file encoding for Java. See http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc.html

When you do that, it looks like Charset falls back to UTF-8, judging from the source. That at least explains most of the behavior.

I don't know why OutputStreamWriter shows ISO8859_1. It delegates to closed-source sun.misc.* classes. I'm guessing it isn't quite dealing with encoding via the same mechanism, which is weird.

But of course you should always be specifying what encoding you mean in this code. I'd never rely on the platform default.

Answer 4

Answered by jarnbjo

The behaviour is not really that strange. Looking into the implementation of the classes, it is caused by:

Charset.defaultCharset() does not cache the determined character set in Java 5.
Setting the system property "file.encoding" and invoking Charset.defaultCharset() again causes a second evaluation of the system property. No character set with the name "Latin-1" is found, so Charset.defaultCharset() defaults to "UTF-8".
OutputStreamWriter, however, does cache the default character set and has probably already been used during VM initialization, so its default character set diverges from Charset.defaultCharset() if the system property "file.encoding" has been changed at runtime.

As already pointed out, it is not documented how the VM must behave in such a situation. The Charset.defaultCharset() API documentation is not very precise about how the default character set is determined; it only mentions that this is usually done at VM startup, based on factors such as the OS default character set or the default locale.

Answer 5

Answered by neoedmund

Check

System.getProperty("sun.jnu.encoding")

It seems to be the same encoding as the one used on your system's command line.
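To compare the various candidates side by side on a given JVM, something like the following sketch may help. DefaultEncodingReport and snapshot are hypothetical names; "sun.jnu.encoding" and "file.encoding" are implementation-specific properties and may be null on some JVMs, and getEncoding() may return a historical name such as ISO8859_1 rather than the canonical one.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

final class DefaultEncodingReport {

    // Collect what this JVM reports as its default encoding through several routes.
    static Map<String, String> snapshot() {
        Map<String, String> report = new LinkedHashMap<>();
        report.put("Charset.defaultCharset()", Charset.defaultCharset().name());
        // A writer created without a charset uses whatever the I/O classes treat as default.
        // It wraps an in-memory stream, so leaving it unclosed is harmless here.
        report.put("OutputStreamWriter.getEncoding()",
                new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());
        // Implementation-specific properties; values may be null on some JVMs.
        report.put("file.encoding", System.getProperty("file.encoding"));
        report.put("sun.jnu.encoding", System.getProperty("sun.jnu.encoding"));
        return report;
    }

    public static void main(String[] args) {
        snapshot().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Running it with and without -Dfile.encoding=... on the command line shows which of the values that option influences on your particular JVM version.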

Answer 6

Answered by Davy Jones

I have set the VM argument in the WAS server to -Dfile.encoding=UTF-8 to change the server's default character set.
