Java: writing UTF-8 without a BOM

Question

Asked by Mawia

This code,

try (OutputStream out = new FileOutputStream(new File("C:/file/test.txt"))) {
    out.write("A".getBytes()); // no charset given: uses the platform default
}

And this,

try (OutputStream out = new FileOutputStream(new File("C:/file/test.txt"))) {
    out.write("A".getBytes(StandardCharsets.UTF_8)); // explicit UTF-8
}

produce the same result (as far as I can tell), namely UTF-8 without a BOM. However, Notepad++ does not show any information about the encoding. I expected Notepad++ to show "Encode in UTF-8 without BOM", but no encoding is selected in the "Encoding" menu.

Now, this code writes the file as UTF-8 with a BOM.

try (OutputStream out = new FileOutputStream(new File("C:/file/test.txt"))) {
    byte[] bom = { (byte) 239, (byte) 187, (byte) 191 }; // EF BB BF, the UTF-8 BOM
    out.write(bom);
    out.write("A".getBytes());
}

Notepad++ then also displays the encoding as "Encode in UTF-8".

Question: What is wrong with the first two snippets, which are supposed to write the file as UTF-8 without a BOM? Is my Java code doing the right thing? If so, is there a problem with how Notepad++ detects the encoding?

Is Notepad++ only guessing?

Answer 1

Accepted answer by Joachim Sauer

"A" written using UTF-8 without a BOM produces exactly the same file as "A" written using ASCII, ISO-8859-*, or any other ASCII-compatible encoding. That file contains a single byte with the decimal value 65.

Think of it this way:

"A".getBytes("UTF-8") returns a new byte[] { 65 }
"A".getBytes("ISO-8859-1") returns a new byte[] { 65 }
You write the results of those calls into a file
How is the consumer of the file supposed to distinguish the two?

There's nothing in that file that suggests that UTF-8 needs to be used to decode it.
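
The point above can be checked directly. The following is a minimal sketch, not part of the original answer; the class name SameBytesDemo and the helper sameBytes are made up for illustration. It encodes the same string with both charsets and compares the resulting byte arrays.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: compares the UTF-8 and ISO-8859-1 encodings of a string.
public class SameBytesDemo {

    // Returns true when both charsets produce identical bytes for the text.
    public static boolean sameBytes(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(utf8, latin1);
    }

    public static void main(String[] args) {
        // Both calls yield the single byte 65, so a reader cannot tell them apart.
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_8)));      // [65]
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.ISO_8859_1))); // [65]
        System.out.println(sameBytes("A")); // true
    }
}
```

For pure ASCII text the two encodings are byte-for-byte identical, which is why an editor has nothing to base its detection on.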

Try writing "Käsekuchen" or something else that's not encodable in ASCII and see if Notepad++ guesses the encoding correctly (because that's exactly what it does: it makes an educated guess; there's no metadata that tells it which encoding to use).
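
To see why non-ASCII text gives an editor something to work with, here is a small sketch under the same assumptions as before (the class name NonAsciiBytesDemo and its helpers are hypothetical). The "ä" is written as the escape \u00e4 so the example does not depend on the encoding of the source file itself.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows that non-ASCII text encodes differently per charset.
public class NonAsciiBytesDemo {

    // "Käsekuchen", with the umlaut written as a Unicode escape.
    public static final String TEXT = "K\u00e4sekuchen";

    public static byte[] utf8() {
        return TEXT.getBytes(StandardCharsets.UTF_8);
    }

    public static byte[] latin1() {
        return TEXT.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // UTF-8 needs two bytes (0xC3 0xA4) for the umlaut, ISO-8859-1 one (0xE4).
        System.out.println(utf8().length);   // 11
        System.out.println(latin1().length); // 10
    }
}
```

The two-byte sequence 0xC3 0xA4 is valid UTF-8, while the lone byte 0xE4 followed by an ASCII letter is not, so a detector has evidence to base its guess on. It remains a guess, as the answer says.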

Answer 2

Answer by HookUp

I do not know if my answer is correct, but let me share my understanding here.

As explained above, if you simply write "A", Notepad++ has no way to tell which encoding was used. If you want Notepad++ to show "Encode in UTF-8 without BOM", you must fool Notepad++, which you can do with the following piece of code:

[image: code screenshot]

If you want Notepad++ to show "Encode in UTF-8", you should remove the substring part from the osw.write("\uFEFF") call, because "\uFEFF" is the BOM character you are inserting. When this character is written, the file's encoding is shown as "Encode in UTF-8"; when it is removed programmatically, the encoding is shown as "Encode in UTF-8 without BOM", because the BOM character is no longer there.
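
The answer's own code was posted as an image and is not available, so the following is only a sketch of the general idea, not a reconstruction of it. It assumes osw is an OutputStreamWriter configured for UTF-8; the class name BomWriterSketch and the method encode are hypothetical. Writing the character "\uFEFF" first makes the encoder emit the three BOM bytes, and leaving it out produces plain UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: encodes text as UTF-8, optionally prefixed with a BOM.
public class BomWriterSketch {

    public static byte[] encode(String text, boolean withBom) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer osw = new OutputStreamWriter(buffer, StandardCharsets.UTF_8)) {
            if (withBom) {
                osw.write("\uFEFF"); // encoded by UTF-8 as the bytes EF BB BF
            }
            osw.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory buffer; rethrown unchecked for simplicity.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode("A", true)));  // [-17, -69, -65, 65]
        System.out.println(Arrays.toString(encode("A", false))); // [65]
    }
}
```

To write to disk instead, the ByteArrayOutputStream could be swapped for a FileOutputStream like the one in the question. The in-memory buffer is used here only so that the resulting bytes are easy to inspect.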

You also have to change the Notepad++ preferences as shown below; only then will Notepad++ be able to recognize the encoding you want.

[image: Notepad++ preferences screenshot]

However, if you simply write plain text, Notepad++ treats it as "ANSI".

I hope my explanation is clear and that this analysis helps someone. Note that this approach is a workaround and is not recommended, but it works when there is no other option.

If you do not want to change your Notepad++ preferences and still want the encoding to be shown as "Encode in UTF-8 without BOM", you must do something like this:

[image: code screenshot]

I have explained the same thing, probably in a better way, on my blog.
