Java: Writing UTF-8 without BOM
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/19768763/
Writing UTF-8 without BOM
Asked by Mawia
This code,
OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
out.write("A".getBytes());
And this,
OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
out.write("A".getBytes(StandardCharsets.UTF_8));
produce the same result (in my opinion), which is UTF-8 without BOM. However, Notepad++ is not showing any information about the encoding. I expected Notepad++ to show "Encode in UTF-8 without BOM" here, but nothing is selected in the "Encoding" menu.
Now, this code writes the file as UTF-8 with a BOM.
OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
byte[] bom = { (byte) 239, (byte) 187, (byte) 191 };
out.write(bom);
out.write("A".getBytes());
Notepad++ also displays the encoding type as "Encode in UTF-8".
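To make the difference between the snippets concrete, here is a minimal sketch that checks whether a byte sequence starts with the three BOM bytes used in the question (239, 187, 191, i.e. 0xEF 0xBB 0xBF). The class name BomCheck and the method startsWithUtf8Bom are hypothetical names chosen for illustration; this is not how Notepad++ itself is implemented.

```java
import java.nio.charset.StandardCharsets;

public class BomCheck {
    // U+FEFF encoded in UTF-8: 0xEF 0xBB 0xBF (239, 187, 191 in decimal).
    static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    /** Hypothetical helper: true if the data begins with the UTF-8 BOM bytes. */
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= UTF8_BOM.length
            && data[0] == UTF8_BOM[0]
            && data[1] == UTF8_BOM[1]
            && data[2] == UTF8_BOM[2];
    }

    public static void main(String[] args) {
        // What the first two snippets write: just the byte for "A".
        byte[] plain = "A".getBytes(StandardCharsets.UTF_8);
        // What the third snippet writes: BOM bytes followed by "A".
        byte[] withBom = { (byte) 239, (byte) 187, (byte) 191, 65 };

        System.out.println(startsWithUtf8Bom(plain));   // false
        System.out.println(startsWithUtf8Bom(withBom)); // true
    }
}
```

Only the third snippet leaves an unambiguous marker at the start of the file; the first two leave nothing for an editor to key on.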
Question: What is wrong with the first two snippets, which are supposed to write the file in UTF-8 without BOM? Is my Java code doing the right thing? If so, is there a problem with how Notepad++ detects the encoding?
Is Notepad++ only guessing?
Accepted answer by Joachim Sauer
"A" written using UTF-8 without a BOM produces exactly the same file as "A" written using ASCII or ISO-8859-* or any other ASCII-compatible encoding. That file contains a single byte with the decimal value 65.
Think of it this way:
- "A".getBytes("UTF-8") returns a new byte[] { 65 }
- "A".getBytes("ISO-8859-1") returns a new byte[] { 65 }
- You write the results of those calls into a file
- How is the consumer of the file supposed to distinguish the two?
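The point of the list above can be checked directly. The following is a minimal sketch (the class name SameBytes and the helper sameInAsciiCompatibleCharsets are hypothetical) that encodes "A" with three ASCII-compatible charsets from java.nio.charset.StandardCharsets and compares the resulting byte arrays.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SameBytes {
    /** Hypothetical helper: true if UTF-8, ISO-8859-1 and US-ASCII all yield the same bytes. */
    static boolean sameInAsciiCompatibleCharsets(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(utf8, latin1) && Arrays.equals(utf8, ascii);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_8)));      // [65]
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.ISO_8859_1))); // [65]
        System.out.println(sameInAsciiCompatibleCharsets("A"));                         // true
    }
}
```

Since the bytes are identical, a reader of the file has no basis for preferring one of these encodings over another.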
There's nothing in that file that suggests that UTF-8 needs to be used to decode it.
Try writing "Käsekuchen" or something else that's not encodable in ASCII and see if Notepad++ guesses the encoding correctly (because that's exactly what it does: it makes an educated guess; there's no metadata that tells it which encoding to use).
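As a rough illustration of why non-ASCII text gives a detector something to work with, the sketch below encodes the word in UTF-8 and in ISO-8859-1 and then checks each result with a strict UTF-8 decoder. The class EncodingGuess and the method isValidUtf8 are hypothetical names; this only shows one plausible heuristic and makes no claim about what Notepad++ actually does internally.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuess {
    /** Hypothetical heuristic: true if the bytes decode as UTF-8 without any error. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String word = "K\u00E4sekuchen"; // \u00E4 is the letter a with umlaut

        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);        // umlaut becomes 0xC3 0xA4
        byte[] latin1 = word.getBytes(StandardCharsets.ISO_8859_1); // umlaut becomes 0xE4

        System.out.println(utf8.length);         // 11
        System.out.println(latin1.length);       // 10
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false
    }
}
```

A pure-ASCII file passes the same check under every ASCII-compatible encoding, which is why it cannot be classified this way.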
Answer by HookUp
I do not know if my answer is correct, but let me share my understanding here.
As explained above, if you simply write "A", Notepad++ has no way to tell which encoding was used. If you want Notepad++ to show "Encode in UTF-8 without BOM", as shown in the figure below, then you must fool Notepad++, which you can do using the following piece of code
If you want Notepad++ to show "Encode in UTF-8", then you should remove the substring part from osw.write("\uFEFF"), because this is the BOM character you are trying to insert. When you insert this character, the file's encoding type becomes "Encode in UTF-8"; when you remove it programmatically, it becomes "Encode in UTF-8 without BOM", because the BOM character is gone.
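The answer's own code is not reproduced here, so the following is only a sketch of the general idea under stated assumptions: an OutputStreamWriter with UTF-8 does not add a BOM by itself, and writing the character "\uFEFF" first produces the bytes 0xEF 0xBB 0xBF. The class BomWriterSketch and the method encode are hypothetical, and a ByteArrayOutputStream stands in for the FileOutputStream so the example runs without touching the file system.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomWriterSketch {
    /** Hypothetical helper: encodes text as UTF-8, optionally preceded by a BOM. */
    static byte[] encode(String text, boolean withBom) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer osw = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            if (withBom) {
                osw.write("\uFEFF"); // U+FEFF, written as EF BB BF in UTF-8
            }
            osw.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode("A", false))); // [65]
        System.out.println(Arrays.toString(encode("A", true)));  // [-17, -69, -65, 65]
    }
}
```

The negative numbers in the second line are simply 0xEF, 0xBB and 0xBF printed as signed Java bytes.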
Another thing you have to do is change the Notepad++ preferences as shown below. Only then will Notepad++ be able to recognize the encoding you want.
However, if you simply write text, Notepad++ will treat it as "ANSI".
I hope my explanation is clear and that my analysis helps someone. Note that this approach is a workaround and is not recommended, but it works when you have no other option.
If you do not want to change your Notepad++ preferences and still want the encoding to be reported as "Encode in UTF-8 without BOM", then you must do something like this,
I have explained the same thing, probably in a better way, in my blog here