Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1380690/

Time: 2020-08-25 02:17:07  Source: igfitidea

What is "ANSI as UTF-8" and how can I make fputcsv() generate UTF-8 w/BOM?

Tags: php, utf-8, character-encoding, notepad++

Asked by Petruza

I made a PHP script that generates CSV files that were previously generated by another process. And then, the CSV files have to be imported by yet another process.

The import of the old CSV files works fine, but when importing the new CSV files there are issues with special characters.

When I open old CSVs with Notepad++, it says the encoding is UTF-8, and when I open the new CSVs with it, it says their encoding is 'ANSI as UTF-8'.

What's the difference between the two?

And how can I make fopen and fputcsv use the 'pure?' UTF-8 encoding?

Thanks!

Answered by Alan Moore

There's nothing wrong with the file. "ANSI as UTF-8" means there's no BOM but Notepad++ has definitely identified the encoding as UTF-8 by analyzing byte patterns. I tested this by creating a file with Russian, Greek and Polish text in it and saving it as UTF-8 without a BOM. Here it is:

# Russian
Следующая

# Greek
Επόμενη

# Polish
Więcej

I did this in a different editor (EditPad Pro) and used hex mode to make sure the BOM wasn't there. When I opened it in NPP it showed the encoding as "ANSI as UTF-8" and all of the characters displayed correctly. Then, still in hex mode, I removed the first byte of the first Russian character. When I opened it in NPP again, it showed the encoding as "ANSI" and displayed the non-ASCII parts of the text as mojibake:

; Russian
?D?DμD′?????‰D°?

; Greek
????????μ???·

; Polish
Wi??cej

Back to EditPad, and this time I added a BOM but didn't repair the Cyrillic character. This time NPP reported the encoding as "UTF-8" and everything displayed correctly except that first Russian character, as shown below. "A1" is the hex representation of what should have been the second byte of that character in UTF-8. It was displayed in an inverted color scheme to indicate an error.

# Russian
A1ледующая

# Greek
Επόμενη

# Polish
Więcej

To summarize: In the absence of a BOM, Notepad++ looks for bytes that can't represent ASCII characters because their values are greater than 127 (7F hex). If it finds any, and they all conform to the patterns required by UTF-8, it decodes the file as UTF-8 and reports the encoding in the status bar as "ANSI as UTF-8".

But if it finds even one byte that doesn't toe the UTF-8 line, it decodes the file as "ANSI", meaning the default single-byte encoding for the underlying platform. If your file had been corrupted, that's what you would be seeing.
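
The decision rules summarized above can be sketched in PHP. This is only a rough approximation of the behavior described in this answer, not Notepad++'s actual implementation, and the function name `guess_npp_label` is made up for illustration.

```php
<?php
// Approximates the status-bar label described above for a given byte string.
// Hypothetical helper; not taken from Notepad++ itself.
function guess_npp_label(string $bytes): string
{
    // A leading EF BB BF is the UTF-8 BOM: reported as plain "UTF-8".
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    // No byte above 0x7F means pure ASCII, so there is nothing to detect.
    if (preg_match('/[\x80-\xFF]/', $bytes) !== 1) {
        return 'ANSI';
    }
    // With the /u modifier, preg_match() fails (returns false) when the
    // subject is not valid UTF-8, so a result of 1 means every
    // non-ASCII byte follows the UTF-8 patterns.
    if (preg_match('//u', $bytes) === 1) {
        return 'ANSI as UTF-8';
    }
    // At least one byte breaks the UTF-8 rules.
    return 'ANSI';
}
```

For example, the byte string `"\xA1\xD0\xBB"` (a Cyrillic character with its first byte removed, followed by an intact one) falls through to the last branch, which matches the corrupted-file experiment described earlier.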

EDIT: Although your file is valid without it, you could add a BOM by manually writing the three bytes "EF BB BF" at the very beginning of the file, but there should be a better way. How are you generating the content now? Because it is UTF-8, with at least one non-ASCII character in there somewhere; otherwise, NPP would report it as "ANSI".
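
As a minimal sketch of the manual approach, the three BOM bytes can be written with fwrite() right after fopen() and before the first fputcsv() call. The helper name `write_csv_with_bom` and its parameters are hypothetical, and the sketch assumes every string in the rows is already valid UTF-8; it does no conversion.

```php
<?php
// Hypothetical helper: writes $rows as CSV to $path, preceded by a UTF-8 BOM.
// Assumes all strings in $rows are already UTF-8 encoded.
function write_csv_with_bom(string $path, array $rows): void
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    // The UTF-8 BOM is the three bytes EF BB BF; write it once, first.
    fwrite($handle, "\xEF\xBB\xBF");
    foreach ($rows as $row) {
        // Separator, enclosure and escape are passed explicitly.
        fputcsv($handle, $row, ',', '"', '\\');
    }
    fclose($handle);
}
```

A call such as `write_csv_with_bom('export.csv', [['id', 'name'], ['1', 'Więcej']])` should produce a file that Notepad++ reports as "UTF-8" rather than "ANSI as UTF-8", provided the PHP source file itself is saved as UTF-8.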

Another possibility to consider: if you have any influence over the process that consumes your CSV file, maybe you can configure it to expect UTF-8 without a BOM. Technically, any software that can decode UTF-8 with a BOM but not without one is broken. The Unicode Consortium actually discourages use of the UTF-8 BOM, not that anyone's listening.
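
If the consuming process happens to be written in PHP as well (an assumption; the question does not say), one way to make it indifferent to the BOM is to skip the three bytes when they are present. The function name `read_csv_any_bom` is hypothetical.

```php
<?php
// Hypothetical reader: accepts UTF-8 CSV with or without a leading BOM.
function read_csv_any_bom(string $path): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for reading");
    }
    // Peek at the first three bytes; rewind unless they are the BOM.
    if (fread($handle, 3) !== "\xEF\xBB\xBF") {
        rewind($handle);
    }
    $rows = [];
    // A length of 0 means the line length is not limited.
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($handle);
    return $rows;
}
```

With a reader like this, files written with and without the BOM yield the same rows, so the producer does not need to care which variant it emits.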

Answered by Henrik Opel

According to the Notepad++ related threads here and here, 'ANSI as UTF-8' indicates UTF-8 without BOM, while a plain 'UTF-8' means UTF-8 with BOM. So maybe the process reading the CSV needs the byte-order mark to correctly read the CSV as UTF-8.

But before going into that, make sure that your script actually writes UTF-8! When you open the new CSVs in Notepad++ (and it says 'ANSI as UTF-8'), are all 'special' characters displayed correctly? If not, you need to adapt your script to actually write UTF-8, if yes, check for the BOM difference.

Answered by Havenard

Try changing your PHP script to UTF-8 too. Sometimes it is necessary (although it can be worked around) to have the script in the same character encoding as the data.

Similar problem: PHP: Explode using special characters

Answered by icc97

It is worth noting that "ANSI as UTF-8", i.e. UTF-8 without the BOM, is useful if you are formatting your PHP files as UTF-8. If your PHP file is outputting HTML to the browser, then the BOM is included in the HTML output, which the W3C validator explicitly warns against:

Byte-Order Mark found in UTF-8 File.

The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.

Further to this, I spotted that the BOM confuses Firefox's Firebug, which now thinks that all your <head> content is actually in the <body> tag.
