Declaration: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me) on StackOverflow.
Original URL: http://stackoverflow.com/questions/2456380/
UTF-8 HTML and CSS files with BOM (and how to remove the BOM with Python)
Asked by Cameron
First, some background: I'm developing a web application using Python. All of my (text) files are currently stored in UTF-8 with the BOM. This includes all my HTML templates and CSS files. These resources are stored as binary data (BOM and all) in my DB.
When I retrieve the templates from the DB, I decode them using template.decode('utf-8'). When the HTML arrives in the browser, the BOM is present at the beginning of the HTTP response body. This generates a very interesting error in Chrome:
Extra <html> encountered. Migrating attributes back to the original <html> element and ignoring the tag.
Chrome seems to generate an <html> tag automatically when it sees the BOM and mistakes it for content, making the real <html> tag an error.
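To see concretely why Chrome complains: decoding with the plain 'utf-8' codec does not strip the BOM, so it survives as a U+FEFF character at the start of the response body. A minimal sketch (in Python 3 syntax; the question itself uses Python 2.5):

```python
# A BOM-prefixed UTF-8 template, as it might come out of the DB.
raw = b'\xef\xbb\xbf<html><body>Hello</body></html>'

# Plain 'utf-8' decoding keeps the BOM as a leading U+FEFF character...
decoded = raw.decode('utf-8')
print(repr(decoded[0]))              # '\ufeff' -- this is what trips up Chrome

# ...so the response body no longer starts with the real <html> tag.
print(decoded.startswith('<html>'))  # False
```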
So, using Python, what is the best way to remove the BOM from my UTF-8 encoded templates (if it exists -- I can't guarantee this in the future)?
For other text-based files like CSS, will major browsers correctly interpret (or ignore) the BOM? They are being sent as plain binary data, without .decode('utf-8').
Note: I am using Python 2.5.
Thanks!
Answered by Mark Tolonen
Since you state:
All of my (text) files are currently stored in UTF-8 with the BOM
then use the 'utf-8-sig' codec to decode them:
>>> s = u'Hello, world!'.encode('utf-8-sig')
>>> s
'\xef\xbb\xbfHello, world!'
>>> s.decode('utf-8-sig')
u'Hello, world!'
It automatically removes the expected BOM, and works correctly if the BOM is not present as well.
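Crucially for the questioner's "if it exists" requirement, 'utf-8-sig' is also a safe no-op when the BOM is absent. A small sketch (Python 3 syntax; under Python 2.5 the decoded results would be u'...' strings):

```python
with_bom = b'\xef\xbb\xbfHello, world!'
without_bom = b'Hello, world!'

# 'utf-8-sig' strips a leading BOM if present, and is a no-op otherwise.
print(with_bom.decode('utf-8-sig'))     # Hello, world!
print(without_bom.decode('utf-8-sig'))  # Hello, world!

# Plain 'utf-8' would instead leave U+FEFF at the front.
print(with_bom.decode('utf-8') == '\ufeffHello, world!')  # True
```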
Answered by Ignacio Vazquez-Abrams
Check the first character after decoding to see if it's the BOM:
if u.startswith(u'\ufeff'):
    u = u[1:]
Answered by John Machin
The previously-accepted answer is WRONG.
u'\ufffe' is not a character. If you get it in a unicode string, somebody has stuffed up mightily.
The BOM (aka ZERO WIDTH NO-BREAK SPACE) is u'\ufeff':
>>> UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
>>> UNICODE_BOM
u'\ufeff'
>>>
Read this (Ctrl-F search for BOM) and this and this (Ctrl-F search for BOM).
Here's a correct and typo/braino-resistant answer:
Decode your input into unicode_str. Then do this:
# If I mistype the following, it's very likely to cause a SyntaxError.
UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
if unicode_str and unicode_str[0] == UNICODE_BOM:
    unicode_str = unicode_str[1:]
Bonus: using a named constant gives your readers a bit more of a clue to what is going on than does a collection of seemingly-arbitrary hexoglyphics.
Update: Unfortunately there seems to be no suitable named constant in the standard Python library.
Alas, the codecs module provides only "a snare and a delusion":
>>> import pprint, codecs
>>> pprint.pprint([(k, getattr(codecs, k)) for k in dir(codecs) if k.startswith('BOM')])
[('BOM', '\xff\xfe'), #### aarrgghh!! ####
('BOM32_BE', '\xfe\xff'),
('BOM32_LE', '\xff\xfe'),
('BOM64_BE', '\x00\x00\xfe\xff'),
('BOM64_LE', '\xff\xfe\x00\x00'),
('BOM_BE', '\xfe\xff'),
('BOM_LE', '\xff\xfe'),
('BOM_UTF16', '\xff\xfe'),
('BOM_UTF16_BE', '\xfe\xff'),
('BOM_UTF16_LE', '\xff\xfe'),
('BOM_UTF32', '\xff\xfe\x00\x00'),
('BOM_UTF32_BE', '\x00\x00\xfe\xff'),
('BOM_UTF32_LE', '\xff\xfe\x00\x00'),
('BOM_UTF8', '\xef\xbb\xbf')]
>>>
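That said, the explicitly-named constants (codecs.BOM_UTF8 and friends) are perfectly usable for stripping a BOM from raw bytes before decoding. A minimal sketch, where strip_utf8_bom is a hypothetical helper name (Python 3 syntax, in which these constants are bytes objects):

```python
import codecs

def strip_utf8_bom(raw):
    """Drop a leading UTF-8 BOM from a bytes object, if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

print(strip_utf8_bom(b'\xef\xbb\xbfabc'))  # b'abc'
print(strip_utf8_bom(b'abc'))              # b'abc'
```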
Update 2: If you have not yet decoded your input, and wish to check it for a BOM, you need to check for TWO different BOMs for UTF-16 and at least TWO different BOMs for UTF-32. If there were only one way each, then you wouldn't need a BOM, would you?
Here, verbatim and unprettified from my own code, is my solution:
def check_for_bom(s):
    bom_info = (
        ('\xFF\xFE\x00\x00', 4, 'UTF-32LE'),
        ('\x00\x00\xFE\xFF', 4, 'UTF-32BE'),
        ('\xEF\xBB\xBF', 3, 'UTF-8'),
        ('\xFF\xFE', 2, 'UTF-16LE'),
        ('\xFE\xFF', 2, 'UTF-16BE'),
    )
    for sig, siglen, enc in bom_info:
        if s.startswith(sig):
            return enc, siglen
    return None, 0
The input s should be at least the first 4 bytes of your input. It returns the encoding that can be used to decode the post-BOM part of your input, plus the length of the BOM (if any).
If you are paranoid, you could allow for another 2 (non-standard) UTF-32 orderings, but Python doesn't supply an encoding for them and I've never heard of an actual occurrence, so I don't bother.
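For illustration, here is the same detection logic rendered in Python 3 (byte-string literals rather than Python 2 str), together with a usage example; everything outside the answerer's check_for_bom is illustrative:

```python
def check_for_bom(s):
    # Longer signatures first: UTF-32LE's BOM begins with UTF-16LE's.
    bom_info = (
        (b'\xFF\xFE\x00\x00', 4, 'UTF-32LE'),
        (b'\x00\x00\xFE\xFF', 4, 'UTF-32BE'),
        (b'\xEF\xBB\xBF', 3, 'UTF-8'),
        (b'\xFF\xFE', 2, 'UTF-16LE'),
        (b'\xFE\xFF', 2, 'UTF-16BE'),
    )
    for sig, siglen, enc in bom_info:
        if s.startswith(sig):
            return enc, siglen
    return None, 0

# Prepend a UTF-16LE BOM by hand, then detect and decode the rest.
raw = b'\xff\xfe' + 'Hello'.encode('utf-16-le')
enc, siglen = check_for_bom(raw)
print(enc, siglen)               # UTF-16LE 2
print(raw[siglen:].decode(enc))  # Hello
```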
Answered by pajton
You can use something similar to remove BOM:
import os, codecs

def remove_bom_from_file(filename, newfilename):
    if os.path.isfile(filename):
        # open the file and read its first 4 bytes
        f = open(filename, 'rb')
        header = f.read(4)
        # check if we have a BOM (longer UTF-32 signature first)...
        bom_len = 0
        encodings = [(codecs.BOM_UTF32, 4),
                     (codecs.BOM_UTF16, 2),
                     (codecs.BOM_UTF8, 3)]
        # ... and note how many bytes to skip
        for h, l in encodings:
            if header.startswith(h):
                bom_len = l
                break
        # skip the BOM and copy the rest of the file
        f.seek(bom_len)
        contents = f.read()
        f.close()
        nf = open(newfilename, 'wb')
        nf.write(contents)
        nf.close()
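A self-contained round-trip sketch of the same detect-and-skip idea, using a temporary file so it can be run anywhere (the file contents and names are illustrative, in Python 3 syntax):

```python
import codecs
import os
import tempfile

# Write a BOM-prefixed CSS file to disk (illustrative contents).
src = tempfile.NamedTemporaryFile(delete=False, suffix='.css')
src.write(codecs.BOM_UTF8 + b'body { color: red; }')
src.close()

# Detect the signature, skip it, and keep the rest, checking the
# longer UTF-32 signature before the UTF-16 one it starts with.
with open(src.name, 'rb') as f:
    header = f.read(4)
    bom_len = 0
    for sig, length in ((codecs.BOM_UTF32, 4),
                        (codecs.BOM_UTF8, 3),
                        (codecs.BOM_UTF16, 2)):
        if header.startswith(sig):
            bom_len = length
            break
    f.seek(bom_len)
    contents = f.read()

os.unlink(src.name)
print(contents)  # b'body { color: red; }'
```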