u'\ufeff' in Python string
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/17912307/
Asked by James Hallen
I get an error with the following pattern:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128)
I'm not sure what u'\ufeff' is; it shows up when I'm web scraping. How can I remedy the situation? The .replace() string method doesn't work on it.
Accepted answer by Mark Tolonen
The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:
#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8') # encode without BOM
e8s = u.encode('utf-8-sig') # encode with BOM
e16 = u.encode('utf-16') # encode with BOM
e16le = u.encode('utf-16le') # encode without BOM
e16be = u.encode('utf-16be') # encode without BOM
print 'utf-8 %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16 %r' % e16
print 'utf-16le %r' % e16le
print 'utf-16be %r' % e16be
print
print 'utf-8 w/ BOM decoded with utf-8 %r' % e8s.decode('utf-8')
print 'utf-8 w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16 %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le %r' % e16.decode('utf-16le')
Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8 and serves only as a signature (usually on Windows).
Output:
utf-8 'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16 '\xff\xfeA\x00B\x00C\x00' # Adds BOM and encodes using native processor endian-ness.
utf-16le 'A\x00B\x00C\x00'
utf-16be '\x00A\x00B\x00C'
utf-8 w/ BOM decoded with utf-8 u'\ufeffABC' # doesn't remove BOM if present.
utf-8 w/ BOM decoded with utf-8-sig u'ABC' # removes BOM if present.
utf-16 w/ BOM decoded with utf-16 u'ABC' # *requires* BOM to be present.
utf-16 w/ BOM decoded with utf-16le u'\ufeffABC' # doesn't remove BOM if present.
Note that the utf-16 codec relies on the BOM to determine the byte order; without one, Python has no way to tell from the data itself whether it is big- or little-endian.
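As a minimal Python 3 sketch of "decode with the right codec", assuming the scraped page is UTF-8: the helper name `decode_page` and the sample bytes are illustrative only and do not come from the original answer.

```python
def decode_page(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    # 'utf-8-sig' strips a leading EF BB BF and otherwise behaves like 'utf-8',
    # so it is safe to use whether or not the page starts with a BOM.
    return raw.decode('utf-8-sig')


with_bom = b'\xef\xbb\xbfABC'   # hypothetical page body that starts with a BOM
without_bom = b'ABC'            # the same body without a BOM

print(repr(decode_page(with_bom)))      # 'ABC'
print(repr(decode_page(without_bom)))   # 'ABC'
print(repr(with_bom.decode('utf-8')))   # '\ufeffABC' (plain utf-8 keeps the BOM)
```

If the page is actually UTF-16, the same idea applies with the `utf-16` codec, which consumes the BOM while working out the byte order.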
Answer by swstephe
That character is the BOM, or "Byte Order Mark". It usually arrives as the first few bytes of a file, telling you how to interpret the encoding of the rest of the data. You can simply remove the character to continue. However, since the error says you were trying to convert to 'ascii', you should probably pick another encoding for whatever you were trying to do.
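Both suggestions can be sketched in Python 3 as follows. The helper name `strip_bom` and the sample string are assumptions made for illustration; this only handles a BOM at the very start of text that has already been decoded.

```python
BOM = '\ufeff'


def strip_bom(text: str) -> str:
    """Remove a single leading U+FEFF from already-decoded text."""
    return text[len(BOM):] if text.startswith(BOM) else text


scraped = '\ufeffABC'          # hypothetical scraped text with a stray BOM

# Option 1: drop the BOM, after which pure-ASCII text encodes without error.
cleaned = strip_bom(scraped)
ascii_bytes = cleaned.encode('ascii')

# Option 2: pick an encoding that can represent U+FEFF, such as UTF-8.
utf8_bytes = scraped.encode('utf-8')

print(repr(cleaned), ascii_bytes, utf8_bytes)
```

Dropping the BOM only helps while the rest of the text is ASCII; for general web content, encoding to UTF-8 is the more robust choice.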
Answer by theodox
The content you're scraping is encoded in Unicode rather than ASCII text, and you're getting a character that doesn't convert to ASCII. The right 'translation' depends on what the original web page thought it was. Python's Unicode page gives the background on how it works.
Are you trying to print the result or write it to a file? The error suggests that writing the data is causing the problem, not reading it. This question is a good place to look for fixes.
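To illustrate that the failure happens on the writing side, here is a small Python 3 sketch that pushes the same text through an in-memory text stream with two different encodings. The helper `write_text` is hypothetical, and the in-memory buffer merely stands in for a file or an ASCII-only console.

```python
import io


def write_text(text: str, encoding: str) -> bytes:
    """Write text through a text stream using `encoding`; return the bytes produced."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    stream.write(text)      # encoding happens here, on the way out
    stream.flush()
    data = buffer.getvalue()
    stream.detach()         # release the buffer without closing it
    return data


text = '\ufeffABC'

try:
    write_text(text, 'ascii')
except UnicodeEncodeError as exc:
    print('ascii stream failed:', exc)

print(write_text(text, 'utf-8'))
```

When writing to a real file, the equivalent fix is to pass an explicit `encoding` argument to `open()` rather than relying on the default.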
Answer by Jagdish Chauhan
This problem basically arises when you save your Python code in a UTF-8 or UTF-16 encoding, because a special character is automatically added at the beginning of the file (text editors do not show it) to identify the encoding format. When you try to execute the code, you get a syntax error on line 1, i.e. at the start of the code, because the Python compiler expects ASCII encoding. When you view the file's contents using the read() function, you can see '\ufeff' at the beginning of the returned text. The simplest solution is to change the encoding back to ASCII: copy your code into a notepad and save it, remembering to choose the ASCII encoding. Hope this helps.
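Since editors usually hide the BOM, one way to check for it is to look at the file's first bytes. The sketch below uses the BOM constants from the standard `codecs` module; the function name `sniff_bom` is made up for illustration, and it deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

# BOM byte sequences paired with a codec that consumes them when decoding.
_BOMS = (
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
)


def sniff_bom(data: bytes):
    """Return a codec name if `data` starts with a UTF-8/UTF-16 BOM, else None."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    return None


print(sniff_bom(b'\xef\xbb\xbfprint("hi")'))   # utf-8-sig
print(sniff_bom(b'print("hi")'))               # None
```

In practice `data` would be the result of reading the source file in binary mode, for example `open(path, 'rb').read(4)`.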
Answer by siebz0r
I ran into this on Python 3 and found this question (and solution). When opening a file, Python 3 supports the encoding keyword to automatically handle the encoding.
Without it, the BOM is included in the read result:
>>> f = open('file', mode='r')
>>> f.read()
'\ufefftest'
With the correct encoding given, the BOM is omitted from the result:
>>> f = open('file', mode='r', encoding='utf-8-sig')
>>> f.read()
'test'
Just my 2 cents.
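The session above depends on a file that already exists and on the platform's default encoding. A self-contained Python 3 sketch of the same behaviour, using a temporary file (the file name `file.txt` and its contents are assumptions for illustration), might look like this:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'file.txt')

    # Writing with 'utf-8-sig' puts the EF BB BF signature at the start of the file.
    with open(path, 'w', encoding='utf-8-sig') as f:
        f.write('test')

    with open(path, 'rb') as f:
        raw = f.read()                      # b'\xef\xbb\xbftest'

    # Plain 'utf-8' keeps the BOM as a U+FEFF character in the text.
    with open(path, 'r', encoding='utf-8') as f:
        as_utf8 = f.read()                  # '\ufefftest'

    # 'utf-8-sig' strips it.
    with open(path, 'r', encoding='utf-8-sig') as f:
        as_utf8_sig = f.read()              # 'test'

print(raw, repr(as_utf8), repr(as_utf8_sig))
```

Passing `encoding` explicitly in every `open()` call keeps the result independent of the locale's default encoding.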
Answer by caot
This is based on the answer from Mark Tolonen. The string includes the word 'test' in different languages, separated by '|', so you can see the difference.
u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
e8 = u.encode('utf-8') # encode without BOM
e8s = u.encode('utf-8-sig') # encode with BOM
e16 = u.encode('utf-16') # encode with BOM
e16le = u.encode('utf-16le') # encode without BOM
e16be = u.encode('utf-16be') # encode without BOM
print('utf-8 %r' % e8)
print('utf-8-sig %r' % e8s)
print('utf-16 %r' % e16)
print('utf-16le %r' % e16le)
print('utf-16be %r' % e16be)
print()
print('utf-8 w/ BOM decoded with utf-8 %r' % e8s.decode('utf-8'))
print('utf-8 w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
print('utf-16 w/ BOM decoded with utf-16 %r' % e16.decode('utf-16'))
print('utf-16 w/ BOM decoded with utf-16le %r' % e16.decode('utf-16le'))
Here is a test run:
这是一个测试运行:
>>> u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> e8 = u.encode('utf-8') # encode without BOM
>>> e8s = u.encode('utf-8-sig') # encode with BOM
>>> e16 = u.encode('utf-16') # encode with BOM
>>> e16le = u.encode('utf-16le') # encode without BOM
>>> e16be = u.encode('utf-16be') # encode without BOM
>>> print('utf-8 %r' % e8)
utf-8 b'ABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-8-sig %r' % e8s)
utf-8-sig b'\xef\xbb\xbfABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-16 %r' % e16)
utf-16 b"\xff\xfeA\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16le %r' % e16le)
utf-16le b"A\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16be %r' % e16be)
utf-16be b"\x00A\x00B\x00C\x00t\x00e\x00s\x00t\x03\xb2\x8c\x9dXT\xc7\x04\x00m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x06'\x06.\x06*\x06(\x06'\x061\x00|mK\x8b\xd5\x00|n,\x8af\x00|0\xc60\xb90\xc8\x00|\t*\t0\t@\t\x15\tM\t7\t>\x00|\r*\r0\r?\r6\rK\r'\r(\x00|\x05\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x00|\x00k\x00i\x1e\xc3\x00m\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|"
>>> print()
>>> print('utf-8 w/ BOM decoded with utf-8 %r' % e8s.decode('utf-8'))
utf-8 w/ BOM decoded with utf-8 '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-8 w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
utf-8 w/ BOM decoded with utf-8-sig 'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16 %r' % e16.decode('utf-16'))
utf-16 w/ BOM decoded with utf-16 'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16le %r' % e16.decode('utf-16le'))
utf-16 w/ BOM decoded with utf-16le '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
It's worth knowing that, of the decodes shown, only utf-8-sig and utf-16 give back the original string from BOM-prefixed data; utf-8 and utf-16le leave the BOM in the result as '\ufeff'.