Python: split function adds \xef\xbb\xbf...\n to my list

Disclaimer: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/18664712/

Time: 2020-08-19 11:22:03  Source: igfitidea

python, split

Asked by Michael

I want to open my file.txt and split all the data in this file.

Here is my file.txt:

some_data1 some_data2 some_data3 some_data4 some_data5

And here is my Python code:

>>> file_txt = open("file.txt", 'r')
>>> data = file_txt.read()
>>> data_list = data.split(' ')
>>> print data
some_data1 some_data2 some_data3 some_data4 some_data5
>>> print data_list
['\xef\xbb\xbfsome_data1', 'some_data2', 'some_data3', 'some_data4', 'some_data5\n']

As you can see, when I print data_list, the list contains \xef\xbb\xbf at the start and \n at the end. What are these, and how can I remove them from my list?

Thanks.

Accepted answer by warvariuc

Your file contains a UTF-8 BOM at the beginning.

To get rid of it, first decode your file contents to Unicode:

# Python 2: read() returns a byte string, so decode it explicitly.
fp = open("file.txt")
data = fp.read().decode("utf-8-sig").encode("utf-8")

It is better, though, not to encode it back to UTF-8 but to keep working with the Unicode text. A good rule: decode all input text to Unicode as early as possible, work only with Unicode, and encode output to the required encoding as late as possible. This will save you many headaches.
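The rule above can be sketched as follows. This is a minimal illustration that assumes Python 3 (the answer's own snippet is Python 2), where open() accepts an encoding argument and returns text directly. The temporary file, its contents, and the names path, out_path, and shouted are made up for the example.

```python
import os
import tempfile

# Assumption: Python 3. Hypothetical sample file starting with a UTF-8 BOM.
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "wb") as fp:
    fp.write(b"\xef\xbb\xbfsome_data1 some_data2 some_data3 some_data4 some_data5\n")

# Decode on input: the "utf-8-sig" codec drops a leading BOM if one is present.
with open(path, "r", encoding="utf-8-sig") as fp:
    text = fp.read()

# Work only with text (str) in between; upper() stands in for real processing.
shouted = text.upper()

# Encode on output, as late as possible; plain "utf-8" writes no BOM.
out_path = path + ".out"
with open(out_path, "w", encoding="utf-8") as fp:
    fp.write(shouted)
```

Only the two open() calls at the edges know about encodings here; everything between them deals with decoded text.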

To read bigger files in a certain encoding, use io.open or codecs.open.
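For a larger file, one possible approach is to iterate over the file object so that only one decoded line is held at a time. The sketch below assumes Python 3, where io.open is the same function as the built-in open(). The file name big.txt and its contents are invented for the illustration.

```python
import io
import os
import tempfile

# Assumption: Python 3. Hypothetical multi-line file with a UTF-8 BOM.
path = os.path.join(tempfile.mkdtemp(), "big.txt")
with open(path, "wb") as fp:
    fp.write(b"\xef\xbb\xbf" + b"some_data1 some_data2\n" * 3)

words = []
# Iterating over the file yields one decoded line at a time,
# so the whole file never has to be read into memory at once.
with io.open(path, "r", encoding="utf-8-sig") as fp:
    for line in fp:
        words.extend(line.split())
```

Here words still grows with the file; in real code each line would typically be processed and discarded inside the loop.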

Also check this.

Use str.strip() or str.rstrip() to get rid of the newline character \n.
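A short sketch of the stripping step, assuming the BOM has already been removed and using the sample line from the question. Strings are immutable, so the result of rstrip() has to be assigned. The variable names are illustrative only.

```python
data = "some_data1 some_data2 some_data3 some_data4 some_data5\n"

# rstrip("\n") returns a new string without trailing newlines.
trimmed = data.rstrip("\n")
data_list = trimmed.split(" ")

# Alternative: split() with no argument splits on any run of whitespace
# and ignores leading/trailing whitespace, so no separate strip is needed.
same_list = data.split()
```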

Answer by thegrinner

\xef\xbb\xbf is the byte order mark (BOM) for UTF-8. The \x is an escape sequence indicating that the next two characters are hex digits giving the value of one byte.
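To see what those three bytes are, the following sketch (assuming Python 3) compares them with the constant codecs.BOM_UTF8 from the standard library and decodes them with two codecs. The names raw, as_utf8, and as_utf8_sig are illustrative.

```python
import codecs

# Assumption: Python 3, where bytes and str are distinct types.
raw = b"\xef\xbb\xbfsome_data1"

# The standard library exposes the UTF-8 BOM as a constant.
has_bom = raw.startswith(codecs.BOM_UTF8)

# Plain "utf-8" keeps the BOM as the character U+FEFF at the start.
as_utf8 = raw.decode("utf-8")

# "utf-8-sig" drops the BOM while decoding.
as_utf8_sig = raw.decode("utf-8-sig")
```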

\n is a newline character. To remove it, you can use rstrip():

data = data.rstrip()  # rstrip() returns a new string, so assign the result
data_list = data.split(' ')

To remove the byte order mark, you can use io.open (assuming you're using 2.6 or 2.7) to open the file with the utf-8-sig encoding; plain utf-8 would decode the BOM as a leading u'\ufeff' character instead of dropping it. Note that io.open can be a bit slower, as it's implemented in Python. If speed or older versions of Python are necessary, take a look at codecs.open.

Try something like this:

import io

# Make sure we don't lose the list when we close the file
data_list = []

# Use `with` to ensure the file gets cleaned up properly
with io.open('file.txt', 'r', encoding='utf-8-sig') as f:  # utf-8-sig drops the BOM
    data = f.read()  # Be careful when using read() with big files
    data = data.rstrip()  # Chomp the newline character (rstrip() returns a new string)
    data_list = data.split(' ')

print(data_list)

Answer by jeromej

As the others mentioned, you are dealing with a file that contains a UTF-8 BOM at its beginning.

They all tell you how to deal with it or remove it directly.

BUT, if you do happen to have to work with only one static file (or a small static set of them), you may wish to actively remove the BOM altogether so you simply don't have to deal with it.

As a matter of fact, most text editors will allow you to convert from one encoding to another and sometimes UTF-8 and UTF-8 with BOM are listed separately.

The first that comes to mind (there are many) is Notepad++. Simply go to Encoding > Convert to UTF-8 without BOM, save the file, and you are set.
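If no such editor is at hand, the same one-off conversion could be done with a few lines of Python. This is only a sketch, assuming Python 3 and a file small enough to read into memory. The helper strip_utf8_bom and the demo file are hypothetical and not part of any answer above.

```python
import codecs
import os
import tempfile


def strip_utf8_bom(path):
    """Rewrite the file at `path` without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False otherwise.
    """
    with open(path, "rb") as fp:
        raw = fp.read()
    if not raw.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as fp:
        fp.write(raw[len(codecs.BOM_UTF8):])
    return True


# Hypothetical demo file that starts with a BOM.
demo_path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(demo_path, "wb") as fp:
    fp.write(b"\xef\xbb\xbfsome_data1 some_data2\n")

removed = strip_utf8_bom(demo_path)
with open(demo_path, "rb") as fp:
    cleaned = fp.read()
```

The file is opened in binary mode on purpose, so only the three BOM bytes change and the rest of the content is written back byte for byte.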