Python: reading a tab-delimited file with Pandas - works on Windows, but not on Mac

Question

Asked by user3062149

I've been reading a tab-delimited data file on Windows with Pandas/Python without any problems. The file contains notes in its first three lines, followed by a header row.

df = pd.read_csv(myfile, sep='\t', skiprows=(0, 1, 2), header=0)

I'm now trying to read this file on my Mac (my first time using Python on a Mac), and I get the following error:

pandas.parser.CParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 39

If I set the error_bad_lines argument of read_csv to False, I get the following messages, which continue through the last row of the file.

Skipping line 8: expected 1 fields, saw 39
Skipping line 9: expected 1 fields, saw 125
Skipping line 10: expected 1 fields, saw 125
Skipping line 11: expected 1 fields, saw 125
Skipping line 12: expected 1 fields, saw 125
Skipping line 13: expected 1 fields, saw 125
Skipping line 14: expected 1 fields, saw 125
Skipping line 15: expected 1 fields, saw 125
Skipping line 16: expected 1 fields, saw 125
Skipping line 17: expected 1 fields, saw 125
...

Do I need to specify a value for the encoding argument? It seems as though I shouldn't have to, because reading the file works fine on Windows.

Answer 1

Accepted answer by brad sanders

The biggest clue is that the rows are all being returned on one line. This indicates that line terminators are being ignored or are not present.

You can specify the line terminator for read_csv. Files created on a Mac may end each line with \r (the classic Mac OS convention) rather than the Linux standard \n or, better still, the belt-and-suspenders approach of Windows with \r\n.

pandas.read_csv(filename, sep='\t', lineterminator='\r')
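Before choosing a lineterminator value, it can help to check which terminators the file actually contains. The following is only a rough sketch using the standard library: count_line_endings and the sample bytes are hypothetical names and data, not part of pandas, and the byte counting assumes an ASCII-compatible encoding such as UTF-8 (it would be unreliable on UTF-16 data).

```python
def count_line_endings(raw):
    """Count CRLF, lone CR and lone LF terminators in a bytes object."""
    crlf = raw.count(b"\r\n")
    # Subtract the CRLF pairs so that only stand-alone CR / LF bytes remain.
    lone_cr = raw.count(b"\r") - crlf
    lone_lf = raw.count(b"\n") - crlf
    return {"crlf": crlf, "cr": lone_cr, "lf": lone_lf}


# Hypothetical sample standing in for the first bytes of the data file.
sample = b"note 1\rnote 2\rnote 3\rcol_a\tcol_b\r1\t2\r"
print(count_line_endings(sample))
```

For a real file, the bytes could be read with open(filename, 'rb').read(65536). If only the "cr" count is non-zero, lineterminator='\r' is a plausible choice.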

You could also open your data using the codecs package. This may increase robustness at the expense of document loading speed.

import codecs

doc = codecs.open('document','rU','UTF-16') #open for reading with "universal" type set

df = pandas.read_csv(doc, sep='\t')
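On newer Python versions the same idea can be sketched with the built-in open(), since the 'U' mode flag used above was removed in Python 3.11. This is a minimal sketch rather than a definitive fix: open_text_universal and demo are hypothetical helper names, and the UTF-16 encoding is an assumption carried over from the snippet above that must match the real file.

```python
import os
import tempfile


def open_text_universal(path, encoding="utf-16"):
    """Open `path` as text, translating CR, LF and CRLF line endings to LF."""
    # newline=None turns on universal-newline translation when reading.
    return open(path, mode="r", encoding=encoding, newline=None)


def demo():
    """Write a small CR-terminated UTF-16 file and read it back as rows."""
    raw = "x\ty\r1\t2\r3\t4\r".encode("utf-16")
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(raw)
        with open_text_universal(path) as handle:
            # Each line now ends in "\n", whatever terminator the file used.
            return [line.rstrip("\n").split("\t") for line in handle]
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Because read_csv accepts an open file object, the handle returned by open_text_universal could be passed as pandas.read_csv(handle, sep='\t') in place of a filename.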

Answer 2

Answered by user3479780

Another option would be to add engine='python' to the command: pandas.read_csv(filename, sep='\t', engine='python').
