Reading a tab separated file with Python Pandas

Notice: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/44699123/

Time: 2020-09-14 03:51:19  Source: igfitidea

Tags: python, python-2.7, pandas

Asked by Vasilis Vasileiou

I have encountered a problem reading a tab separated file using Pandas.

All the cell values are wrapped in double quotes, but some rows contain an extra double quote that breaks the parsing. For instance:

Column A  Column B  Column C
"foo1"    "121654"  "unit"
"foo2"    "1214"    "unit"
"foo3"    "15884""  

The error I get is: Error tokenizing data. C error: Expected 31 fields in line 8355, saw 58

The code I used is:

csv = pd.read_csv(file, sep='\t', lineterminator='\n', names=None)

It works fine for the rest of the files, but not for the ones where this extra double quote appears.

Accepted answer by Jean-François Fabre

If you cannot change the buggy input, the best way would be to read the input file into an io.StringIO object, replacing the doubled double quotes with single ones, then pass this file-like object to pd.read_csv (it supports both filenames and file-like objects).

That way you don't have to create a temporary file or to alter the input data.

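import io

import pandas as pd

# Read the whole file, collapse each doubled quote ("") into a single quote,
# and wrap the cleaned text in an in-memory file-like object.
with open(file) as f:
    fileobject = io.StringIO(f.read().replace('""', '"'))

csv = pd.read_csv(fileobject, sep='\t', lineterminator='\n', names=None)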
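
To try the idea without touching a real file, here is a minimal, self-contained sketch. The sample text is made up to mirror the rows in the question (the "unit" value in the last row is an assumption, added so that every row has three fields), and the names raw, cleaned and df are illustrative only.

```python
import io

import pandas as pd

# Hypothetical sample mirroring the question: the last data row
# carries a stray extra double quote after "15884".
raw = (
    'Column A\tColumn B\tColumn C\n'
    '"foo1"\t"121654"\t"unit"\n'
    '"foo2"\t"1214"\t"unit"\n'
    '"foo3"\t"15884""\t"unit"\n'
)

# Collapse every doubled quote into a single one, entirely in memory.
cleaned = raw.replace('""', '"')

# pd.read_csv accepts a file-like object in place of a filename.
# dtype=str keeps the values as text so they are easy to inspect.
df = pd.read_csv(io.StringIO(cleaned), sep='\t', dtype=str)

print(df)
```

One caveat worth checking against your own data: a blanket replace of "" also rewrites legitimately empty quoted fields and escaped quotes inside values, so it is only safe when the file contains neither.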

Answer by taras

You can add a preprocessing step to fix the quoting issue:

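# Read the file and collapse each doubled quote ("") into a single quote.
with open(file, 'r') as fp:
    text = fp.read().replace('""', '"')

# Write the cleaned text back; note that this overwrites the original file.
with open(file, 'w') as fp:
    fp.write(text)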
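
If the file may also contain legitimately empty quoted fields (""), a blanket replace would damage them. The sketch below is one possible narrower variant, not part of either answer: it uses a regular expression to collapse a doubled quote only when it closes a non-empty field. The pattern name STRAY_QUOTE and the sample text are assumptions made for illustration.

```python
import io
import re

import pandas as pd

# Match a doubled quote only when it follows a regular character
# (not a tab, newline or quote) and is followed by a tab, a line
# ending or the end of the text. An empty field ("") sits right
# after a tab or at the start of a line, so it is left alone.
STRAY_QUOTE = re.compile(r'(?<=[^\t\n"])""(?=\t|\r?\n|$)')

# Hypothetical sample: a stray quote after "15884" and a genuinely
# empty quoted field in the last column.
raw = (
    'Column A\tColumn B\tColumn C\n'
    '"foo1"\t"121654"\t"unit"\n'
    '"foo3"\t"15884""\t""\n'
)

cleaned = STRAY_QUOTE.sub('"', raw)

# dtype=str and keep_default_na=False keep the empty field as an
# empty string instead of converting it to NaN.
df = pd.read_csv(
    io.StringIO(cleaned), sep='\t', dtype=str, keep_default_na=False
)

print(df)
```

Whether this pattern fits depends on how the stray quotes actually appear in the real file, so it is worth testing on a few of the offending lines first.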