Reading a tab-separated file with Python Pandas

Question

Asked by Vasilis Vasileiou

I have run into a problem reading a tab-separated file with Pandas.

All the cell values are wrapped in double quotes, but in some rows there is an extra double quote that breaks the whole parse. For instance:

Column A  Column B  Column C
"foo1"    "121654"  "unit"
"foo2"    "1214"    "unit"
"foo3"    "15884""

The error I get is: Error tokenizing data. C error: Expected 31 fields in line 8355, saw 58

The code I used is:

csv = pd.read_csv(file, sep='\t', lineterminator='\n', names=None)

and it works fine for the rest of the files but not for the ones where this extra double quotation appears.

Answer 1

Accepted answer by Jean-François Fabre

If you cannot change the buggy input, the best way would be to read the input file into an io.StringIO object, replacing each doubled double quote with a single one, then pass this file-like object to pd.read_csv (it supports both filenames and file-like objects).

That way you don't have to create a temporary file or to alter the input data.

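import io
import pandas as pd

# Read the whole file and collapse each doubled quote ("") into a single one
with open(file) as f:
    fileobject = io.StringIO(f.read().replace('""','"'))

# read_csv accepts the in-memory file-like object in place of a filename
csv = pd.read_csv(fileobject, sep='\t',  lineterminator='\n', names=None)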
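
For anyone who wants to try this without the original file, here is a minimal sketch of the same idea on in-memory data. The sample rows are made up for illustration (every row has three fields, and the last one carries the stray extra quote). It also assumes the data contains no legitimately empty quoted fields, since an empty `""` would be collapsed to a single `"` by the same replace.

```python
import io

import pandas as pd

# Hypothetical sample data: the last row has an extra double quote after 15884.
raw = (
    '"Column A"\t"Column B"\t"Column C"\n'
    '"foo1"\t"121654"\t"unit"\n'
    '"foo2"\t"1214"\t"unit"\n'
    '"foo3"\t"15884""\t"unit"\n'
)

# Collapse each doubled quote into a single one, then parse from memory
# instead of from a file on disk.
cleaned = io.StringIO(raw.replace('""', '"'))
df = pd.read_csv(cleaned, sep='\t')

print(df)
```

With the stray quote removed, all three rows should parse into three columns, and the numeric column should come through as integers.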

Answer 2

Answered by taras

You can add a preprocessing step to fix the quoting issue:

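# Read the file and collapse each doubled quote ("") into a single one
with open(file, 'r') as fp:
    text = fp.read().replace('""', '"')

# Write the cleaned text back, overwriting the original file
with open(file, 'w') as fp:
    fp.write(text)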
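
Since the snippet above overwrites the input in place, a variant that writes the cleaned text to a separate file may be safer when the original has to stay untouched. This is only a sketch: the helper name `write_cleaned_copy` and the file names are hypothetical, and UTF-8 encoding is an assumption about the data.

```python
import tempfile
from pathlib import Path


def write_cleaned_copy(src, dst):
    """Write a copy of src to dst with each doubled quote collapsed to one.

    Hypothetical helper; assumes the file is UTF-8 and fits in memory.
    """
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(text.replace('""', '"'), encoding="utf-8")
    return dst


# Demo in a throwaway directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.tsv"
    dst = Path(tmp) / "cleaned.tsv"
    src.write_text('"foo3"\t"15884""\t"unit"\n', encoding="utf-8")

    write_cleaned_copy(src, dst)

    original_after = src.read_text(encoding="utf-8")  # left unchanged
    cleaned = dst.read_text(encoding="utf-8")         # stray quote removed
```

The cleaned copy can then be passed to pd.read_csv as usual, and the original stays available if the replace turns out to be too aggressive.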
